$Id: Release-Notes-1.3.txt,v 1.14 1995/09/05 21:00:00 duane Exp $
TABLE OF CONTENTS
1. Gatherer
        IP-based filtering
        Username/Passwords
        Post-Summarizing
        Cache directory cleanup
        Limit on retrieval size
        Support for HTML-3.0, Netscape, and HotJava DTDs
2. Broker
        Brokers.cf
        Glimpse 3.0
        Verity/Topic
        WAIS, Inc.
        Displaying SOIF attributes in results
        Uniqify duplicate objects
        Glimpse inline queries
3. Cache
        Persistent disk cache
        Common logfile format
        Improved Internet protocol support
        TTL calculation by regexp
        Improved customizability
        Security
        Performance Enhancements
        Portability
        Optional Code
4. Miscellaneous
        Admin scripts
========================================================================
GATHERER
IP-based filtering
------------------
It is now possible to use an IP network address in a Host
filter file. The IP address is matched using regular
expressions. This means that periods must be escaped. For
example:
    Allow 128\.196\..*
    Deny .*
Username/Passwords
------------------
It is now possible to gather password-protected documents from
HTTP and FTP servers. In both cases, it is possible to specify
a username and password as a part of the URL. The format is
    ftp://user:password@host:port/url-path
    http://user:password@host:port/url-path
With this format, the "user:password" part is kept as part
of the URL string throughout Harvest. This may enable
anyone who uses your Broker(s) to access password-protected
pages.
It is also possible to have "hidden" username and password
information. These are specified in the gatherer.cf file.
For HTTP, the format is
    HTTP-Basic-Auth: realm username password
'realm' is the same as the 'AuthName' parameter given in an
NCSA .htaccess file. In the CERN HTTP configuration, the
realm value is called 'ServerId.'
For FTP, the format in the gatherer.cf file is
    FTP-Auth: hostname[:port] username password
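For example, a gatherer.cf might contain lines like these (the
realm, hostname, usernames, and passwords shown here are only
placeholders):
    HTTP-Basic-Auth: StaffOnly webuser w3bpass
    FTP-Auth: ftp.example.com:21 harvest gatherpass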
Post-Summarizing
----------------
It is now possible to "fine-tune" the summary information
generated by the Essence summarizers. A typical application of
this would be to change the 'Time-to-live' attribute based on
some knowledge about the objects. So an administrator could
use the post-summarizing feature to give quickly-changing
objects a lower TTL, and very stable documents a higher TTL.
Objects are selected for post-processing if they meet a
specified condition. A condition consists of three parts: An
attribute name, an operation, and some string data. For
example:
    city == 'New York'
In this case we are checking if the 'city' attribute is equal to
the string 'New York'. For exact string matching, the string
data must be enclosed in single quotes. Regular expressions
are also supported:
    city ~ /New York/
Negative operators are also supported:
    city != 'New York'
    city !~ /New York/
Conditions can be joined with '&&' (logical and) or '||' (logical or)
operators:
    city == 'New York' && state != 'NY'
When all conditions are met for an object, some number of
instructions are executed on it. There are four types of
instructions which can be specified:
1. Set an attribute exactly to some specific string
Example:
    time-to-live = "86400"
2. Filter an attribute through some program. The attribute
value is given as input to the filter. The output of the
filter becomes the new attribute value.
Example:
    keywords | tr A-Z a-z
3. Filter multiple attributes through some program. In this
case the filter must read and write attributes in the
SOIF format.
Example:
    address,city,state,zip ! cleanup-address.pl
4. A special case instruction is to delete an object. To do
this, simply write
    delete()
The conditions and instructions are combined in a
"rules" file. The format of this file is somewhat similar to a
Makefile; conditions begin in the first column and instructions
are indented by a tab-stop. Example:
type == 'HTML'
	partial-text | cleanup-html-text.pl
URL ~ /users/
	time-to-live = "86400"
	partial-text ! extract-owner.sh
type == 'SOIFStream'
	delete()
This rules file is specified in the gatherer.cf file with the
Post-Summarizing: tag, e.g.:
    Post-Summarizing: lib/myrules
Cache directory cleanup
-----------------------
The gatherer uses a local disk cache of objects it has
retrieved. These objects are stored in the tmp/cache-liburl
subdirectory. Prior to v1.3 this cache directory was left in
place after the gatherer completed. This caused confusion and
problems when users re-ran the gatherer and expected to see new
or changed objects appear.
Now the default behaviour is to remove the cache-liburl
directory after the gatherer completes successfully. Users who
want to leave this directory in place will need to add
    Keep-Cache: yes
to their gatherer.cf file.
Limit on retrieval size
-----------------------
The code for retrieving FTP, HTTP, and Gopher objects now stops
transferring after 10 megabytes. This is to prevent bogus URLs
from filling up local disk space. This limit can currently
only be changed by modifying the source in src/common/url (look
for "MAX_TRANSFER_SIZE").
Support for HTML-3.0, Netscape, and HotJava DTDs
------------------------------------------------
DTDs for HTML-3.0, Netscape, and HotJava have been added to
the collection in lib/gatherer/sgmls-lib/HTML/. To take advantage
of these DTDs, your HTML pages should begin with the
corresponding DOCTYPE declaration.
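The exact public identifiers come from the catalog shipped in
lib/gatherer/sgmls-lib/HTML/; the declarations below are typical
examples and may differ slightly from the identifiers in your
installation:
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN">
    <!DOCTYPE HTML PUBLIC "-//Netscape Comm. Corp.//DTD HTML//EN">
    <!DOCTYPE HTML PUBLIC "-//Sun Microsystems Corp.//DTD HotJava HTML//EN">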
========================================================================
BROKER
Brokers.cf
----------
Prompted by security concerns, there is a change in the way
that BrokerQuery.pl.cgi connects with a broker. The old method
had the broker hostname and port number passed as CGI
arguments. The new way passes the broker short name instead.
This name is then looked up in the file
$HARVEST_HOME/brokers/Brokers.cf. The CreateBroker program
will add the correct entry to Brokers.cf.
The old method still works for backwards compatibility. With
the new method, the broker name must appear in the Brokers.cf
file. If it does not, the user receives an error message.
The Brokers.cf file may also provide interesting features such as
* quickly relocating brokers to other machines
* using dual brokers for 24hr/day availability
If you change your broker port number (in admin/broker.conf)
then don't forget to change it here as well.
Glimpse 3.0
-----------
Harvest now uses Glimpse 3.0 which includes a number of bugfixes
and performance improvements:
* A new data structure considerably speeds up queries
on large indexes. Typical queries now take less
than one second, even for very large indexes.
* Incremental indexing is now fully supported.
* The on-disk indexing structures have been improved in
several ways. As a result, indexes from previous
versions are incompatible. When upgrading to this
release, you should remove all .glimpse_* files
in the broker directory before restarting the broker (see
the sample commands after this list).
* Glimpse can now handle more than 64k objects in the
broker.
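As mentioned in the upgrade note above, a minimal sequence for
clearing out the old index files might look like this (the
broker directory name is only a placeholder); afterwards,
restart the broker as usual:
    % cd $HARVEST_HOME/brokers/mybroker
    % rm -f .glimpse_*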
Verity/Topic
------------
This release includes support for using Verity Inc.'s Topic
indexing engine with the broker. In order to use Topic with
Harvest, a license must be purchased from Verity (see
http://www.verity.com/).
At this point, Harvest does not make use of all features in
the Topic engine. However, it does include a number of features
that make it attractive:
* Background indexing: the broker will continue to
service requests as new objects are added to the
database.
* Matched lines (or Highlights): lines containing query
terms are displayed with the result set.
* Result set ranking
* Flexible query operations such as proximity, stemming,
and thesaurus.
WAIS, Inc.
----------
This release includes support for using WAIS Inc.'s commercial
WAIS indexing engine with the broker. To use commercial WAIS
with Harvest, a license must be purchased from WAIS Inc. (see
http://www.wais.com/). The WAIS/Harvest combination offers
the following features:
* Structured queries (not available with Free WAIS).
* Incremental indexing
* Result set ranking
* Use of native WAIS operators, e.g. ADJ to find one
word adjacent to another.
Displaying SOIF attributes in results
-------------------------------------
In v1.2 the Broker allowed specific attributes from matched
objects to be returned in the result set. However, there
was no real support for this in BrokerQuery.pl.cgi.
Now it is possible to request SOIF attributes with the use
of HTML FORM facilities. A simple approach is to include
a select list in the query form. For example:
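A form fragment along these lines could be used; the input name
and the attribute list are only illustrative, and the parameter
name that BrokerQuery.pl.cgi expects may differ:
    <SELECT NAME="attribute" MULTIPLE>
    <OPTION> title
    <OPTION> keywords
    <OPTION> time-to-live
    </SELECT>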
In this manner, the user may control which attributes are
displayed. The layout of these attributes in HTML is
controlled by the '' specification in
$HARVEST_HOME/cgi-bin/lib/BrokerQuery.cf.
Uniqify duplicate objects
-------------------------
Occasionally a broker may end up with duplicate entries for
individual URLs. This usually happens when the Gatherer
changes (its description, hostname, or port number). To remedy
this situation, there is a "uniqify" command on the broker
interface. On the admin.html page it is described as "Delete
older objects of duplicate URLs." When two objects with the
same URL are found, the object with the least-recent timestamp
is removed.
Glimpse inline queries
----------------------
In v1.2 using Glimpse with the broker required the broker to
fork a 'glimpse' process for every query. Now the broker can
make the query directly to the 'glimpseserver'. If glimpseserver
is disabled or not running for some reason, the broker will use
the previous approach and spawn a glimpse process to handle the
query.
========================================================================
CACHE
Persistent disk cache
---------------------
Upon startup the cache now "reloads" cached objects from a
previous session. While this adds some delay at startup,
heavily used sites will benefit, especially where filling
the cache with popular objects is expensive or time-consuming.
To disable the persistent disk cache, add the '-z' flag to
cached's command line. This emulates the previous behaviour,
which is to remove all previously cached objects at startup.
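For example, assuming no other command line options are needed
at your site, the old behaviour can be restored with:
    % cached -z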
Common logfile format
---------------------
The cache now supports the httpd common logfile format which is
used by many HTTP server implementations. This makes the
cache's access logfile compatible with many of the freely
available logfile analyzers. Note that the cache does not
(yet) log the object size for requests which result in a
'TCP_MISS'.
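A typical entry in the common logfile format looks roughly like
this (the client host, timestamp, URL, and byte count are made
up):
    client.example.com - - [05/Sep/1995:14:02:31 -0600] "GET http://www.example.com/ HTTP/1.0" 200 3985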
There have been many improvements to the debugging output
as well.
Improved Internet protocol support
----------------------------------
Numerous improvements and bugfixes have been made to HTTP,
FTP, and Gopher protocol implementations. Additionally,
a user-contributed patch for proxying to WAIS servers has
been included.
TTL calculation by regexp
-------------------------
It is now possible to have the cache calculate time-to-live
values based on URL regular expressions. This would allow
an administrator to set large TTLs for images and lower
TTLs for text, for example.
These are specified in the cached.conf file, beginning with
the tag 'ttl_pattern'. For example:
    ttl_pattern ^http:// 1440 20% 43200
The second field is a POSIX-style regular expression. Invalid
expressions are ignored.
The third value is an absolute time-to-live, given in minutes.
This value is ignored if negative. A zero value indicates that
an object matching the pattern should not be cached. NOTE: the
absolute TTL is used only if the percent-of-age (described
next) is not used.
The fourth value is a percent-of-age factor. If the object is
sent with valid Last-Modification timestamp information, then
the object's TTL is calculated as
    TTL = (current-time - last-modified) * percent-of-age / 100;
If the percent-of-age field is zero, or a last-modification
timestamp is not present, then the algorithm looks at the
absolute TTL value next.
The fifth field is an upper bound on the TTL returned by
the percent-of-age method. It is specified in minutes,
with the default being 30 days. This is provided in case a
buggy server implementation returns ridiculous last-modification
data.
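For example, to implement the image/text policy mentioned above,
one might use rules like these (the values are arbitrary and
only meant to illustrate the format):
    ttl_pattern \.gif$ 14400 0% 43200
    ttl_pattern \.html$ 1440 20% 43200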
Improved customizability
------------------------
More options have been added to the cache configuration file:
* String-based stoplist to deny caching of objects
which contain the stoplist string (e.g.: "cgi-bin").
* Support for "quick aborting." When the client drops
a connection, the cache will abort the data transfer
immediately. Useful for caches behind SLIP/PPP
connections.
* The number of DNS lookup servers is now configurable.
The default is three.
* The trace mail message sent to cs.colorado.edu
(containing only the IP address and port number of
your cache) can now be turned off.
Security
--------
IP-based access controls are now supported. The administrator
may deny access to specific IP networks/hosts, or may only
allow access from specific networks/hosts. Two access control
lists are maintained: one for clients/browsers using the cache
(the "ascii port") and another for the remote instrumentation
interface (cache manager).
Performance Enhancements
------------------------
Several performance enhancements have been made to the cache:
* The LRU replacement algorithm is faster and more efficient.
In conjunction with the new LRU replacement policy, the
default low water mark has been changed
from 80% to 60%.
* The in-memory usage (metadata) of cached objects has
been reduced to 80-100 bytes per object.
* The retrieval of various statistics from the instrumentation
interface is much faster.
* User-configurable garbage collection reduces the number
of times these more expensive operations are performed.
* Memory management has been cleaned up and overall memory
usage reduced. Our checks with Purify report no memory leaks.
Portability
-----------
The TCL libraries are no longer needed to compile the cache.
User-contributed patches have been incorporated for better
support on BSD, Linux, IRIX, and HP-UX systems.
Optional Code
-------------
The following are recent additions to the code. They can be
optionally included by setting '-D' flags in the Makefile.
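For example, assuming the Makefile collects preprocessor options
in a variable such as the one below (the variable name is only a
guess; check the cache Makefile for the one actually used), the
line might look like:
    DEFINES = -DCHECK_LOCAL_NETS -DUSE_WAIS_RELAY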
CHECK_LOCAL_NETS
Define this to optimize retrievals from servers on your
local network. If your cache is configured with a parent,
objects from your local servers may be pulled through the
parent cache. To always retrieve local objects directly
define CHECK_LOCAL_NETS and rebuild the source code. Then
add your local IP network addresses to the cache configuration
file with the 'local_ip' directive. For example:
    local_ip 128.138.0.0
    local_ip 192.54.50.0
LOG_FQDN
Client IP addresses are logged in the access log file. To
log the fully qualified domain name instead, define LOG_FQDN
and rebuild the code. WARNING: This is not implemented
efficiently and may adversely affect your cache performance.
Before each line is written to the access log file, a call
to gethostbyaddr(3) is made. This library call may block
an arbitrary amount of time while waiting for a reply from
a DNS server. While this function blocks, the cache will
not be able to process any other requests. You have been warned.
APPEND_DOMAIN
Define this and use the 'append_domain' configuration directive
to append a domainname to hostnames without any domain
information.
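For example, with a line like the following (the domain is a
placeholder), a request for the bare hostname 'ftp' would be
looked up as 'ftp.example.com':
    append_domain .example.com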
USE_WAIS_RELAY
Define this and use the `wais_relay' configuration directive
to allow WAIS queries to be cached and proxied.
========================================================================
MISCELLANEOUS
Admin scripts
-------------
A number of sample scripts are provided to aid in administering
your Harvest installation:
RunGatherers.sh: This script can be run from your ``/etc/rc''
scripts to start the Harvest gatherer daemons at boot time.
It must be customized with the directory names of your gatherers.
It is installed in $HARVEST_HOME/lib/gatherer.
RunBrokers.sh: This script can be run from your ``/etc/rc''
scripts to start the Harvest brokers at boot time.
It must be customized with the directory names of your brokers.
It is installed in $HARVEST_HOME/lib/broker.
harvest-check.pl: This Perl script is designed to be run
occasionally as a cron(1) job. It will contact your gatherers
and brokers and report on any which seem to be unreachable.
The list of gatherers and brokers to contact can be specified
at the end of the script.
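For example, a crontab entry along these lines would run the
check once an hour (the installed path is an assumption; point
it at wherever harvest-check.pl lives in your installation):
    0 * * * * /usr/local/harvest/lib/harvest-check.pl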