Most current Internet information systems suffer three types of problems: they gather and distribute indexing information inefficiently, they offer only a single kind of index regardless of the peculiarities of each information collection, and they develop bottlenecks at popular data and servers.
Harvest addresses each of these problems: it provides a very efficient means of gathering and distributing indexing information; it supports the easy construction of many different types of indexes, customized to suit the peculiarities of each information collection; and it provides caching and replication support to alleviate bottlenecks.
We discuss these three issues below. A table of contents for each of the subsystems (Broker, Cache, Gatherer, Replicator) is also available.
Harvest's gathering efficiency derives from a combination of optimized gathering software and a flexible scheme for sharing gathered information among indexes that need it. A Harvest Gatherer collects indexing information, while a Broker provides an incrementally indexed query interface to the gathered information. Gatherers and Brokers communicate using an attribute-value stream protocol called the Summary Object Interchange Format (SOIF), and can be arranged in various ways to achieve flexible and efficient use of the network and servers.
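To make the exchange concrete, the following is a small Python sketch of what emitting a SOIF-style record might look like: a template type, the object's URL, and attribute-value pairs framed with byte counts. The attribute names and the helper function are illustrative and are not part of the Harvest distribution.

```python
# Illustrative sketch of a SOIF-style record: a template type, the object's
# URL, and attribute-value pairs written as "Name{byte-count}: value".
# The attribute names below are examples, not a normative list.
def soif_record(template, url, attributes):
    lines = ["@%s { %s" % (template, url)]
    for name, value in attributes.items():
        data = value.encode("utf-8")
        lines.append("%s{%d}:\t%s" % (name, len(data), value))
    lines.append("}")
    return "\n".join(lines)

print(soif_record("FILE", "http://example.org/paper.ps", {
    "title": "Harvest overview",
    "author": "A. Researcher",
}))
```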
In comparison with previous systems, our measurements indicate that Harvest can reduce FTP/HTTP/Gopher server load by a factor of 4 when extracting indexing information and by a factor of 6,600 when delivering that information to remote indexers; reduce network traffic by a factor of 59; and reduce index space requirements by a factor of 43.
Harvest's flexibility for constructing many types of indexes derives from its customizable subsystems, which allow index builders to control how information is extracted, indexed, searched, and accessed. A summary of customizations is available here.
To alleviate bottlenecks that arise for accessing popular data and servers, Harvest replicates indexes and caches retrieved objects. The Replication subsystem can also be used to divide the gathering process among many servers (e.g., letting one server index each U.S. regional network), distributing the partial updates among the replicas.
Harvest consists of five subsystems: the Gatherer, the Broker, the Index/Search Subsystem, the Replicator, and the Object Cache.
These subsystems and their corresponding measurements are discussed in more detail in a paper about Harvest.
The Gatherer provides an efficient and customizable way to collect indexing information. We discuss efficiency and customizability in turn below.
Most current indexing systems generate excessive load on remote sites because they retrieve indexing information via Gopher/HTTP/FTP, which requires heavyweight operations such as forking a separate process to retrieve each object. Current indexing systems also cause excess network traffic, because they retrieve entire objects and then discard most of the information once it reaches the indexing site (for example, retaining only the HTML anchors in an index).
The Harvest Gatherer addresses these inefficiencies through Provider site-resident software optimized for indexing. The Gatherer scans objects periodically, maintains a cache of indexing information, and allows a Provider's indexing information to be retrieved in a single stream (rather than requiring separate requests for each object). It addresses network transmission inefficiencies by providing remote access to pre-filtered, incrementally updatable indexing information, and transmitting this information in a compressed stream. These features reduce server and network load by several orders of magnitude each when building indexes.
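As a rough illustration of this delivery model, the sketch below exports only the summaries that have changed since an indexer's last visit, as a single compressed stream. The record layout (the "soif" and "updated" fields) is assumed for the example and is not Harvest's actual on-disk format.

```python
import gzip

# Sketch of the delivery idea: summaries are kept pre-extracted at the
# Provider site, and a remote indexer retrieves them as one compressed
# stream, optionally limited to records changed since its last visit.
def export_stream(summaries, since=0):
    changed = [s["soif"] for s in summaries if s["updated"] > since]
    return gzip.compress("\n".join(changed).encode("utf-8"))
```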
The Gatherer addresses flexibility through the Essence customizable information extraction system. Essence can unnest presentation-layer encodings (such as compressed "tar" files) into their constituent files, recognize the type of each file, and extract information in different ways depending on the file type. For example, it can find author and title lines in LaTeX documents, and symbols in object code. More importantly, it allows users to customize the type recognition, unnesting, selection, and extraction phases for use in particular situations.
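The sketch below shows the general shape of such per-type extraction in Python. The type table, suffix-based recognition, and summarizer functions are hypothetical simplifications; real Essence summarizers are configurable, site-installable programs.

```python
import re

# Hypothetical per-type summarizers standing in for Essence's extraction rules.
def summarize_latex(contents):
    match = re.search(r"\\title\{([^}]*)\}", contents)
    return {"title": match.group(1)} if match else {}

def summarize_object_code(contents):
    return {"note": "would list exported symbols here"}

# Type recognition here is by file suffix alone; Essence can also recognize
# types from file contents, and every phase is customizable.
SUMMARIZERS = {".tex": summarize_latex, ".o": summarize_object_code}

def extract(filename, contents):
    for suffix, summarizer in SUMMARIZERS.items():
        if filename.endswith(suffix):
            return summarizer(contents)
    return {}

print(extract("paper.tex", r"\title{An Example Paper}"))
```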
The Broker provides an indexed query interface to gathered information. Brokers retrieve information from one or more Gatherers or other Brokers, and incrementally update their indexes. The Broker records a unique identifier and a time-to-live for each indexed object, garbage collects old information, and invokes the Index/Search Subsystem when it receives an update or query.
The Broker can collect objects directly from another Broker using a bulk transfer protocol. It also supports a remote administration interface.
Harvest provides a distinguished Broker instance called the Harvest Server Registry (HSR), which registers information about each Harvest Gatherer, Broker, Cache, and Replicator in the Internet. The HSR is useful when constructing new Gatherers and Brokers, to avoid duplication of effort. It can also be used when looking for an appropriate Broker at search time, and to locate Caches and Replicators.
To accommodate diverse indexing and searching needs, Harvest defines a general Broker-Indexer interface that can accommodate a variety of search engines. The principal requirements are that the backend support Boolean combinations of attribute-based queries, and that it support incremental updates. One can therefore use a variety of different backends inside a Broker. We currently support commercial and free WAIS, Glimpse, and Nebula. Other possible future engines include Ingres and an audio/image/video search system.
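The sketch below shows one way such a backend interface might look. The class and method names are ours, not Harvest's actual API; they simply capture the two stated requirements of incremental update and Boolean attribute-based queries.

```python
# Illustrative backend interface; names are invented for the example.
class SearchBackend:
    def insert(self, object_id, attributes):   # incremental add or update
        raise NotImplementedError
    def delete(self, object_id):                # incremental removal
        raise NotImplementedError
    def query(self, conjunction):               # Boolean attribute-based query
        raise NotImplementedError

class InMemoryBackend(SearchBackend):
    """Toy backend: stores attribute dictionaries and answers conjunctions
    of attribute equalities."""
    def __init__(self):
        self.objects = {}
    def insert(self, object_id, attributes):
        self.objects[object_id] = attributes
    def delete(self, object_id):
        self.objects.pop(object_id, None)
    def query(self, conjunction):
        return [oid for oid, attrs in self.objects.items()
                if all(attrs.get(k) == v for k, v in conjunction)]
```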
We have developed two particular Index/Search subsystems for Harvest, each optimized for different uses. Glimpse supports space-efficient indexes and flexible interactive queries, while Nebula supports fast searches and complex standing queries. At present, only Glimpse is used in the Harvest demonstration Brokers and distributed with the Harvest source distribution. However, you can learn more about Nebula here.
Glimpse uses pointers to adjustable-sized occurrence blocks, achieving very space-efficient indexes (typically 2-4% of the size of the data being indexed, compared with 100% in the case of WAIS). It also supports fast and incremental indexing. Moreover, it supports Boolean, regular expression, and approximate queries.
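The following Python sketch conveys the block-oriented idea (not Glimpse's actual data structures): the index records only which blocks contain a word, and queries re-scan just those candidate blocks. The block size and word-splitting rule are arbitrary choices for illustration.

```python
# Simplified block-level index: postings name fixed-size blocks of the data
# rather than exact positions, so the index stays small; matching blocks are
# re-scanned at query time.
BLOCK_SIZE = 4096

def build_index(text):
    index = {}
    for start in range(0, len(text), BLOCK_SIZE):
        block = text[start:start + BLOCK_SIZE]
        for word in set(block.lower().split()):
            index.setdefault(word, set()).add(start)
    return index

def lookup(text, index, word):
    return [start for start in sorted(index.get(word.lower(), ()))
            if word.lower() in text[start:start + BLOCK_SIZE].lower()]
```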
With Nebula, each object is represented as a set of attribute/value pairs. There is a separate index (and possibly a separate indexing mechanism) per attribute tag. Nebula also supports a notion of views, defined as standing queries against the object base. This allows information to be filtered based on query predicates; the predicates are easy to refine and extend over time, and the resulting changes can be observed interactively.
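A minimal sketch of the view idea, with invented attribute names, might look as follows: a predicate over attribute/value objects is re-evaluated as new objects arrive, so the view's membership tracks the object base over time.

```python
# Illustrative "view" as a standing query over attribute/value objects.
class View:
    def __init__(self, predicate):
        self.predicate = predicate
        self.members = []

    def observe(self, obj):
        if self.predicate(obj):
            self.members.append(obj)

tech_reports = View(lambda o: o.get("type") == "report"
                    and "caching" in o.get("keywords", ""))
tech_reports.observe({"type": "report", "keywords": "caching, replication",
                      "url": "http://example.org/tr94.ps"})
print(tech_reports.members)
```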
Harvest provides a weakly consistent, replicated wide-area file system called mirror-d, on top of which Brokers are replicated. Mirror-d itself is layered atop a hierarchical, flooding update-based group communication subsystem called flood-d.
Each mirror-d instance in a replication group occasionally floods complete state information to its immediate neighbors, to detect updates that flood-d failed to deliver, for example because of a long-lasting network partition, a site failure, or the failure of a flood-d process. Mirror-d implements eventual consistency: if new updates ceased, all replicas would eventually converge.
Flood-d logically floods objects from group member to group member along a graph managed by flood-d itself. Each flood-d instance measures the end-to-end network bandwidth achievable between itself and other flood-daemons running at other group member sites. A master site within each group constructs and reliably distributes either a two-connected or a three-connected, low diameter, logical topology of the group members.
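As a rough illustration of that goal (and not flood-d's actual algorithm), the sketch below builds a two-connected, low-diameter topology by starting from a ring over the group members, which is already two-connected, and then adding a few of the highest-bandwidth chords to shrink the diameter.

```python
import itertools

# Toy construction of a 2-connected, low-diameter logical topology from
# pairwise bandwidth estimates.  Illustration only.
def build_topology(members, bandwidth, extra_chords=2):
    ring = {frozenset((members[i], members[(i + 1) % len(members)]))
            for i in range(len(members))}
    chords = sorted((frozenset(p) for p in itertools.combinations(members, 2)
                     if frozenset(p) not in ring),
                    key=lambda e: -bandwidth.get(tuple(sorted(e)), 0))
    return ring | set(chords[:extra_chords])

links = build_topology(["a", "b", "c", "d", "e"],
                       {("a", "c"): 80, ("b", "d"): 60, ("a", "d"): 10})
```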
A flood-daemon can belong to many groups, making it possible to construct hierarchies of groups and to stitch otherwise independent groups together by sharing two or three common members.
More information about the Harvest Replication subsystem is available here.
To help meet ever-increasing demand on network links and information servers, Harvest includes a hierarchical Object Cache.
The Cache sends "query" datagrams to each neighbor and parent, plus an ICMP echo to the object's home site, and chooses the fastest-responding server from which to retrieve the data. It caches Gopher, HTTP, and FTP objects, plus recent DNS name-to-address mappings. It runs as a single, event-driven process. I/O to disk and to Cache clients is non-blocking, and data begins flowing to Cache clients as soon as the first few bytes of an object fault their way into the Cache. For ease of implementation, the Cache spawns a separate process to retrieve FTP files, but retrieves HTTP and Gopher objects itself. The Cache separately manages the replacement of objects on disk and objects loaded into its virtual address space, and it keeps all metadata for cached objects in virtual memory to eliminate access latency to the metadata. Since the Cache is single-threaded but otherwise non-blocking, page faults are the only source of blocking.
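The selection step can be pictured with the toy sketch below. The probe function is a stand-in for the real query-datagram and ICMP exchange, and all names are ours.

```python
import concurrent.futures

# Toy model: probe each neighbor, parent, and the object's home site
# concurrently, and fetch from whichever answers first.
def pick_fastest(servers, probe, timeout=2.0):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(servers))
    futures = {pool.submit(probe, server): server for server in servers}
    done, _ = concurrent.futures.wait(
        futures, timeout=timeout,
        return_when=concurrent.futures.FIRST_COMPLETED)
    pool.shutdown(wait=False)
    return futures[done.pop()] if done else None
```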
The Cache uses TTL-based coherence. Clients request objects through the proxy-HTTP interface.
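A minimal sketch of TTL-based freshness checking, with invented field names, is shown below: an object is served from the cache only while its time-to-live has not expired, after which it must be re-fetched.

```python
import time

# Minimal TTL-based coherence check.
class CachedObject:
    def __init__(self, data, ttl_seconds):
        self.data = data
        self.expires_at = time.time() + ttl_seconds

    def is_fresh(self):
        return time.time() < self.expires_at
```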
Our measurements indicate that the cache runs approximately twice as fast as the CERN object cache, and the distribution of object retrieval times has a much shorter "tail" than CERN's cache.
Our work on the Object Cache subsystem was motivated in part by a simulation study we performed using NSFNET backbone trace data. You can also look at the Object Cache Home Page.
While the WWW provides a popular environment for sharing multimedia information, it presently provides limited support for more complex data processing. For example, Mosaic cannot easily be used to share and manipulate scientific data. On the server side, a user could create a forms-based interface to a scientific database, but this approach limits interactions to a sequence of selections and textual inputs. Alternatively, a user can define a MIME type for the data and install data type-specific software to handle the data at the client end. However, this approach is not dynamically extensible: each new type of data requires modifying the browser's configuration files and installing local software, and requires general consensus on the names of data types (e.g., PostScript, JPEG).
To address this problem, we defined an object-oriented information access protocol and implemented an associated set of system components. This Harvest Object System (HOS) allows users to type data and associate method code with the data, which is automatically invoked when a user selects a hypertext link to the data. While many object-oriented systems have been built in the past, most were designed to support sophisticated language or system features that restricted their runtime environment. A key design goal of the current effort was that it should be possible for people to deploy legacy data and associated processing software on the Web with a minimum of programming and administrative effort, while retaining the essential power of the object model.
HOS consists of the Harvest Object Protocol, an object browser that is invoked from Mosaic when it encounters a Harvest object, a type/method registration system, and a runtime system that interprets the registered interface description and causes object data and code to be moved across the network as needed. This system allows objects to be executed either remotely (as network servers) or locally (by first fetching the code from the remote site).
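The following is a purely hypothetical sketch of how such a registry and dispatch decision could be organized; none of the names correspond to HOS's actual interfaces, and the code locations are invented.

```python
# Hypothetical type/method registry: each type maps to the location of its
# method code and a policy for where that code runs.
REGISTRY = {
    "climate-dataset": {"code_url": "http://example.org/methods/climate",
                        "run_remotely": True},
}

def invoke(object_type, method_name):
    entry = REGISTRY[object_type]
    if entry["run_remotely"]:
        return "invoke %s via a remote server" % method_name
    return "fetch %s and run %s locally" % (entry["code_url"], method_name)
```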
We will make HOS available around December 1994. We are quite interested in getting users to export their data using this system.
Harvest supports customizations in each of its subsystems: the Gatherer's type recognition, unnesting, selection, and extraction phases (via Essence); the Broker's choice of Index/Search backend (e.g., Glimpse, Nebula, or WAIS); the Replicator's group membership and flooding topology; and the Cache's hierarchy of neighbor and parent caches.