Most current Internet information systems suffer three types of problems: they gather and distribute indexing information inefficiently, they offer only a single kind of index regardless of the peculiarities of each information collection, and they develop bottlenecks at popular data and servers.
Harvest addresses each of these problems: it provides a very efficient means of gathering and distributing indexing information; it supports the easy construction of many different types of indexes, customized to suit the peculiarities of each information collection; and it provides caching and replication support to alleviate bottlenecks.
We discuss these three issues below. A table of contents for each of the subsystems (Broker, Cache, Gatherer, Replicator) is also available.
Harvest's gathering efficiency derives from a combination of optimized gathering software and a flexible scheme for sharing gathered information among indexes that need it. A Harvest Gatherer collects indexing information, while a Broker provides an incrementally indexed query interface to the gathered information. Gatherers and Brokers communicate using an attribute-value stream protocol called the Summary Object Interchange Format (SOIF), and can be arranged in various ways to achieve flexible and efficient use of the network and servers.
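To make the exchange concrete, the following is a small Python sketch of what emitting a SOIF-style record might look like: a template type, the object's URL, and attribute-value pairs framed with byte counts. The attribute names and the helper function are illustrative and are not part of the Harvest distribution.

```python
# Illustrative sketch of a SOIF-style record: a template type, the object's
# URL, and attribute-value pairs written as "Name{byte-count}: value".
# The attribute names below are examples, not a normative list.
def soif_record(template, url, attributes):
    lines = ["@%s { %s" % (template, url)]
    for name, value in attributes.items():
        data = value.encode("utf-8")
        lines.append("%s{%d}:\t%s" % (name, len(data), value))
    lines.append("}")
    return "\n".join(lines)

print(soif_record("FILE", "http://example.org/paper.ps", {
    "title": "Harvest overview",
    "author": "A. Researcher",
}))
```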
In comparison with previous systems, our measurements indicate that Harvest can reduce FTP/HTTP/Gopher server load by a factor of 4 when extracting indexing information and by a factor of 6,600 when delivering that information to remote indexers; reduce network traffic by a factor of 59; and reduce index space requirements by a factor of 43.
Harvest's flexibility for constructing many types of indexes derives from its customizable subsystems, which allow index builders to control how information is extracted, indexed, searched, and accessed. A summary of customizations is available here.
To alleviate bottlenecks that arise for accessing popular data and servers, Harvest replicates indexes and caches retrieved objects. The Replication subsystem can also be used to divide the gathering process among many servers (e.g., letting one server index each U.S. regional network), distributing the partial updates among the replicas.
Harvest consists of five subsystems: the Gatherer, the Broker, the Index/Search Subsystem, the Replicator, and the Object Cache.
These subsystems and their corresponding measurements are discussed in more detail in a paper about Harvest.
The Gatherer provides an efficient and customizable way to collect indexing information. We discuss efficiency and customizability in turn below.
Most current indexing systems generate excessive load on remote sites because they retrieve indexing information via Gopher/HTTP/FTP, which requires heavyweight operations such as forking a separate process to retrieve each object. Current indexing systems also cause excess network traffic, because they retrieve entire objects and then discard most of the information once it reaches the indexing site (for example, retaining only the HTML anchors in an index).
The Harvest Gatherer addresses these inefficiencies through Provider site-resident software optimized for indexing. The Gatherer scans objects periodically, maintains a cache of indexing information, and allows a Provider's indexing information to be retrieved in a single stream (rather than requiring separate requests for each object). It addresses network transmission inefficiencies by providing remote access to pre-filtered, incrementally updatable indexing information, and transmitting this information in a compressed stream. These features reduce server and network load by several orders of magnitude each when building indexes.
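As a rough illustration of this delivery model, the sketch below exports only the summaries that have changed since an indexer's last visit, as a single compressed stream. The record layout (the "soif" and "updated" fields) is assumed for the example and is not Harvest's actual on-disk format.

```python
import gzip

# Sketch of the delivery idea: summaries are kept pre-extracted at the
# Provider site, and a remote indexer retrieves them as one compressed
# stream, optionally limited to records changed since its last visit.
def export_stream(summaries, since=0):
    changed = [s["soif"] for s in summaries if s["updated"] > since]
    return gzip.compress("\n".join(changed).encode("utf-8"))
```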
The Gatherer addresses flexibility through the Essence customizable information extraction system. Essence can unnest presentation-layer encodings (such as compressed "tar" files) into their constituent files, recognize the type of each file, and extract information in different ways depending on the file type. For example, it can find author and title lines in LaTeX documents, and symbols in object code. More importantly, it allows users to customize the type recognition, unnesting, selection, and extraction phases for use in particular situations.
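The sketch below shows the general shape of such per-type extraction in Python. The type table, suffix-based recognition, and summarizer functions are hypothetical simplifications; real Essence summarizers are configurable, site-installable programs.

```python
import re

# Hypothetical per-type summarizers standing in for Essence's extraction rules.
def summarize_latex(contents):
    match = re.search(r"\\title\{([^}]*)\}", contents)
    return {"title": match.group(1)} if match else {}

def summarize_object_code(contents):
    return {"note": "would list exported symbols here"}

# Type recognition here is by file suffix alone; Essence can also recognize
# types from file contents, and every phase is customizable.
SUMMARIZERS = {".tex": summarize_latex, ".o": summarize_object_code}

def extract(filename, contents):
    for suffix, summarizer in SUMMARIZERS.items():
        if filename.endswith(suffix):
            return summarizer(contents)
    return {}

print(extract("paper.tex", r"\title{An Example Paper}"))
```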
The Broker provides an indexed query interface to gathered information. Brokers retrieve information from one or more Gatherers or other Brokers, and incrementally update their indexes. The Broker records a unique identifier and a time-to-live for each indexed object, garbage collects old information, and invokes the Index/Search Subsystem when it receives an update or query.
The Broker can collect objects directly from another Broker using a bulk transfer protocol. It also supports a remote administration interface.
Harvest provides a distinguished Broker instance called the Harvest Server Registry (HSR), which registers information about each Harvest Gatherer, Broker, Cache, and Replicator in the Internet. The HSR is useful when constructing new Gatherers and Brokers, to avoid duplication of effort. It can also be used when looking for an appropriate Broker at search time, and to locate Caches and Replicators.
To accommodate diverse indexing and searching needs, Harvest defines a general Broker-Indexer interface that can accommodate a variety of search engines. The principal requirements are that the backend support Boolean combinations of attribute-based queries, and that it support incremental updates. One can therefore use a variety of different backends inside a Broker. We currently support commercial and free WAIS, Glimpse, and Nebula. Other possible future engines include Ingres and an audio/image/video search system.
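The sketch below shows one way such a backend interface might look. The class and method names are ours, not Harvest's actual API; they simply capture the two stated requirements of incremental update and Boolean attribute-based queries.

```python
# Illustrative backend interface; names are invented for the example.
class SearchBackend:
    def insert(self, object_id, attributes):   # incremental add or update
        raise NotImplementedError
    def delete(self, object_id):                # incremental removal
        raise NotImplementedError
    def query(self, conjunction):               # Boolean attribute-based query
        raise NotImplementedError

class InMemoryBackend(SearchBackend):
    """Toy backend: stores attribute dictionaries and answers conjunctions
    of attribute equalities."""
    def __init__(self):
        self.objects = {}
    def insert(self, object_id, attributes):
        self.objects[object_id] = attributes
    def delete(self, object_id):
        self.objects.pop(object_id, None)
    def query(self, conjunction):
        return [oid for oid, attrs in self.objects.items()
                if all(attrs.get(k) == v for k, v in conjunction)]
```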
We have developed two particular Index/Search subsystems for Harvest, each optimized for different uses. Glimpse supports space-efficient indexes and flexible interactive queries, while Nebula supports fast searches and complex standing queries. At present, only Glimpse is used in the Harvest demonstration Brokers and distributed with the Harvest source distribution. However, you can learn more about Nebula here.
Glimpse uses pointers to adjustable-sized occurrence blocks, achieving very space-efficient indexes (typically 2-4% of the size of the data being indexed, compared with 100% in the case of WAIS). It also supports fast and incremental indexing. Moreover, it supports Boolean, regular expression, and approximate queries.
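The following Python sketch conveys the block-oriented idea (not Glimpse's actual data structures): the index records only which blocks contain a word, and queries re-scan just those candidate blocks. The block size and word-splitting rule are arbitrary choices for illustration.

```python
# Simplified block-level index: postings name fixed-size blocks of the data
# rather than exact positions, so the index stays small; matching blocks are
# re-scanned at query time.
BLOCK_SIZE = 4096

def build_index(text):
    index = {}
    for start in range(0, len(text), BLOCK_SIZE):
        block = text[start:start + BLOCK_SIZE]
        for word in set(block.lower().split()):
            index.setdefault(word, set()).add(start)
    return index

def lookup(text, index, word):
    return [start for start in sorted(index.get(word.lower(), ()))
            if word.lower() in text[start:start + BLOCK_SIZE].lower()]
```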
With Nebula, each object is represented as a set of attribute/value pairs. There is a separate index (and possibly a separate indexing mechanism) per attribute tag. Nebula also supports a notion of views, defined as standing queries against the object base. This allows information to be filtered based on query predicates; the predicates are easy to refine and extend over time, and the resulting changes can be observed interactively.
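A minimal sketch of the view idea, with invented attribute names, might look as follows: a predicate over attribute/value objects is re-evaluated as new objects arrive, so the view's membership tracks the object base over time.

```python
# Illustrative "view" as a standing query over attribute/value objects.
class View:
    def __init__(self, predicate):
        self.predicate = predicate
        self.members = []

    def observe(self, obj):
        if self.predicate(obj):
            self.members.append(obj)

tech_reports = View(lambda o: o.get("type") == "report"
                    and "caching" in o.get("keywords", ""))
tech_reports.observe({"type": "report", "keywords": "caching, replication",
                      "url": "http://example.org/tr94.ps"})
print(tech_reports.members)
```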
Harvest provides a weakly consistent, replicated wide-area file system called mirror-d, on top of which Brokers are replicated. Mirror-d itself is layered atop a hierarchical, flooding update-based group communication subsystem called flood-d.
Each mirror-d instance in a replication group occasionally floods complete state information to its immediate neighbors, to detect updates that flood-d failed to deliver, for example because of a long-lasting network partition, a site failure, or the failure of a flood-d process. Mirror-d implements eventual consistency: if new updates ceased, all replicas would eventually converge.
Flood-d logically floods objects from group member to group member along a graph managed by flood-d itself. Each flood-d instance measures the end-to-end network bandwidth achievable between itself and other flood-daemons running at other group member sites. A master site within each group constructs and reliably distributes either a two-connected or a three-connected, low diameter, logical topology of the group members.
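As a rough illustration of that goal (and not flood-d's actual algorithm), the sketch below builds a two-connected, low-diameter topology by starting from a ring over the group members, which is already two-connected, and then adding a few of the highest-bandwidth chords to shrink the diameter.

```python
import itertools

# Toy construction of a 2-connected, low-diameter logical topology from
# pairwise bandwidth estimates.  Illustration only.
def build_topology(members, bandwidth, extra_chords=2):
    ring = {frozenset((members[i], members[(i + 1) % len(members)]))
            for i in range(len(members))}
    chords = sorted((frozenset(p) for p in itertools.combinations(members, 2)
                     if frozenset(p) not in ring),
                    key=lambda e: -bandwidth.get(tuple(sorted(e)), 0))
    return ring | set(chords[:extra_chords])

links = build_topology(["a", "b", "c", "d", "e"],
                       {("a", "c"): 80, ("b", "d"): 60, ("a", "d"): 10})
```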
A flood-daemon can belong to many groups, making it possible to construct hierarchies of groups and to stitch otherwise independent groups together by sharing two or three common members.
More information about the Harvest Replication subsystem is available here.
To help meet ever-increasing demand on network links and information servers, Harvest includes a hierarchical Object Cache.
The Cache sends "query" datagrams to each neighbor and parent, plus an ICMP echo to the object's home site, and chooses the fastest-responding server from which to retrieve the data. It caches Gopher, HTTP, and FTP objects, plus recent DNS name-to-address mappings. It runs as a single, event-driven process. I/O to disk and to Cache clients is non-blocking, and data begins flowing to Cache clients as soon as the first few bytes of an object fault their way into the Cache. For ease of implementation, the Cache spawns a separate process to retrieve FTP files, but retrieves HTTP and Gopher objects itself. The Cache separately manages the replacement of objects on disk and objects loaded into its virtual address space, and it keeps all metadata for cached objects in virtual memory to eliminate access latency to the metadata. Since the Cache is single-threaded but otherwise non-blocking, page faults are the only source of blocking.
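The selection step can be pictured with the toy sketch below. The probe function is a stand-in for the real query-datagram and ICMP exchange, and all names are ours.

```python
import concurrent.futures

# Toy model: probe each neighbor, parent, and the object's home site
# concurrently, and fetch from whichever answers first.
def pick_fastest(servers, probe, timeout=2.0):
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(servers))
    futures = {pool.submit(probe, server): server for server in servers}
    done, _ = concurrent.futures.wait(
        futures, timeout=timeout,
        return_when=concurrent.futures.FIRST_COMPLETED)
    pool.shutdown(wait=False)
    return futures[done.pop()] if done else None
```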
The Cache uses TTL-based coherence. Clients request objects through the proxy-HTTP interface.
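A minimal sketch of TTL-based freshness checking, with invented field names, is shown below: an object is served from the cache only while its time-to-live has not expired, after which it must be re-fetched.

```python
import time

# Minimal TTL-based coherence check.
class CachedObject:
    def __init__(self, data, ttl_seconds):
        self.data = data
        self.expires_at = time.time() + ttl_seconds

    def is_fresh(self):
        return time.time() < self.expires_at
```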
Our measurements indicate that the cache runs approximately twice as fast as the CERN object cache, and the distribution of object retrieval times has a much shorter "tail" than CERN's cache.
Our work on the Object Cache subsystem was motivated in part by a simulation study we performed using NSFNET backbone trace data. You can also look at the Object Cache Home Page.
While the WWW provides a popular environment for sharing multimedia information, it presently provides limited support for more complex data processing. For example, Mosaic cannot easily be used to share and manipulate scientific data. On the server side, a user could create a forms-based interface to a scientific database, but this approach limits interactions to a sequence of selections and textual inputs. Alternatively, a user can define a MIME type for the data and install data type-specific software to handle the data at the client end. However, this approach is not dynamically extensible: each new type of data requires modifying the browser's configuration files and installing local software, and requires general consensus on the names of data types (e.g., PostScript, JPEG).
To address this problem, we defined an object-oriented information access protocol and implemented an associated set of system components. This Harvest Object System (HOS) allows users to type data and associate method code with the data, which is automatically invoked when a user selects a hypertext link to the data. While many object-oriented systems have been built in the past, most were designed to support sophisticated language or system features that restricted their runtime environment. A key design goal of the current effort was that it should be possible for people to deploy legacy data and associated processing software on the Web with a minimum of programming and administrative effort, while retaining the essential power of the object model.
HOS consists of the Harvest Object Protocol, an object browser that is invoked from Mosaic when it encounters a Harvest object, a type/method registration system, and a runtime system that interprets the registered interface description and causes object data and code to be moved across the network as needed. This system allows objects to be executed either remotely (as network servers) or locally (by first fetching the code from the remote site).
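The following is a purely hypothetical sketch of how such a registry and dispatch decision could be organized; none of the names correspond to HOS's actual interfaces, and the code locations are invented.

```python
# Hypothetical type/method registry: each type maps to the location of its
# method code and a policy for where that code runs.
REGISTRY = {
    "climate-dataset": {"code_url": "http://example.org/methods/climate",
                        "run_remotely": True},
}

def invoke(object_type, method_name):
    entry = REGISTRY[object_type]
    if entry["run_remotely"]:
        return "invoke %s via a remote server" % method_name
    return "fetch %s and run %s locally" % (entry["code_url"], method_name)
```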
We will make HOS available around December 1994. We are quite interested in getting users to export their data using this system.
Harvest supports customizations in each of its subsystems: the Gatherer's type recognition, unnesting, selection, and extraction phases (via Essence); the Broker's choice of Index/Search backend (e.g., Glimpse, Nebula, or WAIS); the Replicator's group membership and flooding topology; and the Cache's hierarchy of neighbor and parent caches.