http://harvest.cs.colorado.edu/harvest/demobrokers.html (World Wide Web Directory, 06/1995)
Harvest Demonstration Brokers
Harvest index servers are called Brokers. Demonstration Brokers
are accessible via the hypertext links below. A
terse list
(skipping the background about what each Broker demonstrates) is also available.
You can also
add URLs to these
Brokers.
-
The
University of Colorado CS Department index lets users search for
documents in many different formats, available through the department's
WWW, FTP, and NetNews servers. It demonstrates how a Broker can be
customized to support queries of interest to a user community, and how a
Broker can be customized to use local site document naming and
organization conventions to provide high-quality indexing terms for
documents. (See the
sample queries for that Broker.)
-
The SEC
EDGAR index lets users search forms that have been filed with the
Securities and Exchange Commission in 1995. It demonstrates how a
Broker can make use of SGML-tagged data to provide a powerful
search service, with query fields based on the tagged data.
-
Query
Access to the AT&T 1-800 Telephone Directory. This Broker was gathered
from AT&T's 800 Web pages, which
currently only support browsing by category or
name. With a modest amount of effort we were able to build:
-
a customized Harvest Gatherer that collected the 4,000+ Web pages that hold
these data and extracted categories according to the particular data format
being used
-
a Broker that lets users browse by category,
search by category, search by business name, and search by telephone
number, including support for misspellings.
In addition to using this Broker, users can retrieve the indexing
data we gathered in a single compressed stream of object
summaries, and construct their own indexes of the data without incurring
the additional server and network load needed to gather the data
themselves. For example, while it took about 10 hours to gather the
data from across the Internet, it takes only a few minutes to retrieve
the compressed summary stream for the same data.
To learn more about doing this, see our help
screen about finding Harvest servers with the Harvest Server
Registry (HSR), and search the HSR for
GATHERER AND "telephone directory"
(note that quotes are significant).
-
WWW Home Pages. Here we used Harvest to create an index of over
21,500 WWW home pages. Because we index content summaries rather than
just anchor and HTML strings, this index captures much of the content of
Web sites without having to collect every last Web page - providing a
useful index at lower cost and much less duplication of information than
that found in the World Wide Web
Worm or Lycos (TM).
-
PC
Software. This index demonstrates Harvest's ability to incorporate
information in a variety of formats from other sources, including high
quality, manually-generated information sources. Because each indexed
site uses a somewhat different format, we used Harvest's customizable
extraction features to collect indexing information in site-specific
ways, and place this information into a uniform format. As a result of
this effort, we were quickly able to incorporate high quality indexing
information about nearly 30,000 publicly available PC software
distributions. This index provides better search support than more
general-purpose software indexes (such as Archie),
because it contains conceptual descriptions of a focused collection of
information. For example, searching for "batch programming language"
will locate the "RAP" package, while Archie could only locate this
object if you searched for "RAP".
We have also built Gatherer translation scripts for some other manually
created indexing information formats, including the ``Linux Software
Map'' (LSM) format and the Internet Anonymous FTP Archives IETF
Working Group (IAFA) format. At present we have a Broker running for LSM
data but none for IAFA data, because there are not yet enough sites
using the IAFA format to warrant building a Broker.
-
Computer Science technical reports. This index covers content
summaries of over 24,000 reports from 300 sites, published in a variety
of formats (ASCII, PostScript, DVI, HTML, etc.). Content summaries
support more powerful searches than the titles/abstracts covered by
previously existing CS technical report indexes (such as those offered
by Monash
University, Indiana
University, and the University of
Karlsruhe). The current index is possible because Harvest provides
a very space-efficient indexing architecture.
-
Networked Information Discovery and Retrieval (NIDR) software
and documents,
and a software +
documents index built by cascading the separate indexes into a
combined index (at no additional network or server load). These indexes
underscore the scaling advantages of topic-specific indexing. For
example, the query ``approximate'' will locate agrep (an approximate
match tool embedded in Harvest's indexing system), while the same query
at our more general Computer Science technical reports index (above)
locates many unrelated papers.
-
Documents referencing the
Santa
Fe Institute time series competition data.
This index demonstrates Harvest's
structured indexing capability and its
indexing customizability:
in addition to supporting the usual content summary index,
the SFI time series broker allows users to search by time series reference.
These references were
generated by a corpus-specific script attached to the indexing process that
matches each document content
summary against approximately 70 regular expressions, to
heuristically determine the referenced time series.
-
The
NetNews
index demonstrates how Harvest can work with a rapidly
changing database such as network news. Rather than indexing individual
messages, here we only use the newsgroup ``overviews.'' An overview is a
list of the subject, sender, message-id, and other information for each
message in a newsgroup. This allows us to create a good index of news
articles without needing to retrieve each article individually.
Unfortunately, an article's subject line often does not reflect its
content. Currently this broker contains only newsgroups from the 'comp'
hierarchy. The database is updated daily.
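
The overview-based approach described for the NetNews index can be
illustrated with a short Python sketch. It is not Harvest's actual
Gatherer code. It assumes the tab-separated overview layout that NNTP
servers commonly return (article number, subject, sender, date,
message-id, references, byte count, line count), and the function
names, field names, and sample lines are hypothetical. The sketch
builds a word-to-message-id index from subject lines alone, without
retrieving any article bodies.

```python
from collections import defaultdict

# Assumed field order of one tab-separated overview line.
# This layout is an assumption for illustration, not taken from the Harvest page.
OVERVIEW_FIELDS = ("number", "subject", "sender", "date",
                   "message_id", "references", "bytes", "lines")


def parse_overview_line(line):
    """Split one overview line into a dict keyed by OVERVIEW_FIELDS."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(OVERVIEW_FIELDS, values))


def build_subject_index(overview_lines):
    """Map each lowercased subject word to the message-ids whose subject uses it."""
    index = defaultdict(set)
    for line in overview_lines:
        record = parse_overview_line(line)
        for word in record.get("subject", "").lower().split():
            index[word].add(record.get("message_id", ""))
    return index


# Hypothetical overview lines for two articles in one newsgroup.
sample = [
    "1\tHarvest gatherer setup\talice@example.org\t12 Jun 1995\t<a1@example.org>\t\t1200\t30",
    "2\tRe: Harvest gatherer setup\tbob@example.org\t13 Jun 1995\t<b2@example.org>\t<a1@example.org>\t900\t22",
]

index = build_subject_index(sample)
print(sorted(index["harvest"]))
```

Because only subject words are indexed, a query can match an article
only on what its subject line says, which is the limitation noted
above for articles whose subject does not reflect their content.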
You can also browse and search the list of available Harvest servers
(including instances of Gatherers, Brokers, Object Caches, and
Replication Managers) by contacting the Harvest
Server Registry.