RBSE's URL database

http://rbse.jsc.nasa.gov/eichmann/urlsearch.html (World Wide Web Directory, ~04/1995)

This index is a collection of URL references built up and indexed with a hacked version of WAIS. The index is constructed by a spider that walks the web, building a graph in an Oracle database and WAIS-indexing the full text of each document. There are currently 36,195 documents in the index.
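As a rough illustration of a spider that walks links and records a graph, here is a minimal sketch. It is not the RBSE spider's code: the function names, the toy link map, and the in-memory edge list are all assumptions made for the example, and it never touches the network. It walks breadth-first with a depth limit, which matches how the probe is described further down this page.

```python
from collections import deque

def crawl(start, get_links, max_depth):
    """Breadth-first walk from `start`; returns a list of (source, target) edges.

    `get_links` is a hypothetical callable returning the URLs a page links to.
    Pages sitting at `max_depth` are recorded as targets but not expanded.
    """
    edges = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # at the depth limit: do not follow this page's links
        for target in get_links(url):
            edges.append((url, target))  # every link is an edge, even to a seen page
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return edges

# Hypothetical toy "web" standing in for real HTTP fetches and HTML parsing.
TOY_WEB = {
    "http://a/": ["http://a/x.html", "http://b/"],
    "http://a/x.html": ["http://a/"],
    "http://b/": ["http://b/y.html"],
    "http://b/y.html": [],
}

def toy_links(url):
    return TOY_WEB.get(url, [])

edges = crawl("http://a/", toy_links, max_depth=2)
```

In a real spider, `get_links` would fetch the page and extract its anchors, and each edge would be written to a database table rather than appended to a list.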

Here are some stats concerning the spider's graph:

                 Distinct  Distinct    Total
  Date     Time   Sources   Targets    Edges    Notes
=======  =======  =======   =======   =======   =====
2/20/94   9:50AM   13,082    24,421   103,417
          5:00PM   13,789    33,715   118,930
          9:30PM   14,490    37,981   128,541
2/21/94   8:30AM   16,690    48,341   162,226
         11:15AM   17,278    53,803   171,957 
         12:15PM   17,617    62,397   182,880
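The three counts in the table can be read as simple aggregates over a table of links. The sketch below shows one way to compute them, under loud assumptions: SQLite stands in for Oracle, the `edge` table and its column names are hypothetical, and "Total Edges" is assumed to count every stored row, repeated links included. The page does not say how the real figures were produced.

```python
import sqlite3

def graph_stats(edges):
    """Return distinct sources, distinct targets, and total edges for a link list."""
    conn = sqlite3.connect(":memory:")  # SQLite used here only as a stand-in
    conn.execute("CREATE TABLE edge (source TEXT, target TEXT)")  # hypothetical schema
    conn.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    row = conn.execute(
        "SELECT COUNT(DISTINCT source), COUNT(DISTINCT target), COUNT(*) FROM edge"
    ).fetchone()
    conn.close()
    return {
        "distinct_sources": row[0],
        "distinct_targets": row[1],
        "total_edges": row[2],
    }

# Made-up sample links, not data from the spider.
sample = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b")]
stats = graph_stats(sample)
```

A page that links out is a source, a page that is linked to is a target, and one page can be both, so the first two columns overlap rather than sum to a page count.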

Note that this is only a snapshot of a portion of the web: effectively a five-level breadth-first probe from our home page. (It's not a complete probe because I ran out of tablespace in Oracle...) The index was constructed using the source HTML documents in the graph and those target documents in the graph that were identifiably HTML, meaning URLs matching patterns of the form "*.html" or "http:*/" (i.e., links that use http as their protocol and point at default pages for directories).
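The two URL patterns quoted above behave like shell-style wildcards. A minimal sketch of such a filter follows, assuming glob semantics in which `*` matches any run of characters, slashes included. The function name and the sample URLs are invented for illustration, and the spider's actual matching code is not shown on this page.

```python
from fnmatch import fnmatchcase

# The two patterns quoted in the text, treated as case-sensitive globs.
HTML_PATTERNS = ("*.html", "http:*/")

def looks_like_html(url):
    """True if the URL matches either pattern (hypothetical helper)."""
    return any(fnmatchcase(url, pattern) for pattern in HTML_PATTERNS)

# Made-up candidate targets.
candidates = [
    "http://rbse.jsc.nasa.gov/",        # http directory default page
    "http://example.org/docs/a.html",   # ends in .html
    "ftp://example.org/readme.txt",     # neither pattern
    "gopher://example.org/",            # directory, but not http
    "http://example.org/logo.gif",      # http, but not a directory or .html
]
indexable = [url for url in candidates if looks_like_html(url)]
```

Note that a filter like this judges by URL shape alone, so HTML served under other names (for example ".htm") would be skipped, while any URL ending in ".html" is accepted regardless of protocol.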

This service is offered as part of the experimental prototype under construction by the Repository Based Software Engineering project.

Pointers to other Spiders

Koster's list of beasties

Papers and Presentations on Spiders

eichmann@rbse.jsc.nasa.gov