Database Schema Overview
The data in the database is sourced from multiple web crawls performed by the Internet Archive. A web crawl is the set of data collected by a web crawler. A web crawler starts with a set of web page addresses (URLs), called seeds, and visits each in turn. It then visits the pages that these seeds link to, and so on. The Internet Archive's web crawls were taken at two-month intervals, and each took several weeks to perform - so a web crawl is not an instantaneous "snapshot" of the world wide web.
The schema is based on the notions of URL, Page and Link. A URL is a web page address, such as "http://www.google.com/intl/en/about.html". A Page is whatever data the crawler received when it tried to visit a URL at a particular time (this time is recorded as ArchiveTime in the database). This can be the text of a web page, an image, an error message or another file. If the crawler received a web page (and not, for example, an image), it will look through (parse) this page to find any web addresses (URLs) that this page links to. These are recorded as Links in the database.
To recap: the database consists of URLs (web page addresses), Pages (data received when the crawler visited a given URL at a specific time), and Links (URLs linked to by a Page).
It is important to understand that this schema is asymmetrical: given a Page we can easily find what URLs it links to, but given a URL we cannot easily find what URLs it links to. This is because a URL cannot itself "link to" another URL. A URL can be used to find a Page, which is a snapshot of that URL at a specific time, and then we can find what URLs that Page links to. This is of particular concern for applications like graph analysis where we wish to generate a graph of links between URLs. This is not an easy problem and involves many decisions and trade-offs.
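The two-step lookup described above (URL to Page, then Page to outgoing Links) can be sketched with a query. The following is a minimal illustration using SQLite; the simplified table and column names (UrlID, PageID, ToUrlID, UrlText) are assumptions for this sketch, not the exact production definitions, and the real database uses binary hashes rather than small integers for identifiers.

```python
import sqlite3

# In-memory sketch of the URL / Page / Link schema.
# Names and types here are simplified assumptions.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Url  (UrlID INTEGER PRIMARY KEY, UrlText TEXT);
    CREATE TABLE Page (PageID INTEGER PRIMARY KEY, UrlID INTEGER,
                       ArchiveTime TEXT);
    CREATE TABLE Link (PageID INTEGER, ToUrlID INTEGER);
""")

# One URL, one Page snapshot of it, and two outgoing Links.
con.execute("INSERT INTO Url VALUES (1, 'example.com/index.html')")
con.execute("INSERT INTO Url VALUES (2, 'example.com/a.html')")
con.execute("INSERT INTO Url VALUES (3, 'example.com/b.html')")
con.execute("INSERT INTO Page VALUES (10, 1, '2001-03-15')")
con.execute("INSERT INTO Link VALUES (10, 2)")
con.execute("INSERT INTO Link VALUES (10, 3)")

# To ask "what does URL 1 link to?" we must go through a Page
# (a snapshot of that URL at a specific time) - the URL itself
# has no links of its own.
rows = con.execute("""
    SELECT u2.UrlText
    FROM Page p
    JOIN Link l  ON l.PageID = p.PageID
    JOIN Url  u2 ON u2.UrlID = l.ToUrlID
    WHERE p.UrlID = 1
    ORDER BY u2.UrlText
""").fetchall()
print([r[0] for r in rows])
```

Note that if the same URL was crawled several times, this query returns links from all of its Pages; a graph-building application would first have to decide which snapshot(s) to use.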
In addition to the main Link, Page and URL tables (outlined above), the database contains the following:
- Host Table. This is used in conjunction with the URL table to store URLs. The reason for using it is to reduce the duplication of data. Since there may be many URLs starting with the same host, the decision was made to assign each host a numerical identifier. Both the Host and URL tables need to be queried to retrieve a complete URL.
- Crawl Table. This is a simple table consisting of one row for each web crawl in the database. Each crawl has a numerical identifier and a textual name, related to the identifier used by the Internet Archive.
- UrlFullText, PageFullText, LinkFullText, HostFullText tables. All of the textual fields in the URL, Page, Link and Host tables are limited to at most 900 characters in length. This is for performance reasons (MSSQL will not index longer text). In some rare cases, the data may contain fields longer than this limit. In this case, a null is stored in the URL, Page, Link or Host table, and the full text is stored in the corresponding field of the corresponding FullText table.
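The Host/URL split and the FullText overflow convention described above can be sketched together: reconstructing a complete URL requires joining Host and URL, and taking the field from the FullText table whenever the main table holds a null. This SQLite sketch uses simplified, assumed table and column names (the real tables key on binary hashes):

```python
import sqlite3

# Sketch of the Host/URL split and the FullText overflow convention.
# Table and column names are simplified assumptions for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Host        (HostID INTEGER PRIMARY KEY, HostName TEXT);
    CREATE TABLE Url         (UrlID INTEGER PRIMARY KEY, HostID INTEGER,
                              Path TEXT);  -- NULL when over the 900-char limit
    CREATE TABLE UrlFullText (UrlID INTEGER PRIMARY KEY, Path TEXT);
""")

con.execute("INSERT INTO Host VALUES (1, 'www.cs.cornell.edu')")
# Normal case: the path fits within the 900-character limit.
con.execute("INSERT INTO Url VALUES (10, 1, '/courses/cs430/index.html')")
# Overflow case: Url.Path is NULL; the full text lives in UrlFullText.
long_path = "/" + "x" * 1000
con.execute("INSERT INTO Url VALUES (11, 1, NULL)")
con.execute("INSERT INTO UrlFullText VALUES (11, ?)", (long_path,))

# Reconstruct complete URLs: join Host to URL, and fall back to the
# FullText table wherever the main table stored a null.
rows = con.execute("""
    SELECT h.HostName || COALESCE(u.Path, f.Path) AS FullUrl
    FROM Url u
    JOIN Host h ON h.HostID = u.HostID
    LEFT JOIN UrlFullText f ON f.UrlID = u.UrlID
    ORDER BY u.UrlID
""").fetchall()
print(rows[0][0])
```

The COALESCE fallback means application code can use one query for both the common case and the rare overflow case.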
Detailed Database Schema
UrlFullText Table Description
Overflow text for a URL (either path, file extension, or query string).
| Column Name | Column Type | Length | Column Description |
|---|---|---|---|
| UrlID | binary | 16 | Hash of the full URL (w/o protocol or port) |
| HostID | binary | 16 | Hash of the hostname from the URL (e.g. 'www.cs.cornell.edu') |
| Path | text | 16 | Full text of the path of the URL, not including the filename extension (e.g. '/courses/cs430/index') |
| Extension | text | 16 | Full text of the filename extension (e.g. '.html' or '.asp') |
| QueryString | text | 16 | Full text after the ? (e.g. '?fileid=7&download=US') |