The Web Laboratory: Documentation
Database schema
- Database schema. The main database and the smaller collections (such as the Amazon collection) use the same schema.
Database access tools
- GetPages. This tool retrieves a subset of the data in a Web Lab database. You specify which fields to include in the output and the restrictions that select which records to retrieve.
- GetCrawls. This tool shows the contents of the database's 'Crawl' table.
Internet Archive Crawler
- WebBack Crawler script. This tool can crawl both the live web and the Internet Archive. After crawling, the script can analyze different aspects of the crawl and output metadata to a CSV file.
Analysis tools
- HeritrixWebLab. This is a customized version of the open-source Heritrix web crawler.
- VizTool. A simple tool for interactively exploring the web graph of a selected set of seed pages.
File formats
- ARC/DAT. Raw data as received from the Internet Archive is in the ARC and DAT formats.
- TSV files. The WebLab TSV (Tab-Separated Values) file format is used by the tools that save some subset of the data from the WebLab database to a file.
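
As a rough illustration of working with tab-separated output, the sketch below reads TSV text into a list of records using Python's standard csv module. It is not the WebLab tooling itself. The column names (`url`, `crawl_id`), the presence of a header row, and the helper name `read_tsv` are assumptions made for the example; consult the TSV file format documentation for the actual layout.

```python
import csv
import io


def read_tsv(stream, has_header=True):
    """Parse tab-separated text from a file-like object.

    Returns a list of dicts keyed by the header row when has_header is
    True, otherwise a list of lists. Whether WebLab TSV files carry a
    header row is an assumption of this sketch.
    """
    # QUOTE_NONE treats quote characters as ordinary data, so a field is
    # split only on tabs.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    if has_header and rows:
        header, data = rows[0], rows[1:]
        return [dict(zip(header, row)) for row in data]
    return rows


# Hypothetical sample data; the real column names may differ.
sample = "url\tcrawl_id\nhttp://example.com/\t1\n"
records = read_tsv(io.StringIO(sample))
```

To read from disk instead, pass an open file object, for example `open(path, newline="", encoding="utf-8")`, in place of the `io.StringIO` wrapper.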
