Web Lab status
Last updated April 2008
Non-Cornell Researchers
The goal of the Web Lab is to provide research access for all approved academic and scientific researchers. At present, however, we can support only Cornell researchers, for a combination of technical and legal reasons.
Collections and the Web Lab database
Web Crawls
For most of its history, the Internet Archive collected web data in discrete crawls. The page data is stored in compressed form in large files, called ARC files. Until recently, each ARC file had an associated file of metadata called a DAT file. See the documentation for the ARC and DAT formats.
The individual crawls are distinguished by two-letter codes, such as EB, which is a crawl from the early part of 2005. Recently the Internet Archive has moved towards continuous crawling. See http://www.weblab.infosci.cornell.edu/tools/crawl_sizes_statistics for a partial list of crawls and basic statistics about them. This page also shows which crawls the Web Lab has downloaded.
The Web Lab database
The Web Lab database is a relational database. It is divided into collections, all of which have the same schema. The schema provides metadata about pages, links, URLs, hosts, etc. See the documentation pages for the details of the schema.
There is a separate collection for each complete crawl that has been loaded into the database. In addition, there are special collections that are available for research. See http://www.weblab.infosci.cornell.edu/tools/database_sizes_statistics for a partial list of the collections and basic statistics about them. Beyond the collections shown on that page, researchers can generate sub-collections and temporary collections. See the documentation for GetPages for further information.
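The idea of several collections sharing one schema, with a temporary sub-collection extracted from one of them, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the table and column names, the collection names crawl_a and crawl_b, and the helper functions are hypothetical, and SQLite stands in for the actual database. It is not the Web Lab schema and it is not how GetPages works.

```python
import sqlite3

# Hypothetical, simplified schema. The real Web Lab schema is described in
# its own documentation and will differ.
SCHEMA = """
CREATE TABLE {c}_pages (page_id INTEGER PRIMARY KEY, url TEXT, host TEXT);
CREATE TABLE {c}_links (src_page_id INTEGER, dst_url TEXT);
"""


def create_collection(conn, name):
    """Create one collection; every collection gets the same tables."""
    conn.executescript(SCHEMA.format(c=name))


def make_subcollection(conn, source, target, host_suffix):
    """Copy pages whose host ends with host_suffix into a temporary table."""
    conn.execute(
        f"CREATE TEMP TABLE {target}_pages "
        "(page_id INTEGER PRIMARY KEY, url TEXT, host TEXT)"
    )
    conn.execute(
        f"INSERT INTO {target}_pages "
        f"SELECT page_id, url, host FROM {source}_pages WHERE host LIKE ?",
        ("%" + host_suffix,),
    )


conn = sqlite3.connect(":memory:")
create_collection(conn, "crawl_a")
create_collection(conn, "crawl_b")

# Illustrative rows only, not real crawl data.
conn.executemany(
    "INSERT INTO crawl_a_pages (url, host) VALUES (?, ?)",
    [
        ("http://www.cornell.edu/", "www.cornell.edu"),
        ("http://example.com/a", "example.com"),
        ("http://www.cs.cornell.edu/x", "www.cs.cornell.edu"),
    ],
)

make_subcollection(conn, "crawl_a", "edu_subset", ".edu")
subset_count = conn.execute(
    "SELECT COUNT(*) FROM edu_subset_pages"
).fetchone()[0]
print(subset_count)
```

Because the temporary table lives only for the lifetime of the connection, it disappears when the session ends, which is one plausible reading of what a temporary collection means here.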
Profiles of the Web Lab collections
Profiles are available for several of the collections and sub-collections. For page frequency distributions by domain, see: http://www.weblab.infosci.cornell.edu/tools/page_frequency_distributions.
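As a rough illustration of what a page frequency distribution by domain involves, the sketch below counts pages per top-level domain for a handful of URLs. It assumes that "domain" means the top-level domain (edu, com, and so on), which the profile pages may define differently. The sample URLs and the function names are made up for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_level_domain(url):
    """Return the last label of the URL's host name, e.g. 'edu'."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1]


def pages_per_domain(urls):
    """Count how many page URLs fall under each top-level domain."""
    return Counter(top_level_domain(u) for u in urls)


# Illustrative URLs only, not taken from any Web Lab collection.
sample_urls = [
    "http://www.cornell.edu/",
    "http://www.cs.cornell.edu/courses/",
    "http://example.com/index.html",
    "http://example.org/",
]

distribution = pages_per_domain(sample_urls)
for domain, count in sorted(distribution.items()):
    print(domain, count)
```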
Data Analysis and the Hadoop cluster
There are three ways to carry out data analysis on the Web Lab collections:
The Hadoop cluster
The Web Lab operates a cluster of Linux computers running the Hadoop file system and configured for MapReduce programming tasks. When a planned upgrade is completed in summer 2008, this cluster will consist of approximately 60 nodes, each with 4 processors and 8 GB of memory. The total disk capacity will be 270 TB, though the net space available will be substantially less. Anybody who wishes to use this cluster should contact Lucy Walle (lwalle@cac.cornell.edu).
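For readers unfamiliar with the MapReduce style of programming, the following is a minimal single-machine sketch of its three phases (map, shuffle, reduce), counting outgoing links per source host. It runs in plain Python and does not use the Hadoop API. The record format, the function names, and the sample links are all assumptions made for illustration; a job on the cluster would be written against Hadoop itself.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_link(record):
    """Map step: emit (source host, 1) for one (source URL, target URL) link."""
    source_url, _target_url = record
    yield urlsplit(source_url).hostname, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the values collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_link(record))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# Illustrative link records only, not real crawl data.
links = [
    ("http://a.example/1", "http://b.example/"),
    ("http://a.example/2", "http://c.example/"),
    ("http://b.example/", "http://a.example/1"),
]

out_links = run_job(links)
print(out_links)
```

On a cluster, the map and reduce steps run in parallel on different nodes and the framework performs the shuffle, which is why each step here is written as an independent function that sees only its own input.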
The Web Site
This web site, http://www.weblab.infosci.cornell.edu/, provides a steadily increasing set of tools for small-scale data analysis. The primary set of tools enables the researcher to create a temporary collection from the Web Lab database and to download it for analysis. See the Researcher's Area for information about what tools are currently available. Researchers need to register through the web site before using these tools.
The Database server
Because of the size of the collections, large database tasks can run for many hours or even days. For this reason, very few people are authorized to run these jobs directly. If you wish to extract a subset or carry out an analysis that cannot be done through the web site, contact Manuel Calimlim (calimlim@cs.cornell.edu).
