Cornell University

The Web Laboratory: Overview

Since 1996, the Internet Archive has collected snapshots of the Web approximately every other month.  At the end of 2007, the collection held more than 110 billion Web pages.  This data is mounted on servers in San Francisco.  The Wayback Machine is an openly accessible service that returns all the pages that match a specified URL.

The Cornell Web Lab is a project to give researchers access to this collection for analysis in greater depth than is possible with the Wayback Machine.  The overall system is shown in the figure.

[Figure: the overall Web Lab system]

  • The data is divided into collections of Web pages.  A collection can be a complete Web crawl or a specialized collection.  Structural metadata about each collection is stored in a relational database on a dedicated server.  The database has records for each page, link between pages, and URL.
  • Most research projects begin by selecting a set of Web pages for detailed analysis, which creates a working collection.  This selection can be carried out from this Web site, or by special request to the database administrator.  The working collection is then downloaded to the computer system that will be used for data analysis.  The Web Lab has a dedicated Linux cluster for data analysis.  Alternatively, the working collection can be downloaded to any computer on the network.
  • The data analysis cluster also provides storage for the actual Web pages. It runs the Hadoop distributed file system.
  • Data analysis tools are based on the map/reduce programming paradigm, which is a central component of the Hadoop system.  The Web Lab is steadily building up a set of tools and expertise in using map/reduce for large-scale data analysis.
  • The Web Lab also provides a substantial file server.  This is used for downloading data from the Internet Archive and as working storage for research projects.
  • Finally, the Web Lab is connected to the TeraGrid, which provides a very high bandwidth connection to the NSF supercomputing centers.
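
The map/reduce paradigm mentioned above can be illustrated with a minimal sketch that counts in-links per URL from a set of link records.  It is not the Web Lab's own tooling and does not use the Hadoop API: it imitates the map, shuffle, and reduce phases in a single Python process.  The link records, the URLs, and all function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical link records: (source URL, destination URL) pairs,
# standing in for the link records described above.
LINKS = [
    ("http://a.example/", "http://b.example/"),
    ("http://a.example/", "http://c.example/"),
    ("http://b.example/", "http://c.example/"),
    ("http://c.example/", "http://a.example/"),
    ("http://b.example/", "http://c.example/news"),
]


def map_link(link):
    """Map phase: emit (destination URL, 1) for each link record."""
    _source, destination = link
    yield destination, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the values for one key to get its in-link count."""
    return key, sum(values)


def run_job(links):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    mapped = (pair for link in links for pair in map_link(link))
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


inlinks = run_job(LINKS)
for url, count in sorted(inlinks.items()):
    print(url, count)
```

On a real cluster the framework would distribute the map and reduce calls across machines and perform the grouping itself; the per-record and per-key logic is the part an analysis tool would supply.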

For current information about each of these components, see the Researchers' Area.