Cornell University

Heritrix for the Web Lab: a Tutorial

Table of Contents

Introduction

Installation

How to Crawl the Web Lab

Known Bugs

Extend Weblab

Appendix A - Wayback Machine Modifications to Page Content

Introduction to HeritrixWeblab

HeritrixWeblab is a customized version of Heritrix, an open-source, industrial-strength Java crawler developed and maintained by The Internet Archive. The customizations in HeritrixWeblab allow Heritrix to crawl the web as archived in The Web Lab databases. Moreover, if the contents of a page are not available at the Web Lab, HeritrixWeblab can fetch the missing content through the Internet Archive's Wayback Machine.

HeritrixWeblab was built on top of Heritrix version 1.10.0. Thus, the term HeritrixWeblab refers to the collection of this core of Heritrix and customizations added on top of it through the packages heritrixWeblab 1.0 and weblabWS 1.0.1.

In essence, HeritrixWeblab is a tool designed to allow researchers from various academic fields to gain access to the valuable data being collected by The Web Lab at Cornell University.

 

How to install HeritrixWeblab

There are two types of HeritrixWeblab installations, each catering to a different set of users. Common prerequisites to these different types of installations are:

Binary Installation

This section is geared toward users who simply want a binary version of Heritrix along with the customizations made for The Web Lab. The binary version of HeritrixWeblab can be downloaded from here. Simply untar the downloaded file into your desired directory and set the variable $HERITRIX_HOME to this installation directory.

Source Installation

This type of installation is for users who want to build HeritrixWeblab from source, possibly to extend its capabilities. The source for HeritrixWeblab, heritrix-1.10.0, can be downloaded from here. It already contains the libraries and modifications necessary to run the Web Lab customizations. However, you can also download the source of the customization packages, namely weblabWS-1.0.1 and heritrixWeblab-1.0, from the same place. This document uses the variable $HERITRIX_SRC_HOME to point to the installation directory of the source package heritrix-1.10.0. To build HeritrixWeblab, you will need to install the following additional software:

Apache Maven 1.0.2 (Declare a variable $MAVEN_102_HOME pointing to its installation directory)

Building Web Lab Heritrix Customization Packages (Optional)

Typically, this step can be skipped unless one wants to change the code written for customization of Heritrix core. Download the two Web Lab customization packages weblabWS-1.0.1 and heritrixWeblab-1.0 as mentioned above.  Also install the following version of Maven, in addition to the version installed above.

Apache Maven  2.0.4 (Declare a variable $MAVEN_204_HOME pointing to its installation directory)

Why is a separate version of Maven needed? Because the core Heritrix code was developed using the older Maven 1.0.2 release, while the customizations were developed using Maven 2.

Now run the following commands to build and install these packages.

cd /path/to/weblabWS-1.0.1

$MAVEN_204_HOME/bin/mvn install

cd /path/to/heritrixWeblab-1.0

$MAVEN_204_HOME/bin/mvn install

cp /path/to/weblabWS-1.0.1/target/weblabWS-1.0.1.jar $HERITRIX_SRC_HOME/lib

cp /path/to/heritrixWeblab-1.0/target/heritrixWeblab-1.0.jar $HERITRIX_SRC_HOME/lib

Building Heritrix (Required)

 Execute the following commands to build Heritrix:

cd $HERITRIX_SRC_HOME

$MAVEN_102_HOME/bin/maven dist

The build process will now start. Go grab a cup of tea; this process can take up to 20 minutes. When it completes, the directory $HERITRIX_SRC_HOME/target/heritrix-1.10.0 will have been created. Create a variable $HERITRIX_HOME and make it point to this directory.
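As a sketch, the variable can be set in a POSIX shell as shown here. The fallback location $HOME/heritrix-1.10.0 for $HERITRIX_SRC_HOME is only an assumption for illustration; substitute the directory where you unpacked the source.

```shell
# Assumption: the source tree lives in $HOME/heritrix-1.10.0 unless
# HERITRIX_SRC_HOME is already set; adjust this to your own location.
HERITRIX_SRC_HOME="${HERITRIX_SRC_HOME:-$HOME/heritrix-1.10.0}"

# The build output directory named in the text above.
HERITRIX_HOME="$HERITRIX_SRC_HOME/target/heritrix-1.10.0"
export HERITRIX_SRC_HOME HERITRIX_HOME

echo "HERITRIX_HOME is $HERITRIX_HOME"
```

Adding these lines to your shell startup file keeps the variables available in later sessions.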

 

How to Crawl the Weblab Using HeritrixWeblab

This section of the tutorial assumes that you have followed the directions in the installation section, and installed HeritrixWeblab on your machine.

Starting HeritrixWeblab

To start HeritrixWeblab, enter the following command in a shell:

$HERITRIX_HOME/bin/heritrix --admin=admin:letmein

where:

$HERITRIX_HOME = Path to HeritrixWeblab installation directory.

 admin:letmein = The username and password for this installation which default to the shown values.

After a few moments, you should see the following output in your shell:

Sat Nov 25 17:53:17 EST 2006 Starting heritrix.....

Heritrix 1.10.0 is running.

Web console is at: http://127.0.0.1:8080

Web console login and password: admin/letmein

This output indicates that HeritrixWeblab was successfully started on port 8080 of your local machine.

Changing HeritrixWeblab Port

By default, HeritrixWeblab starts on port 8080. If the default port is not acceptable, it can be changed by editing the heritrix.cmdline.port property in $HERITRIX_HOME/conf/heritrix.properties.
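The edit can also be made from the shell. The following is a minimal sketch, not part of HeritrixWeblab: the helper name set_heritrix_port is hypothetical, and it assumes the property sits on a single line of the form "heritrix.cmdline.port = 8080" (with or without spaces around the equals sign). Check your own heritrix.properties before relying on it.

```shell
# Hypothetical helper: rewrite the heritrix.cmdline.port line of a
# properties file. Usage: set_heritrix_port FILE PORT
set_heritrix_port() {
  props_file=$1
  new_port=$2
  tmp_file="$props_file.tmp.$$"
  # Replace only the value on the line that defines the property;
  # every other line passes through unchanged.
  sed "s/^\([[:space:]]*heritrix\.cmdline\.port[[:space:]]*=\).*/\1 $new_port/" \
    "$props_file" > "$tmp_file" && mv "$tmp_file" "$props_file"
}

# Example (commented out so the snippet has no side effects when sourced):
# set_heritrix_port "$HERITRIX_HOME/conf/heritrix.properties" 9090
```

The helper writes to a temporary file and then moves it into place, which avoids relying on the in-place option of sed that differs between implementations. With port 9090, the web console address in the later steps would become http://localhost:9090.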

Interacting with HeritrixWeblab Web Interface

With HeritrixWeblab successfully started, you can point your browser to the URL http://localhost:8080/login.jsp to see the HeritrixWeblab web interface. The following page should be visible to you:

[Screenshot: HeritrixWeblab login page]


Enter the appropriate user name and password in these fields. They are the same as those given on the command line in the previous step. Click on Login. This should open up the HeritrixWeblab administration console:

[Screenshot: HeritrixWeblab administration console]


Administration Console

The HeritrixWeblab administration console is divided into the following views. A detailed description of each of these views can be obtained from The Heritrix User Manual.

Console

The console view contains information about the currently running crawl. It lists basic information such as the number of URLs fetched, total memory consumption and total time used. It also shows a progress bar for a running crawl that shows the progress in terms of the currently scheduled URLs in the crawl frontier.

Jobs

A Job in Heritrix is synonymous with a crawl. The Jobs view lets one create and manage jobs.

Profile

Jobs can be created based upon Profiles. A profile allows one to set defaults (e.g. the seed list or the link extraction processors) that apply to all jobs based on that profile. The Profile view lets one create and manage profiles.

Logs

Each crawl job may generate log files which can be viewed in the Logs view.

Report

The Report view contains high-level reports on the status of a crawl, including reports on the fetching of seeds and on progress statistics.

Settings

The Settings view lets one fine-tune the parameters set for a given crawl on a per-host basis.

Help

This view is self-explanatory.

Creating your first Job

To create your first crawl job, click on the Jobs view in the Administration Console. As a result, you should see the following web page:

[Screenshot: Jobs page]


The Create New Job section offers various ways of creating a new crawl job. Pending Jobs, on the other hand, lists jobs that have been created but not started. Lastly, Completed Jobs lists jobs that have been completed along with various options for each.

A description of the various ways of creating a new job can be found in the Crawl Job section of the Heritrix user manual. For now, let's focus on creating a job from scratch by clicking on the With defaults link. The following page should become visible:

[Screenshot: new job page]


The name of a crawl and its description together give the crawl job a human-readable label. The Seeds text box takes the crawl seed URLs, one per line. For the purpose of this tutorial, let's choose the following values:

[Screenshot: values chosen for the tutorial job, including its seeds]


Selecting Job Modules

Now click on the Modules link to further refine this job. The following page shows up:

[Screenshot: Modules page]

This page shows a list of processors that can be applied to each URL during the course of a crawl. The major modules visible include:

Crawl Scope

Crawl Scope defines the conditions under which a URL is considered within the scope of a crawl. A URL is within the crawl scope if it should be fetched for that crawl. By default, HeritrixWeblab chooses org.archive.crawler.deciderules.DecidingScope, which makes the scoping decisions based on a list of rules to be discussed under 'Decide Rules'. For the purpose of this job, let's keep the HeritrixWeblab default. If we wanted to override this default, we could choose a substitute from the drop down menu associated with the Crawl Scope module and click on Change.

Frontier

The Frontier module provides a persistent way of keeping track of the URLs scheduled to be visited in the crawl. Initially this module contains only the seeds of the crawl, but as the crawl proceeds newly discovered URLs are added to this module. HeritrixWeblab, by default, chooses org.archive.crawler.frontier.BdbFrontier which is a reasonable choice. The default could be overridden using the drop down menu of the Frontier.

Processor Chains

Heritrix employs the concept of processor chains, each containing a list of processors that perform the same logical operation in distinct ways. The major processor chains in Heritrix are:

Pre-Processor Chain

Processors in this chain handle a URL before its contents are fetched. These processors can be thought of as performing last-minute checks to ensure that a URL is within the crawl scope and that its contents can be fetched. The default Heritrix processors in this chain include org.archive.crawler.prefetch.Preselector and org.archive.crawler.prefetch.PreconditionEnforcer. A description of these processors can be obtained by clicking the Info link next to their entries in the Pre-Processor chain. For the purposes of HeritrixWeblab, we do not need the PreconditionEnforcer: because we are fetching page contents from The Web Lab web service, we do not need the DNS fetches that PreconditionEnforcer enforces on the crawl. In fact, our crawl cannot proceed without removing this processor, so go ahead and remove it from the Pre-Processor chain by clicking on Remove next to its name.

Fetch Chain

This chain of processors is responsible for fetching a URL's content. By default, Heritrix adds org.archive.crawler.fetcher.FetchDNS and org.archive.crawler.fetcher.FetchHTTP to this chain; the former performs DNS lookups, while the latter fetches page content from the World Wide Web (WWW). Since at the Web Lab we are primarily interested in the URLs stored in our databases, we do not need to fetch page content from the web. Remove FetchHTTP from the Fetch Chain by clicking Remove next to it. With FetchHTTP removed, let's add the HeritrixWeblab custom fetcher to fetch page content from The Web Lab. From the drop down menu associated with the Fetch Chain, choose edu.cornell.tc.weblab.heritrix.fetcher.FetchWebLab and click on Add. This should result in the following page:

[Screenshot: Fetch Chain with FetchWebLab added]

The Up and Down buttons allow one to change the position of a processor within a processor chain. The higher a processor is in the chain, the earlier it gets a chance to process a URL. There is another custom HeritrixWeblab fetcher that may be of interest to users. If a URL was found in the Weblab database but its corresponding content was not, the content can be fetched from the Internet Archive's Wayback Machine. To see how this works, let's add the processor edu.cornell.tc.weblab.heritrix.fetcher.FetchWayBackMachine to the Fetch processor chain by selecting it from the drop down menu and clicking on Add. Notice that this new processor gets added to the end of the Fetch processor chain. In this case, the order of processors is very important because the FetchWayBackMachine processor only fetches pages whose URLs were found in the Weblab databases. FetchWebLab determines whether a URL exists in the Weblab database, so it must precede FetchWayBackMachine for the latter to perform its task.

Extractor Chain

This chain of processors is responsible for extracting outlinks from the content fetched for each page. By default, HeritrixWeblab adds a handful of extractors, each of which is designed to extract links from a different type of content. You can get details on each by clicking on Info next to the processor of interest. In the Web Lab, however, the outlinks of a page are already stored in the Weblab database, so we are not interested in harvesting them out of fetched page content. Therefore, let's remove the default extractors, one at a time, by clicking on Remove, and add the custom HeritrixWeblab extractor edu.cornell.tc.weblab.heritrix.extractor.ExtractorWebLab from the drop down menu. The resulting page should look like this:

[Screenshot: Extractor Chain with ExtractorWebLab added]

Writers Chain

This chain of processors is responsible for writing the fetched contents and outlinks of a URL to file. The default writer, org.archive.crawler.writer.ARCWriterProcessor, does not write the outlinks of a URL to file, so let's replace it with HeritrixWeblab's custom writer, edu.cornell.tc.weblab.heritrix.writer.WeblabExperimentalWARCWriterProcessor. This processor writes the outlinks of a URL to file even if no page content was retrieved for it, which the default Heritrix writer does not support.

Post-Processor Chain

The Post-Processor chain includes processors that process every URL that passes the Pre-Processor Chain. In other words, even if there was an HTTP 404 error in fetching the contents of a URL, due to which the extractor chain is not run on it, the post-processor chain still acts on that URL. By default, it includes processors that conditionally add the outlinks of a URL to the frontier. The default processors are sufficient for the purpose of this crawl.

Statistics Processor

This processor is responsible for keeping track of the progress statistics of a crawl. It is optional but useful, so let's accept the default values.

Selecting Job Submodules

Click on the Submodules tab to arrive at the following page:

[Screenshot: Submodules page]

The Submodules page allows the addition of further parameters to the modules selected earlier. More specifically, notice that the very first section on this page allows us to select the Decide Rules to apply in the Deciding Scope we chose in the previous section. Remember, decide rules together determine whether a URL is within the scope of a crawl. A handful of useful decide rules are already included by default. You can find out more about the functionality of each by clicking on the '?' following each rule. For our purposes, we can try something simple. Let's choose a set of rules that accepts any URL in the crawl except those that are more than 'x' hops away from the crawl seed set. This can be done by removing all the default rules and adding the following ones from the drop down menu:

org.archive.crawler.deciderules.AcceptDecideRule

org.archive.crawler.deciderules.TooManyHopsDecideRule

Notice that the order of rules is important. If we were to flip this order, then the AcceptDecideRule would override any rejections made by TooManyHopsDecideRule. The page after these changes should look like this:

[Screenshot: Decide Rules after the changes]
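The effect of rule order described above can be illustrated with a toy model. The sketch below is not Heritrix code: the function decide and the rule names accept and toomanyhops are hypothetical stand-ins, and it assumes only what the text states, namely that rules are applied in sequence and a later rule's decision overrides an earlier one.

```shell
# Toy model of a decide-rule sequence (hypothetical, not the Heritrix API).
# Usage: decide HOPS MAX_HOPS "rule1 rule2 ..."
decide() {
  hops=$1
  max_hops=$2
  verdict=PASS
  for rule in $3; do
    case $rule in
      accept)
        # Stand-in for AcceptDecideRule: always accepts.
        verdict=ACCEPT
        ;;
      toomanyhops)
        # Stand-in for TooManyHopsDecideRule: rejects beyond max_hops,
        # otherwise leaves the earlier verdict untouched.
        if [ "$hops" -gt "$max_hops" ]; then
          verdict=REJECT
        fi
        ;;
    esac
  done
  echo "$verdict"
}

decide 5 3 "accept toomanyhops"   # prints REJECT
decide 5 3 "toomanyhops accept"   # prints ACCEPT
decide 2 3 "accept toomanyhops"   # prints ACCEPT
```

With the order used in this tutorial, a URL five hops from the seeds is rejected when the limit is three; with the order flipped, the same URL is accepted because the accept rule has the last word.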

 

If you scroll down the page, you will see further options for customizing the performance for each module. For example, you can see the default URL canonicalization rules that are applied on each URL by the frontier. For the purpose of this tutorial, the default values are acceptable for the remaining modules.

Refining Job Settings

 

Click on the Settings tab to dig deeper into customizing the crawl settings. The following page should now be visible:

 

[Screenshot: Settings page]

Most of the fields have acceptable default values. Let's customize a few as follows:

Mozilla/5.0 (compatible; heritrix/1.10.0 +http://www.infosci.cornell.edu/SIN/WebLab/).

Note that the from field should be set to the email address of the crawl operator.

Submitting the Job

The Refinements and Override views allow an even more detailed customization of the crawl job; however, that is beyond the scope of this tutorial. We are now ready to submit this job for crawling. We can do that by clicking the Submit Job tab near the top or bottom of the current page. This results in the following page:

[Screenshot: job created successfully]

Notice that the job we created has shown up under the Pending Jobs section of the Jobs page. We can create multiple crawls and submit them all to the Pending Jobs section before we start any of them. However, for now, let's stick with a single crawl.

Starting a Job

To start our newly created job, switch to the Console view of the HeritrixWeblab administration console by clicking on the Console tab. The following page should show:

 

[Screenshot: Console with the pending job highlighted]

The highlighting was added to show our newly created job. To start crawling, click on the Start link next to HOLDING JOBS. This should kick-start the crawling process. If you click on the Refresh link near the bottom of this page, you may see a progress bar like this:

[Screenshot: Console showing a running job]

Let the crawl run its course; it should complete in a few minutes. Once the crawl completes, switch to the Job view and you will see our finished job along with logs and reports generated for it.

 

[Screenshot: Jobs page showing the finished job]

Analyzing a Crawl

Heritrix provides a number of ways of analyzing a crawl. From the Jobs page, one can navigate to the logs generated by the crawl and view the final crawl reports. These reports and logs are largely self-explanatory, so they are not discussed here; consult the Heritrix User Manual for more details.

Inspecting the WARC file

The WeblabExperimentalWARCWriterProcessor added to the Writers chain generates WARC files containing the fetched page content and harvested outlinks. For the crawl we just ran, these WARC files are available as gzipped files in the directory $HERITRIX_HOME/jobs/tutorial-20061126003346885/warcs/. Note that the timestamp following the name tutorial will be different for each crawl.
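A quick way to look inside these files is to decompress them to the terminal. The sketch below assumes only that the files end in .gz, as stated above; the helper name warc_peek is hypothetical, and it makes no assumption about the layout of the records themselves.

```shell
# Hypothetical helper: print the first lines of every gzipped file in a
# directory. Usage: warc_peek DIRECTORY [LINE_COUNT]
warc_peek() {
  warc_dir=$1
  line_count=${2:-20}
  for warc_file in "$warc_dir"/*.gz; do
    # Skip the unexpanded pattern when the directory has no .gz files.
    [ -e "$warc_file" ] || continue
    echo "== $warc_file"
    # gzip -dc decompresses to standard output and leaves the file intact.
    gzip -dc "$warc_file" | head -n "$line_count"
  done
}

# Example (commented out; the job directory name differs for each crawl):
# warc_peek "$HERITRIX_HOME/jobs/tutorial-20061126003346885/warcs" 40
```

Because fetched page content may be binary, it is usually safer to look at a small number of lines first rather than dumping a whole file to the terminal.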

 

Known Bugs in HeritrixWeblab

The following is a list of known bugs in the HeritrixWeblab code:

 

[Screenshot: error]

How to Extend HeritrixWeblab

The section on installing HeritrixWeblab from source is a prerequisite to understanding how to extend HeritrixWeblab.

While HeritrixWeblab comes pre-packaged with a basic framework to allow crawling of The Web Lab databases, its true potential can be realized only by customizing it further to the end user's need. Any such extension will have to depend at the very least upon:

The resulting package's jar file can be dropped in the directory $HERITRIX_SRC_HOME/lib. Moreover, to have Heritrix recognize your additions, the appropriate files will have to be edited in the $HERITRIX_SRC_HOME/src/conf/ directory. For example, say you create a new fetcher called edu.univ.dept.heritrix.fetcher.FetchSpecial. You will have to add an entry for it in $HERITRIX_SRC_HOME/src/conf/modules/Processor.options and rebuild HeritrixWeblab as described here to have your additions reflected.
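Adding the entry can be scripted. The sketch below rests on an assumption that should be verified against the existing lines of Processor.options: that each entry is a single line of the form "fully.qualified.ClassName|Display name". The helper name add_processor_option and the label FetchSpecial are hypothetical.

```shell
# Hypothetical helper: append a "class|label" entry to an options file
# unless an entry for that class is already present.
# Usage: add_processor_option FILE CLASS LABEL
add_processor_option() {
  options_file=$1
  class_name=$2
  label=$3
  # -F matches the class name literally; the trailing "|" avoids matching
  # a longer class name that merely starts with the same characters.
  if grep -qF "$class_name|" "$options_file"; then
    echo "entry for $class_name already present"
  else
    printf '%s|%s\n' "$class_name" "$label" >> "$options_file"
  fi
}

# Example (commented out so the snippet has no side effects when sourced):
# add_processor_option "$HERITRIX_SRC_HOME/src/conf/modules/Processor.options" \
#   edu.univ.dept.heritrix.fetcher.FetchSpecial FetchSpecial
```

The check before appending makes the helper safe to run more than once, which is convenient when the build is repeated.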

Appendix A - Removing Modifications from Wayback Machine Replies

As noted above, the fetcher edu.cornell.tc.weblab.heritrix.fetcher.FetchWayBackMachine fetches the content of a URL from the Internet Archive's Wayback Machine if the URL was found in the Weblab database but its corresponding content was not. The Wayback Machine makes a number of modifications to the actual content of the URL that are relevant to its domain. Since a researcher would probably like to have access to the original content of a page, the FetchWayBackMachine processor removes the following modifications made by the Wayback Machine: