Heritrix for the Web Lab: a Tutorial
Introduction to HeritrixWeblab
HeritrixWeblab is a customized version of Heritrix, an open-source, industrial-strength Java crawler developed and maintained by the Internet Archive. The customizations in HeritrixWeblab allow Heritrix to crawl the web as archived in the Web Lab databases. Moreover, if the contents of a page are not available at the Web Lab, HeritrixWeblab can fetch the missing content through the Internet Archive's Wayback Machine.
HeritrixWeblab was built on top of Heritrix version 1.10.0. Thus, the term HeritrixWeblab refers to the collection of this core of Heritrix and customizations added on top of it through the packages heritrixWeblab 1.0 and weblabWS 1.0.1.
In essence, HeritrixWeblab is a tool designed to allow researchers from various academic fields to gain access to the valuable data being collected by The Web Lab at Cornell University.
How to install HeritrixWeblab
There are two types of HeritrixWeblab installations, each catering to a different set of users. Common prerequisites to these different types of installations are:
- The underlying Heritrix code was built and tested on the Linux operating system, and this constraint carries over to HeritrixWeblab. The uname -a output of the development machine is: Linux lion.cs.cornell.edu 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 EDT 2006 i686 i686 i386 GNU/Linux
- HeritrixWeblab requires that Java 5 be installed on the target machine and that the following variables be set:
$JAVA_HOME = path to the Java installation directory.
$JAVA_OPTS="-Xmx512M"
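As a minimal sketch, the two variables could be set in a Bourne-style shell as shown below. The Java path is only an assumed example location, not one named by this tutorial; substitute the directory of your own Java 5 installation.

```shell
# Assumed example path; replace with your own Java 5 installation directory.
export JAVA_HOME="${JAVA_HOME:-/usr/java/jdk1.5.0}"
# Maximum JVM heap size, as required by the prerequisites above.
export JAVA_OPTS="-Xmx512M"
echo "JAVA_HOME=$JAVA_HOME"
echo "JAVA_OPTS=$JAVA_OPTS"
```

Placing these lines in a shell startup file would make them available to every later command in this tutorial.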
Binary Installation
This section is geared toward users who simply want a binary version of Heritrix along with the customizations made for the Web Lab. The binary version of HeritrixWeblab can be downloaded from here. Simply untar the downloaded file into your desired directory and set the variable $HERITRIX_HOME to this installation directory.
Source Installation
This type of installation is for users who want to build HeritrixWeblab from source, possibly to extend its capabilities. The source for HeritrixWeblab, heritrix-1.10.0, can be downloaded from here. It already contains the libraries and modifications necessary to run the Web Lab customizations. However, you can also download the source of the customization packages, namely weblabWS-1.0.1 and heritrixWeblab-1.0, from the same place. This document uses the variable $HERITRIX_SRC_HOME to point to the installation directory of the source package heritrix-1.10.0. To build HeritrixWeblab, you will need to install the following additional software:
Apache Maven 1.0.2 (Declare a variable $MAVEN_102_HOME pointing to its installation directory)
Building Web Lab Heritrix Customization Packages (Optional)
Typically, this step can be skipped unless one wants to change the code written for customization of Heritrix core. Download the two Web Lab customization packages weblabWS-1.0.1 and heritrixWeblab-1.0 as mentioned above. Also install the following version of Maven, in addition to the version installed above.
Apache Maven 2.0.4 (Declare a variable $MAVEN_204_HOME pointing to its installation directory)
Why is a separate version of Maven needed? Because the core Heritrix code was developed using the older Maven 1.0.2 release, while the customizations were developed using Maven 2.
Now run the following commands to build and install these packages.
cd /path/to/weblabWS-1.0.1
$MAVEN_204_HOME/bin/mvn install
cd /path/to/heritrixWeblab-1.0
$MAVEN_204_HOME/bin/mvn install
cp /path/to/weblabWS-1.0.1/target/weblabWS-1.0.1.jar $HERITRIX_SRC_HOME/lib
cp /path/to/heritrixWeblab-1.0/target/heritrixWeblab-1.0.jar $HERITRIX_SRC_HOME/lib
Building Heritrix (Required)
Execute the following commands to build Heritrix:
cd $HERITRIX_SRC_HOME
$MAVEN_102_HOME/bin/maven dist
The build process will now start. Go grab a cup of tea; this process can take up to 20 minutes. When it completes, the directory $HERITRIX_SRC_HOME/target/heritrix-1.10.0 will have been created. Set the variable $HERITRIX_HOME to point to this directory.
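Because a source build depends on several environment variables, a quick check before running the build commands above can save a failed 20-minute run. The helper below is a hypothetical convenience, not part of HeritrixWeblab; check_vars is a name introduced here for illustration.

```shell
# check_vars NAME...
# Prints one line per variable that is unset or empty.
# Returns 0 when all named variables are set, 1 otherwise.
check_vars() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: \$$name"
      missing=1
    fi
  done
  return "$missing"
}

# Variables this tutorial asks for before a source build.
check_vars JAVA_HOME MAVEN_102_HOME HERITRIX_SRC_HOME ||
  echo "set the variables listed above before building"
```

If the optional customization packages are being rebuilt as well, MAVEN_204_HOME can be added to the list.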
How to Crawl the Weblab Using HeritrixWeblab
This section of the tutorial assumes that you have followed the directions in the installation section, and installed HeritrixWeblab on your machine.
Starting HeritrixWeblab
To start HeritrixWeblab, enter the following command in a shell:
$HERITRIX_HOME/bin/heritrix --admin=admin:letmein
where:
$HERITRIX_HOME = Path to HeritrixWeblab installation directory.
admin:letmein = The username and password for this installation which default to the shown values.
After a few moments, you should see the following output in your shell:
Sat Nov 25 17:53:17 EST 2006 Starting heritrix.....
Heritrix 1.10.0 is running.
Web console is at: http://127.0.0.1:8080
Web console login and password: admin/letmein
This output indicates that HeritrixWeblab was successfully started on port 8080 of your local machine.
Changing HeritrixWeblab Port
By default, HeritrixWeblab starts on port 8080. If the default port is not acceptable, it can be changed by editing the heritrix.cmdline.port property in $HERITRIX_HOME/conf/heritrix.properties.
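The edit described above can also be scripted. The following is a sketch only: set_heritrix_port is a hypothetical helper, and it assumes the property appears in heritrix.properties as a single "key = value" line, which you should confirm in your own copy of the file.

```shell
# set_heritrix_port FILE PORT
# Rewrites the value of heritrix.cmdline.port in FILE; other lines are untouched.
set_heritrix_port() {
  file=$1
  port=$2
  sed "s/^\(heritrix\.cmdline\.port[[:space:]]*=[[:space:]]*\).*/\1$port/" "$file" > "$file.tmp" &&
    mv "$file.tmp" "$file"
}

# Demonstration on a scratch file (assumed key = value layout).
demo=$(mktemp)
printf 'heritrix.cmdline.port = 8080\n' > "$demo"
set_heritrix_port "$demo" 9090
cat "$demo"   # prints: heritrix.cmdline.port = 9090
rm -f "$demo"
```

Against a real installation the call would be set_heritrix_port "$HERITRIX_HOME/conf/heritrix.properties" 9090, presumably followed by a restart of HeritrixWeblab so that the new value is read.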
Interacting with HeritrixWeblab Web Interface
With HeritrixWeblab successfully started, you can point your browser to the URL http://localhost:8080/login.jsp to see the HeritrixWeblab web interface. The following page should be visible to you:
Enter the appropriate user name and password in these fields. They are the same as those supplied when starting HeritrixWeblab in the previous step. Click on Login. This should open the HeritrixWeblab administration console:
Administration Console
The HeritrixWeblab administration console is divided into the following views. A detailed description of each of these views can be obtained from the Heritrix User Manual.
Console
The console view contains information about the currently running crawl. It lists basic information such as the number of URLs fetched, total memory consumption and total time used. It also shows a progress bar for a running crawl that shows the progress in terms of the currently scheduled URLs in the crawl frontier.
Jobs
A Job in Heritrix is synonymous with a crawl. The Jobs view lets one create and manage jobs.
Profile
Jobs can be created based upon Profiles. A profile allows one to set defaults (e.g. the seed list, link extraction processors, etc.) that apply to all jobs based on that profile. The Profile view lets one create and manage profiles.
Logs
Each crawl job may generate log files which can be viewed in the Logs view.
Report
The Report views contain high level reports on the status of a crawl. These include reports on fetching of seeds, progress statistics etc.
Settings
The Settings view lets one fine tune the parameters set for a given crawl on a per host basis.
Help
This view is rather self explanatory.
Creating your first Job
To create your first crawl job, click on the Jobs view in the Administration Console. As a result, you should see the following web page:
The Create New Job section offers various ways of creating a new crawl job. Pending Jobs, on the other hand, lists jobs that have been created but not started. Lastly, Completed Jobs lists jobs that have been completed along with various options for each.
A description of the various ways of creating a new job can be found in the Crawl Job section of the Heritrix user manual. For now, let's focus on creating a job from scratch by clicking on the With defaults link. The following page should become visible:
The name of a crawl and its description together define a human-readable interface to a crawl job. The Seeds text box takes the crawl seed URLs, one per line. For the purpose of this tutorial, let's choose the following values:
Selecting Job Modules
Now click on the Modules link to further refine this job. The following page shows up:
This page shows a list of processors that can be applied to each URL during the course of a crawl. The major modules visible include:
Crawl Scope
Crawl Scope defines the conditions under which a URL is considered within the scope of a crawl. A URL is within the crawl scope if it should be fetched for that crawl. By default, HeritrixWeblab chooses org.archive.crawler.deciderules.DecidingScope, which makes the scoping decisions based on a list of rules to be discussed under 'Decide Rules'. For the purpose of this job, let's keep the default choice of HeritrixWeblab. If we wanted to override this default, we could choose a substitute from the drop down menu associated with the Crawl Scope module and click on Change.
Frontier
The Frontier module provides a persistent way of keeping track of the URLs scheduled to be visited in the crawl. Initially this module contains only the seeds of the crawl, but as the crawl proceeds newly discovered URLs are added to this module. HeritrixWeblab, by default, chooses org.archive.crawler.frontier.BdbFrontier which is a reasonable choice. The default could be overridden using the drop down menu of the Frontier.
Processor Chains
Heritrix employs the concept of processor chains, each containing a list of processors that perform the same logical operation in distinct ways. The major processor chains in Heritrix are:
Pre-Processor Chain
Processors in this chain handle a URL before its contents are fetched. They can be thought of as performing last-minute checks to ensure that a URL is within the crawl scope and that its contents can be fetched. The default Heritrix processors in this chain include org.archive.crawler.prefetch.Preselector and org.archive.crawler.prefetch.PreconditionEnforcer. A description of these processors can be obtained by clicking the Info link next to their entries in the Pre-Processor chain. HeritrixWeblab does not need the PreconditionEnforcer: page contents are fetched from the Web Lab web service, so the DNS fetches that PreconditionEnforcer enforces on the crawl are unnecessary. In fact, our crawl cannot proceed unless this processor is removed, so go ahead and remove it from the Pre-Processor chain by clicking on Remove next to its name.
Fetch Chain
This chain of processors is responsible for fetching a URL's content. By default, Heritrix adds org.archive.crawler.fetcher.FetchDNS and org.archive.crawler.fetcher.FetchHTTP to this chain; the former performs DNS lookups, while the latter fetches page content from the World Wide Web (WWW). Since at the Web Lab we are primarily interested in the URLs stored in our databases, we do not need to fetch page content from the web. Remove FetchHTTP from the Fetch Chain by clicking Remove next to it. With FetchHTTP removed, let's add the HeritrixWeblab custom fetcher to fetch page content from the Web Lab. From the drop down menu associated with the Fetch Chain, choose edu.cornell.tc.weblab.heritrix.fetcher.FetchWebLab and click on Add. This should result in the following page.
The Up and Down buttons allow one to change the position of a processor within a processor chain. The higher a processor is in the chain, the earlier it gets a chance to process a URL. There is another custom HeritrixWeblab fetcher that may be of interest to users. If a URL was found in the Weblab database but its corresponding content was not, the content can be fetched from the Internet Archive's Wayback Machine. To see how this works, let's add the processor edu.cornell.tc.weblab.heritrix.fetcher.FetchWayBackMachine to the Fetch processor chain by selecting it from the drop down menu and clicking on Add. Notice that this new processor gets added to the end of the Fetch processor chain. In this case, the order of processors is very important, because the FetchWayBackMachine processor only fetches pages whose URLs were found in the Weblab databases. FetchWebLab is the processor that determines whether a URL exists in the Weblab database, so it must precede FetchWayBackMachine for the latter to perform its task.
Extractor Chain
This chain of processors is responsible for extracting outlinks from the content fetched for each page. By default, HeritrixWeblab adds a handful of extractors, each of which is designed to extract links from a different type of content. You can get details on each by clicking on Info next to the processor of interest. In the case of Weblab, the outlinks of a page are already stored in the Weblab database, so we are not interested in harvesting them out of fetched page content. Therefore, let's remove the default extractors, one at a time, by clicking on Remove, and add the custom HeritrixWeblab extractor edu.cornell.tc.weblab.heritrix.extractor.ExtractorWebLab from the drop down menu. The resulting page should look like this:
Writers Chain
This chain of processors is responsible for writing the fetched contents and outlinks of a URL to file. The default writer, org.archive.crawler.writer.ARCWriterProcessor, does not write the outlinks of a URL to file, so let's replace it with HeritrixWeblab's custom writer: edu.cornell.tc.weblab.heritrix.writer.WeblabExperimentalWARCWriterProcessor. This processor writes the outlinks of a URL to file even if no page content was retrieved for it, which the default Heritrix writer does not support.
Post-Processor Chain
The Post-Processor chain includes processors that process every URL that passes the Pre-Processor Chain. In other words, even if there was an HTTP 404 error in fetching the contents of a URL, due to which the extractor chain is not run on it, the post-processor chain still acts on that URL. By default, it includes processors that conditionally add the outlinks of a URL to the frontier. The default processors are sufficient for the purpose of this crawl.
Statistics Processor
This processor is responsible for keeping track of the progress statistics of a crawl. It is optional but useful, so let's accept the default values.
Selecting Job Submodules
Click on the Submodules tab to arrive upon the following page:
The Submodules page allows the addition of further parameters to the modules selected earlier. More specifically, notice that the very first section on this page allows us to select the Decide Rules to apply in the Deciding Scope we chose in the previous section. Remember, the decide rules together determine whether a URL is within the scope of a crawl. A handful of useful decide rules are already included by default. You can find out more about the functionality of each by clicking on the '?' following each rule. For our purposes, we can try something simple. Let's choose a set of rules that accepts any URL in the crawl except those that are more than 'x' hops away from the crawl seed set. This can be done by removing all the default rules and adding the following ones, in this order, from the drop down menu:
org.archive.crawler.deciderules.AcceptDecideRule
org.archive.crawler.deciderules.TooManyHopsDecideRule
Notice that the order of rules is important. If we were to flip this order, then the AcceptDecideRule would override any rejections made by TooManyHopsDecideRule. The page after these changes should look like:
If you scroll down the page, you will see further options for customizing the performance for each module. For example, you can see the default URL canonicalization rules that are applied on each URL by the frontier. For the purpose of this tutorial, the default values are acceptable for the remaining modules.
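The effect of rule order described above can be illustrated with a small simulation. This is not Heritrix code: it is a sketch that assumes rules are applied in sequence and that the last rule to express an opinion wins, which is the behavior the text above implies. The decide function and its arguments are hypothetical.

```shell
# decide HOPS MAX_HOPS RULE...
# Applies the named rules in order and prints the final decision.
decide() {
  hops=$1
  max_hops=$2
  shift 2
  decision=PASS
  for rule in "$@"; do
    case "$rule" in
      AcceptDecideRule)
        decision=ACCEPT ;;                # accepts everything
      TooManyHopsDecideRule)
        if [ "$hops" -gt "$max_hops" ]; then
          decision=REJECT                 # only speaks up when too far from the seeds
        fi ;;
    esac
  done
  echo "$decision"
}

# A URL three hops from a seed, with max-hops set to two.
decide 3 2 AcceptDecideRule TooManyHopsDecideRule   # prints REJECT
decide 3 2 TooManyHopsDecideRule AcceptDecideRule   # prints ACCEPT
```

With the rules flipped, the rejection is overwritten and the distant URL is crawled anyway, which is why AcceptDecideRule must come first.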
Refining Job Settings
Click on the Settings tab to dig deeper into customizing the crawl settings. The following page should now be visible:
Most of the fields have acceptable default values. Let's customize a few as follows:
- Change the name of the Crawling Organization. In the case of Web Lab, this can be Cornell University.
- Reduce max-toe-threads to five. While it is fine to have the default 50 toe threads, we will soon come across a situation where it is important to have a small number of threads.
- Next, customize the RejectIfTooManyHops decide rule by specifying the max-hops property, which defines the number of hops from a seed that counts as too many. Let's set it to two hops.
- Change the default values of the HTTP headers user-agent and from to reflect the person creating the crawl. For Weblab crawlers, we can use the following user-agent:
Mozilla/5.0 (compatible; heritrix/1.10.0 +http://www.infosci.cornell.edu/SIN/WebLab/)
Note that the from field should be set to the email address of the crawl operator.
- For the WebLabFetcher processor, four fields are available. A description of each can be obtained by clicking on the "?" following it. For now, let's accept the defaults. Note that if the analyze-page-content field is set to false, the WebLabFetcher does not fetch page content; it only fetches the outlinks of the URL in question.
- The InternetArchiveFetcher similarly has a number of fields whose values can be altered. For now, let's accept the defaults. Notice that there is no analyze-page-content field in InternetArchiveFetcher: if this fetcher is added, it is assumed that the user is interested in page content, since otherwise there is no point in adding it. Internally, the InternetArchiveFetcher changes every URL to point to the Internet Archive Wayback Machine version of that URL, and switches the URL back to its original form once it is done with it. In other words, the InternetArchiveFetcher only knows of one host, the Wayback Machine host. The crawl frontier, which is responsible for maintaining the crawl politeness policy, does not know of this internal switch, so it cannot enforce the Heritrix politeness policy when the InternetArchiveFetcher is used. In this case, it is important to limit the max-toe-threads of the crawl to around 5 so that the Wayback Machine host does not deem our crawler impolite.
Submitting the Job
The Refinements and Override views allow even more detailed customization of the crawl job, but that is beyond the scope of this tutorial. We are now ready to submit this job for crawling. We can do that by clicking the Submit Job tab near the top or bottom of the current page. This results in the following page:
Notice that the job we created has shown up under the Pending Jobs section of the Jobs page. We can create multiple crawls and submit them all to the Pending Jobs section before we start any of them. For now, however, let's stick with a single crawl.
Starting a Job
To start our newly created job, switch to the Console view of the HeritrixWeblab administration console by clicking on the Console tab. The following page should show:
Here, the highlighting was added to show our newly created job. To start crawling, click on the Start link next to HOLDING JOBS. This should kick-start the crawling process. If you click on the Refresh link near the bottom of this page, you may see a progress bar like:
Let the crawl run its course; it should complete in a few minutes. Once the crawl completes, switch to the Job view and you will see our finished job along with logs and reports generated for it.
Analyzing a Crawl
Heritrix provides a number of ways of analyzing a crawl. From the Jobs page, one can navigate to the logs generated by the crawl. One can also view the final crawl reports. These reports and logs are quite self-explanatory, so they will not be discussed here. One can consult the Heritrix User Manual for more details.
Inspecting the WARC file
The WeblabExperimentalWARCWriterProcessor added in the Writer processor chain generates WARC files of the fetched page content and farmed outlinks. For the crawl we just ran, these WARC files are available as gzipped files in the directory $HERITRIX_HOME/jobs/tutorial-20061126003346885/warcs/. Note that the timestamp following the name tutorial will be different for each crawl.
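For a first look inside these files, decompressing to standard output and reading the opening lines is usually enough. The sketch below makes two assumptions that are not stated in this tutorial: that the files end in .gz, and that their headers are readable text (the record payloads may be binary). peek_warc is a hypothetical helper name.

```shell
# peek_warc FILE [LINES]
# Decompresses a gzipped WARC file to stdout and shows its first LINES lines (default 20).
peek_warc() {
  gunzip -c "$1" | head -n "${2:-20}"
}

# Peek into every gzipped file written by jobs named "tutorial".
for f in "${HERITRIX_HOME:-.}"/jobs/tutorial-*/warcs/*.gz; do
  [ -e "$f" ] || continue   # the glob matched nothing
  echo "== $f"
  peek_warc "$f" 10
done
```

The wildcard in tutorial-* stands in for the per-crawl timestamp, so the same loop works for any run of the tutorial job.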
Known Bugs in HeritrixWeblab
The following is a list of known bugs in the HeritrixWeblab code:
- edu.cornell.tc.weblab.heritrix.fetcher.FetchWayBackMachine is an extension of the FetchHTTP class provided by Heritrix. FetchHTTP is designed to fetch the raw contents of a URL only once, which makes cleaning the fetched content of Wayback Machine modifications rather difficult. Currently, FetchWayBackMachine is able to clean such modifications, but it is not yet able to update the size of the contents. In other words, even after the contents of a URL have been cleaned, which may reduce their size, a call to CrawlURI.getContentLength() will return the pre-cleaning length.
- If you visit this page and click around on the links, chances are you will encounter an error page where the Wayback Machine was not able to retrieve the requested page. The error page is not a 404; rather, it represents a successful fetch. The error page comes in various flavors, one of which is given below. The point, though, is that if you simply go back and retry the link, a normal fetch is likely to happen. Currently, HeritrixWeblab does not retry such fetches; it treats them as successful.
- Lastly, for some URLs, the crawler can never fetch the contents. Every time it tries to fetch such a URL, it encounters an error and reschedules the URL for the next fetch cycle. As a result, the crawl does not terminate until the maximum number of retries, specified through the max-retries property of the frontier, is exhausted. This is a serious drawback of the current implementation; no solution has been found yet. In the process of debugging, the following insights were gained:
  - A given URL either always throws the error in a fetch cycle or never throws it.
  - The error is thrown in the FetchWayBackMachine processor; the problem does not appear if only FetchWebLab is used.
  - FetchWayBackMachine is an extension of the Heritrix class FetchHTTP and delegates the grunt work of downloading contents to it. The only change it makes to the actual process of fetching is to enable FetchHTTP to follow redirects, which it does not do by default. The redirects are necessary because the Wayback Machine always returns a redirect. The error comes from the FetchHTTP part of the code and arises because Heritrix uses un-resettable streams for recording the fetched content. While fetching the culprit URLs, one of Heritrix's third-party dependencies tries to reset the recording input stream, which causes an IOException.
  - Even if FetchWayBackMachine is replaced by an extension of FetchHTTP that does nothing more than enable the following of redirects, the error still shows up. This shows that the error is rooted in the particular way FetchHTTP follows redirects.
  - If the following of redirects is turned off, the error does not show up, and a standard redirection page is downloaded instead of the actual page content.
How to Extend HeritrixWeblab
The section on installing HeritrixWeblab from source is a prerequisite to understanding how to extend HeritrixWeblab.
While HeritrixWeblab comes pre-packaged with a basic framework to allow crawling of The Web Lab databases, its true potential can be realized only by customizing it further to the end user's need. Any such extension will have to depend at the very least upon:
- heritrix-1.10.0 (and its dependencies)
- heritrixWeblab-1.0 (and its dependencies)
- weblabWS-1.0.1 (and its dependencies)
The resulting package's jar file can be dropped in the directory $HERITRIX_SRC_HOME/lib. Moreover, to have Heritrix recognize your additions, the appropriate files in the $HERITRIX_SRC_HOME/src/conf/ directory will have to be edited. For example, say you create a new fetcher called edu.univ.dept.heritrix.fetcher.FetchSpecial. You will have to add an entry for it in $HERITRIX_SRC_HOME/src/conf/modules/Processor.options and rebuild HeritrixWeblab, as described here, to have your additions reflected.
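Registering the new fetcher could be scripted roughly as follows. This sketch assumes that each line of Processor.options has the form "fully.qualified.ClassName|DisplayName"; that format is an assumption here, so check the existing entries in your copy of the file and mirror them. register_processor is a hypothetical helper, and the scratch file stands in for the real options file.

```shell
# register_processor OPTIONS_FILE CLASS_NAME DISPLAY_NAME
# Appends "CLASS_NAME|DISPLAY_NAME" unless an entry for CLASS_NAME already exists.
register_processor() {
  file=$1
  class=$2
  label=$3
  if grep -q -F "$class|" "$file"; then
    echo "already registered: $class"
  else
    printf '%s|%s\n' "$class" "$label" >> "$file"
  fi
}

# Demonstration on a scratch file (assumed "class|name" line format).
demo=$(mktemp)
printf 'org.archive.crawler.fetcher.FetchHTTP|HTTP\n' > "$demo"
register_processor "$demo" edu.univ.dept.heritrix.fetcher.FetchSpecial FetchSpecial
cat "$demo"
rm -f "$demo"
```

For a real build, the first argument would be "$HERITRIX_SRC_HOME/src/conf/modules/Processor.options", and the rebuild step is still required afterwards.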
Appendix A - Removing Modifications from Wayback Machine Replies
As noted above, the fetcher edu.cornell.tc.weblab.heritrix.fetcher.FetchWayBackMachine fetches the content of a URL from the Internet Archive's Wayback Machine if the URL was found in the Weblab database but its corresponding content was not. The Wayback Machine makes a number of modifications to the actual content of the URL that are relevant to its own domain. Since a researcher would probably like to have access to the original content of a page, the FetchWayBackMachine processor removes the following modifications made by the Wayback Machine:
- Rewriting of links to point back to Internet Archive. For example, a URL like http://www.cnn.com may have been modified as http://web.archive.org/web/20060830104708/http://www.cnn.com or as http://web.archive.org/web/20060830104708js_/http://www.cnn.com.
- Insertion of comments indicating that some scripts may have been added by the Internet Archive to page content.
- Insertion of scripts to redirect links to Internet Archive.
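To make the first kind of modification concrete, the sketch below undoes the link rewriting for the two URL shapes shown above. It is an illustration only and not the cleaning code used by FetchWayBackMachine; unwrap_wayback_url is a hypothetical name, and the pattern assumes a numeric timestamp optionally followed by a short lowercase suffix such as js_.

```shell
# unwrap_wayback_url URL
# Strips a leading Wayback Machine prefix; other URLs are printed unchanged.
unwrap_wayback_url() {
  printf '%s\n' "$1" |
    sed -E 's#^http://web\.archive\.org/web/[0-9]+[a-z_]*/##'
}

unwrap_wayback_url 'http://web.archive.org/web/20060830104708/http://www.cnn.com'      # prints http://www.cnn.com
unwrap_wayback_url 'http://web.archive.org/web/20060830104708js_/http://www.cnn.com'   # prints http://www.cnn.com
```

Applying the same substitution to every link in a fetched page would approximate the first cleaning step; the inserted comments and scripts need separate handling.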
