The Web Laboratory: GetPages Tool Documentation
The GetPages tool is available here.
There is a separate version of this tool for each database. To use the tool, select a database and go to the tools page for that database. Note that you must register and be approved for an account before you can use it.
Overview of the GetPages Tool
This tool allows you to retrieve a subset of the data present in the WebLab database. You can place restrictions on each field, and only data that satisfies them is retrieved. The retrieved information can either be viewed online or downloaded as a WebLab TSV file. Below is a step-by-step explanation of how to use the tool.
I) Select by various restrictions
Restrictions can be placed on the data to filter and limit what is returned. Checking the NOT LIKE checkbox inverts a restriction, so that data matching it is excluded from the results.
Step 1: Select By Collection
This is the primary option. It selects the Collection that your query runs on, either WebLibraryAmazon or WebLibraryCornell. The default is WebLibraryAmazon and can be changed.
The Crawl ID is a unique ID corresponding to each crawl in the database. For more advanced queries, a specific crawl can be selected from the combo box. The default value for this option is ‘All crawls’.
ArchiveTime: This is the time at which the web crawler archived the page. You can place restrictions on the minimum and maximum date in the data. A blank restriction is treated as no restriction. The format of the date is MONTH/DAY/YEAR HOURS:MINUTES:SECONDS, for example 02/14/2005 06:30:22.
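As a minimal sketch, the date format above corresponds to the Python format string shown below, which can be used to prepare or check values before typing them into the form. The helper names are hypothetical and not part of the tool, and a 24-hour clock is assumed because the documented format shows no AM/PM marker.

```python
from datetime import datetime

# MONTH/DAY/YEAR HOURS:MINUTES:SECONDS (24-hour clock is an assumption)
ARCHIVE_TIME_FORMAT = "%m/%d/%Y %H:%M:%S"


def parse_archive_time(text):
    """Parse an ArchiveTime restriction; a blank value means no restriction."""
    text = text.strip()
    if not text:
        return None
    return datetime.strptime(text, ARCHIVE_TIME_FORMAT)


def format_archive_time(moment):
    """Format a datetime the way the ArchiveTime fields expect it."""
    return moment.strftime(ARCHIVE_TIME_FORMAT)
```

A malformed value such as 2005-02-14 makes `parse_archive_time` raise `ValueError`, which is a convenient way to catch typos before submitting a query.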
Step 2: Select By URL
The URL of every indexed page in the collection consists of the URL Protocol, Host, Port, Path and Extension. The URL is split like this to make querying more efficient. You can place restrictions on each of the five fields individually. Restrictions are entered in plain text.
You can use the following wild cards:
- % matches any string
- _ matches any single character
- [a-c] matches any character a, b or c
- [^a-c] matches any character other than a, b or c
If you wish to match one of the special characters " % _ or [ use the following:
- \" to match "
- [%] to match %
- [_] to match _
- [[] to match [
For example, "%.edu" AND NOT "%cornell.edu" in the Host field would match all pages from a domain ending in .edu, but that are not from the cornell.edu domain.
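To try out a pattern before running a query, the wild cards listed above can be approximated locally. The sketch below translates a pattern into a Python regular expression. It is an illustration only: the function names are hypothetical, it assumes the wild cards behave like those of the SQL LIKE operator in Microsoft SQL Server, it compares case-sensitively (the database may not), and it does not handle the surrounding double quotes or the AND NOT syntax of the form fields.

```python
import re


def like_to_regex(pattern):
    """Translate a wild card pattern into an anchored Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%":
            parts.append(".*")          # % matches any string
        elif ch == "_":
            parts.append(".")           # _ matches any single character
        elif ch == "[":
            end = pattern.find("]", i + 1)
            body = pattern[i + 1:end] if end != -1 else ""
            negate = body.startswith("^")
            if negate:
                body = body[1:]
            if not body:
                parts.append(re.escape("["))  # no usable class: literal [
            else:
                # Escape characters that are special inside a regex class.
                body = body.replace("\\", "\\\\").replace("[", "\\[")
                parts.append("[" + ("^" if negate else "") + body + "]")
                i = end                 # skip past the closing ]
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def like_match(pattern, value):
    """Return True if the whole value matches the wild card pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None


def host_is_edu_but_not_cornell(host):
    """The Host example above: "%.edu" AND NOT "%cornell.edu"."""
    return like_match("%.edu", host) and not like_match("%cornell.edu", host)
```

Because [%], [_] and [[] are ordinary one-character classes, the escapes for special characters fall out of the same bracket handling.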
URL Protocol: This is the transfer protocol used to serve the page, for example http, ftp, ssh, etc. Wild cards (explained above) can be used in this field.
URL Host: This is the hostname of the webpage, for example www.cornell.edu. Wild cards (explained above) can be used in this field.
URL Port: This is the port number, for example 80. Wild cards (explained above) can be used in this field.
URL Path: This is the path of the webpage following the hostname. For example, the page http://www.cs.cornell.edu/courses/ has the hostname www.cs.cornell.edu and the URL path /courses. Wild cards can be used in this field.
URL Extension: This is the URL suffix, which indicates the type of page, for example .htm.
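To see how a full URL maps onto the five fields, the split can be sketched with the Python standard library. This only illustrates the idea and is not the database's actual splitting code: the function name is hypothetical, filling in a well-known default port when the URL has none is an assumption, and so is taking the extension from the last path segment.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed defaults for URLs that do not name a port explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def split_url(url):
    """Split a URL into protocol, host, port, path and extension."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    # splitext only looks at the last path segment, so a dot in a
    # directory name is not mistaken for an extension.
    extension = posixpath.splitext(parts.path)[1]
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": port,
        "path": parts.path,
        "extension": extension,
    }
```

Note that `urlsplit` keeps the trailing slash, so the example URL above yields the path /courses/ here, whereas this documentation gives /courses. If a path restriction returns nothing, trying it with a trailing % is a reasonable first check.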
Step 3: Select By Page Meta Data
On each crawl, the crawler saves the Meta Data of every page that it crawls. You can restrict the kind of pages you want to retrieve by specifying this Meta Data:
Document Title: This is the title of the webpage.
MimeType: Indicates the Internet media type of the message content for each page, consisting of a type and subtype, for example text/plain. The different MimeTypes can be picked from the combo box.
Language: The language of the webpage.
Step 4: Select By IDs
This is an advanced option. Use it only when you need to perform a very specific query and already know the IDs in advance.
PageID, UrlID, HostID
These are identifiers used inside the database. They are 128-bit binary numbers. They are encoded into a textual form using Base64 encoding. To place restrictions on IDs, type in (or cut and paste from the results) one or more IDs into the corresponding text field, separated by a space or comma.
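A 128-bit number is 16 bytes, which standard Base64 encodes as 24 characters ending in two = padding characters. The sketch below shows the round trip and how a field holding several IDs could be split. The function names are hypothetical, and the standard Base64 alphabet (rather than a URL-safe variant) is an assumption about the tool.

```python
import base64
import re


def encode_id(raw):
    """Encode a 16-byte (128-bit) identifier as Base64 text."""
    if len(raw) != 16:
        raise ValueError("expected 16 bytes (128 bits)")
    return base64.b64encode(raw).decode("ascii")


def decode_id(text):
    """Decode a Base64 ID and check that it is exactly 128 bits long."""
    raw = base64.b64decode(text, validate=True)
    if len(raw) != 16:
        raise ValueError("expected a 128-bit identifier")
    return raw


def parse_id_list(field):
    """Split a field of IDs separated by spaces or commas and decode each."""
    tokens = [tok for tok in re.split(r"[,\s]+", field.strip()) if tok]
    return [decode_id(tok) for tok in tokens]
```

Checking the length after decoding is a quick way to notice an ID that was truncated while being copied from the results page.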
IPAddress
This is the IP Address of the server from which the page was retrieved.
II) Save, Load or View the query
Once you have entered all the restrictions, you can save your work to a file. This is useful if you want to run the query again later, or if you want to later use a different tool on the data you selected here. You can also load a previously-saved query.
The last button on the page lets you see the real SQL query generated from your choices. This is useful for debugging and optimizing.
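To give a feel for what such a query might look like, the sketch below turns a list of field restrictions into a parameterized SQL statement using LIKE and NOT LIKE. It is purely illustrative: the table name Pages, the column name UrlHost and the function name are hypothetical, and the SQL that the tool really generates will differ. Skipping blank restrictions is documented above only for ArchiveTime and is assumed here for all fields.

```python
def build_query(restrictions, table="Pages"):
    """Build a parameterized query from (column, pattern, negate) tuples.

    Table and column names are hypothetical placeholders.
    """
    clauses = []
    params = []
    for column, pattern, negate in restrictions:
        if not pattern:
            continue  # a blank restriction is treated as no restriction
        if not column.isidentifier():
            raise ValueError("unexpected column name: " + repr(column))
        operator = "NOT LIKE" if negate else "LIKE"
        clauses.append(f"{column} {operator} ?")
        params.append(pattern)
    sql = f"SELECT * FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# The Host example from Step 2: "%.edu" AND NOT "%cornell.edu"
sql, params = build_query([
    ("UrlHost", "%.edu", False),
    ("UrlHost", "%cornell.edu", True),
])
```

Passing the patterns as parameters instead of pasting them into the SQL string keeps special characters in a restriction from changing the structure of the query.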
