The Web Laboratory: Crawl Sizes Statistics
This is a partial list of the web crawls in the collection of the Internet Archive and the progress in downloading them to Cornell. All sizes are in terabytes (TB).
Web pages are stored in compressed files, known as ARC files, with associated metadata files, known as DAT files. Many web pages are stored in a single ARC file. The two "Total Files" columns count the ARC and DAT files in the corresponding collections.
The final two columns give the priority order for downloading the crawls to Cornell and the date each download was completed.
| Crawl Name | Time Period | Total Files per Internet Archive | Total ARC Size | Total DAT Size | Total Crawl Size | Total Files per Crawl as Recorded | Download Order | Completed Date |
|---|---|---|---|---|---|---|---|---|
| DD | Feb-April 2001 | 124,433 | 1.9 | 0.306 | 2.2 | 187,538 | 5th | unknown |
| DE | March-June 2001 | 55,809 | 4.3 | 0.419 | 4.7 | 95,328 | n/a | |
| DF | May-August 2001 | 97,472 | 6.8 | 0.442 | 7.2 | 148,402 | n/a | |
| DG | July-Oct 2001 | 97,304 | 6.2 | 0.324 | 6.5 | 134,872 | n/a | |
| DH | Sept-Dec 2001 | 60,485 | 3.9 | 0.251 | 4.1 | 84,286 | n/a | |
| DI | Nov 2001-Feb 2002 | 130,966 | 7.5 | 0.485 | 8 | 162,130 | n/a | |
| DJ | Jan-April 2002 | 161,734 | 9.2 | 0.602 | 9.8 | 204,560 | 2nd | 06/11/2006 |
| DK | March-June 2002 | 124,079 | 5.5 | 0.361 | 5.9 | 120,622 | n/a | |
| DL | Jan-July 2002 | 205,090 | 11.2 | 0.661 | 11.9 | 249,568 | n/a | |
| DM | July-Oct 2002 | 80,601 | 10.1 | 0.66 | 10.8 | 220,002 | n/a | |
| DN | Sept-Dec 2002 | 185,006 | 11.5 | 0.814 | 12.3 | 250,614 | n/a | |
| DO | Nov 2002-Feb 2003 | 232,666 | 13.3 | 0.793 | 14.1 | 289,636 | n/a | |
| DP | Jan-April 2003 | 187,130 | 12.7 | 0.714 | 13.4 | 276,112 | 3rd | 07/03/2007 |
| DQ | March-May 2003 | 87,353 | 24.1 | 1.3 | 25.4 | 522,466 | n/a | |
| DR | May-August 2003 | 28,131 | 20.7 | 1.2 | 21.9 | 448,582 | n/a | |
| DS | July-Oct 2003 | 38,029 | 24.7 | 1.4 | 26.1 | 535,780 | n/a | |
| DT | Sept-Dec 2003 | 49,246 | 21.4 | 1.3 | 22.7 | 464,550 | n/a | |
| DU | Nov 2003-Feb 2004 | 86,859 | 24.6 | 1.4 | 26.1 | 533,316 | n/a | |
| DV | Jan-April 2004 | 126,266 | 23.7 | 1.9 | 25.6 | 515,260 | 4th | unknown |
| DW | March-June 2004 | 72,050 | 26 | 2.2 | 28.2 | 563,766 | n/a | |
| DX | May-August 2004 | 662,525 | 39.5 | 3.6 | 43.1 | 856,164 | n/a | |
| DY | July-Oct 2004 | 714,041 | 35 | 3.8 | 38.7 | 761,410 | n/a | |
| DZ | Sept-Dec 2004 | 681,237 | 32.3 | 3.3 | 35.5 | 705,370 | n/a | |
| EA | Nov 2004-Feb 2005 | 766,220 | 36.2 | 3.8 | 40 | 791,934 | n/a | |
| EB | Jan-March 2005 | 392,875 | 18.1 | 2.2 | 20.3 | 393,760 | 1st | unknown |
| EC | March-April 2005 | 304,660 | 13.8 | 1.8 | 15.6 | 301,268 | n/a | |
| ED | March-August 2005 | 714,120 | 35.6 | 4.3 | 39.9 | 769,948 | n/a | |
| EE | n/a | 0 | 13.5 | 1.7 | 15.2 | 309,110 | n/a | |
| EF | n/a | 0 | 26.6 | 1.8 | 28.4 | 617,400 | n/a | |
| EG | n/a | 0 | 46 | 2.2 | 48.2 | 1,020,064 | n/a | |
Legend:
- Total Files per Internet Archive - The number of files in the crawl according to the Internet Archive
- Total ARC Size - ARC size of the crawl as per the database
- Total DAT Size - DAT size of the crawl as per the database
- Total Crawl Size - equals Total ARC Size + Total DAT Size
- Total Files per Crawl as Recorded - The total number of files per crawl recorded in the WebLabTracking Database
- Download Order - The priority order in which the crawls are to be downloaded
- Completed Date - The date on which the crawl download was completed
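The relation Total Crawl Size = Total ARC Size + Total DAT Size can be checked against the table with a short script. The sketch below is illustrative only: it copies the five rows that have a download order (DD, DJ, DP, DV, EB), and the names `ROWS`, `crawl_size`, `max_discrepancy`, and the tolerance `TOLERANCE_TB` are assumptions introduced here, not part of the Web Laboratory tooling. Because the published figures are rounded, the sums are expected to agree only approximately.

```python
# Rows copied from the table above:
# (crawl name, total ARC size in TB, total DAT size in TB, listed total crawl size in TB)
ROWS = [
    ("DD", 1.9, 0.306, 2.2),
    ("DJ", 9.2, 0.602, 9.8),
    ("DP", 12.7, 0.714, 13.4),
    ("DV", 23.7, 1.9, 25.6),
    ("EB", 18.1, 2.2, 20.3),
]

# Hypothetical tolerance: the table rounds sizes, so exact equality is not expected.
TOLERANCE_TB = 0.15


def crawl_size(arc_tb, dat_tb):
    """Total crawl size as defined in the legend: ARC size plus DAT size."""
    return arc_tb + dat_tb


def max_discrepancy(rows):
    """Largest absolute gap between the computed and the listed crawl size."""
    return max(abs(crawl_size(arc, dat) - listed) for _, arc, dat, listed in rows)


if __name__ == "__main__":
    for name, arc, dat, listed in ROWS:
        print(f"{name}: computed {crawl_size(arc, dat):.3f} TB, listed {listed} TB")
    print("within tolerance:", max_discrepancy(ROWS) <= TOLERANCE_TB)
```

The same check can be extended to the remaining rows by adding them to `ROWS`; a row that falls outside the tolerance would point to a transcription error rather than rounding.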