The March 2014 release of the Common Crawl is delivered in the form of warc files and is stored per segment (557 segments in total). Each segment contains gzipped warc files (100 files per segment) with the raw crawl output, that is: warc records with metadata, the http request and the http response. The wet and wat files (also gzipped) contain the parsed html (text/plain) and metadata in JSON format, respectively. More information can be found here and the file format specification can be found here. Note that warc, wat and wet files are all considered warc files per the specification and consist of records in which a warc header is followed by some optional content; they differ only in the type of content.
A final thing to note is that each crawled page is often represented by consecutive records with different WARC-Types: metadata, request, response, warcinfo, conversion etc. We suggest you study the files carefully (they are all text files); a tarball with an example of each file is available here: CC-TEST-2014-10-segment-1394678706211.tar.gz (943MB).
In addition to the warc, wet and wat files we decided to convert the warc.gz (raw crawl) files to a sequence file format. Each warc file is around 600 to 800MB in size and a gzipped file can only be processed by a single mapper, i.e. it is not splittable. To optimise this and allow random access (if needed) to the warc records we converted them to Hadoop sequence files. The keys of the sequence files are longs (LongWritable) running from 0 to the number of warc records per segment; the values are Strings (Text), where each String parses to a single warc record. The choice of a String representation was made to allow flexibility in the use of the different Java warc libraries (although we highly recommend the Java Web Archive Toolkit). For dealing with the files on our cluster we have prepared some utility classes (mainly Hadoop InputFormats and RecordReaders) which can be found here: warcutils
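To get a feel for this layout you can read a handful of records straight from one of the sequence files. The snippet below is a minimal sketch of our own (it is not part of warcexamples): it only assumes the LongWritable/Text layout described above, the DumpWarcTypes class name and argument handling are ours, and the WARC header is parsed by hand rather than with a warc library.

import java.io.BufferedReader;
import java.io.StringReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpWarcTypes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path seqFile = new Path(args[0]); // one of the converted sequence files
        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(seqFile))) {
            LongWritable key = new LongWritable();
            Text value = new Text();
            int shown = 0;
            while (reader.next(key, value) && shown < 10) {
                // Each value parses to a single WARC record: a block of
                // "Name: value" header lines, an empty line, then the content.
                BufferedReader lines = new BufferedReader(new StringReader(value.toString()));
                String line;
                while ((line = lines.readLine()) != null && !line.isEmpty()) {
                    if (line.startsWith("WARC-Type:")) {
                        System.out.println(key.get() + "\t" + line);
                    }
                }
                shown++;
            }
        }
    }
}

In a MapReduce job you would not read the files by hand like this; the InputFormats and RecordReaders in warcutils take care of that for you.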
By now you will probably be wondering where the files are located and how to get to them. The entire March 2014 crawl is located on HDFS on the Hathi cluster at SURFsara:
In addition to the full data set there is a subset of the data - one segment - in the following location:
We have prepared some example code to show how to work with each of the files in the Common Crawl data. This should give you a good starting point for your own hacking. The code can be found on our github: warcexamples source. Precompiled binaries are available from our maven repository: http://beehub.nl/surfsara-repo/releases/. To clone the source:
git clone https://github.com/norvigaward/warcexamples
The warcexamples project contains both Ant/Ivy and Maven build files and should be easy to use with your favourite editor/IDE. If you have any questions about building or compiling the source, please let us know by sending an email to: hadoop.support@surfsara.nl.
In order to run the examples you should have the Hadoop client software prepared and a valid Kerberos ticket (for details see the documentation here and here). Once you have done this you can run the examples with the yarn jar command:
yarn jar warcexamples.jar
Or depending on how you built the examples:
yarn jar warcexamples-1.1-fatjar.jar
Running the above command should show you a list of the current example programs:
NER: mapreduce example that performs Named Entity Recognition on the text in wet files. See the nl.surfsara.warcexamples.hadoop.wet package for the relevant code. Usage:
yarn jar warcexamples.jar ner hdfs_input_path hdfs_output_path
servertype: extracts the servertype information from the wat files. See the nl.surfsara.warcexamples.hadoop.wat package for the relevant code. Usage:
yarn jar warcexamples.jar servertype hdfs_input_path hdfs_output_path
href: parses the html in warc files and outputs the url of the crawled page and the links (href attribute) from the parsed document. See the nl.surfsara.warcexamples.hadoop.warc package for the relevant code. Usage:
yarn jar warcexamples.jar href hdfs_input_path hdfs_output_path
Note that the input path should consist of sequence files.
headers: dumps the headers from a (gzipped) wat, warc or wet file. This is not a mapreduce example, but the files are read from HDFS, so it can be run from your local computer or the VM. See the nl.surfsara.warcexamples.hadoop.hdfs package for the relevant code. Usage:
yarn jar warcexamples.jar headers hdfs_input_file
If you want to pass Hadoop-specific options to the programs, you can add them to the command line. For example, the following command runs the href example with 10 reducers and a maximum of 4GB memory for the JVM:
yarn jar warcexamples.jar href -D mapreduce.job.reduces=10 \
    -D mapred.child.java.opts=-Xmx4G hdfs_input_path hdfs_output_path
Most of the examples follow the same structure: an implementation of the org.apache.hadoop.util.Tool interface coupled to a custom mapper with an identity or stock reducer. The dependencies are handled by Ivy/Maven. All the mapreduce examples make use of our own warcutils package for reading data from HDFS (maven repository here).
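As an illustration of this structure, here is a stripped-down sketch. It is not one of the actual examples: the class names and the toy per-WARC-Type count are ours, and it uses a plain SequenceFileInputFormat instead of the warcutils input formats. It shows a Tool implementation that wires a custom mapper to a stock sum reducer and is launched through ToolRunner, which is also what makes the -D options shown above work.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ExampleTool extends Configured implements Tool {

    // Custom mapper: receives one WARC record (as Text) per call and emits
    // whatever it extracts; the stock sum reducer then aggregates the counts.
    public static class WarcRecordMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            // Toy extraction: count records per WARC-Type header.
            for (String line : value.toString().split("\r?\n")) {
                if (line.isEmpty()) {
                    break; // end of the WARC header block
                }
                if (line.startsWith("WARC-Type:")) {
                    context.write(new Text(line.substring("WARC-Type:".length()).trim()), ONE);
                }
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "warc example sketch");
        job.setJarByClass(ExampleTool.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(WarcRecordMapper.class);
        job.setReducerClass(LongSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic options such as -D mapreduce.job.reduces=10
        // before handing the remaining arguments to run().
        System.exit(ToolRunner.run(new ExampleTool(), args));
    }
}

The actual examples plug in the warcutils input formats and their own mappers at exactly these two points.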
Apache Pig allows you to write data-analysis programs in a data-flow language called Pig Latin. The Pig compiler converts your Pig Latin program into a series of MapReduce jobs. Using Pig allows you to think about your problem as a series of relations instead of as MapReduce jobs. We provide Pig Loaders for the warc and seq files as part of the warcutils package. The Pig Loaders do not expose all the information that is available in the warc files, but you are free to extend them to extract the information relevant to your project.
The warcexamples repository contains two example Pig scripts that demonstrate how you can process the web crawl using Pig. One of them looks at the Content-Type and Content-Length of the records and calculates the average length per type. Running a Pig job is very simple:
$ cd ~/warcexamples/pig
$ pig sizepertype.pig
With Pig it is also possible to run the job on the local machine by adding -x local to the command. Running Pig locally on a single segment is a very fast way to explore the data and develop your algorithm. You do need to make sure that you have the data and the dependencies (jars) available on your local machine.
Whenever bugs are found in our example code, the utilities or the hadoop client code, we fix them and push the fixes to our code repository on github: https://github.com/norvigaward/ and to our maven repository on beehub: http://beehub.nl/surfsara-repo/releases/. This means, however, that the version you have might still include bugs. To make sure you always have the latest code, we recommend that you update regularly and/or make your own packages depend on the most up-to-date version. To retrieve the latest example code from github:
Go to the directory that holds the code (in this case the examples; the process is the same for the other projects):
naward@hadoop-vm:~$ cd git/warcexamples
Use git to fetch the latest code:
naward@hadoop-vm:~/git/warcexamples$ git pull