The March 2014 release of the Common Crawl is delivered in the form of warc files and is stored per segment (557 segments in total). Each segment contains gzipped warc files (100 files per segment) with the raw crawl output, that is: warc records with metadata, the http request and the http response. The wet and wat files (also gzipped) contain the parsed html (text/plain) and metadata in JSON format, respectively. More information can be found here and the file format specification can be found here. Note that warc, wat and wet files are all considered warc files per the specification and consist of records in which a warc header is followed by some optional content; they differ only in the type of content.
A final thing to note is that each crawled page is often represented by consecutive records with different WARC-Types: metadata, request, response, warcinfo, conversion etc. We suggest you study the files carefully (they are all text files); a tarball with an example of each file is available here: CC-TEST-2014-10-segment-1394678706211.tar.gz (943MB).
In addition to the warc, wet and wat files we decided to convert the warc.gz (raw crawl) files to a sequence file format. Each warc file is around 600 to 800MB in size and a gzipped file can only be processed by a single mapper, i.e. it is not splittable. To optimise this and allow random access (if needed) to the warc records we converted them to Hadoop sequence files. The keys of the sequence files are longs (LongWritable) running from 0 to the number of warc records per segment; the values are Strings (Text), where each String parses to a single warc record. The choice of a String representation was made to allow flexibility in the use of the different Java warc libraries (although we highly recommend the Java Web Archive Toolkit). For dealing with the files on our cluster we have prepared some utility classes (mainly Hadoop InputFormats and RecordReaders) which can be found here: warcutils
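To get a feel for this layout you can read a handful of records straight from one of the sequence files. The snippet below is a minimal sketch of our own (it is not part of warcexamples): it only assumes the LongWritable/Text layout described above, the DumpWarcTypes class name and argument handling are ours, and the WARC header is parsed by hand rather than with a warc library.

import java.io.BufferedReader;
import java.io.StringReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpWarcTypes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path seqFile = new Path(args[0]); // one of the converted sequence files
        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(seqFile))) {
            LongWritable key = new LongWritable();
            Text value = new Text();
            int shown = 0;
            while (reader.next(key, value) && shown < 10) {
                // Each value parses to a single WARC record: a block of
                // "Name: value" header lines, an empty line, then the content.
                BufferedReader lines = new BufferedReader(new StringReader(value.toString()));
                String line;
                while ((line = lines.readLine()) != null && !line.isEmpty()) {
                    if (line.startsWith("WARC-Type:")) {
                        System.out.println(key.get() + "\t" + line);
                    }
                }
                shown++;
            }
        }
    }
}

In a MapReduce job you would not read the files by hand like this; the InputFormats and RecordReaders in warcutils take care of that for you.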
By now you will probably be wondering where the files are located and how to get to them. The entire March 2014 crawl is located on HDFS on the Hathi cluster at SURFsara:
In addition to the full data set there is a subset of the data - one segment - in the following location:
We have prepared some example code to show how to work with each of the files in the Common Crawl data. This should give you a good starting point for your own hacking. The code can be found on our github: warcexamples source. Precompiled binaries are available from our maven repository: http://beehub.nl/surfsara-repo/releases/. To clone the source:
git clone https://github.com/norvigaward/warcexamples
The warcexamples project contains both Ant/Ivy and Maven build files and should be easy to use with your favourite editor/IDE. If you have any questions about building or compiling the source, please let us know by sending an email to: hadoop.support@surfsara.nl.
In order to run the examples you should have the Hadoop client software prepared and a valid Kerberos ticket (for details see the documentation here and here). Once you have done this you can run the examples with the yarn jar command:
yarn jar warcexamples.jar
Or depending on how you built the examples:
yarn jar warcexamples-1.1-fatjar.jar
Running the above command should show you a list of the current example programs:
NER: mapreduce example that performs Named Entity Recognition on the text in wet files. See the nl.surfsara.warcexamples.hadoop.wet package for the relevant code. Usage:
yarn jar warcexamples.jar ner hdfs_input_path hdfs_output_path
servertype: extracts the servertype information from the wat files. See the nl.surfsara.warcexamples.hadoop.wat package for the relevant code. Usage:
yarn jar warcexamples.jar servertype hdfs_input_path hdfs_output_path
href: parses the html in warc files and outputs the url of the crawled page and the links (href attribute) from the parsed document. See the nl.surfsara.warcexamples.hadoop.warc package for the relevant code. Usage:
yarn jar warcexamples.jar href hdfs_input_path hdfs_output_path
Note that the input path should consist of sequence files.
headers: dumps the headers from a (gzipped) wat, warc or wet file. This is not a mapreduce example, but the files are read from HDFS, so it can be run from your local computer or the VM. See the nl.surfsara.warcexamples.hadoop.hdfs package for the relevant code. Usage:
yarn jar warcexamples.jar headers hdfs_input_file
If you want to pass Hadoop-specific options to the programs, you can add them to the command line. For example, the following command runs the href example with 10 reducers and a maximum of 4GB memory for the JVM:
yarn jar warcexamples.jar href -D mapreduce.job.reduces=10 \
    -D mapred.child.java.opts=-Xmx4G hdfs_input_path hdfs_output_path
Most of the examples follow the same structure: an implementation of the org.apache.hadoop.util.Tool interface coupled to a custom mapper with an identity or stock reducer. The dependencies are handled by Ivy/Maven. All the mapreduce examples make use of our own warcutils package for reading data from HDFS (maven repository here).
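As an illustration of this structure, here is a stripped-down sketch. It is not one of the actual examples: the class names and the toy per-WARC-Type count are ours, and it uses a plain SequenceFileInputFormat instead of the warcutils input formats. It shows a Tool implementation that wires a custom mapper to a stock sum reducer and is launched through ToolRunner, which is also what makes the -D options shown above work.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ExampleTool extends Configured implements Tool {

    // Custom mapper: receives one WARC record (as Text) per call and emits
    // whatever it extracts; the stock sum reducer then aggregates the counts.
    public static class WarcRecordMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            // Toy extraction: count records per WARC-Type header.
            for (String line : value.toString().split("\r?\n")) {
                if (line.isEmpty()) {
                    break; // end of the WARC header block
                }
                if (line.startsWith("WARC-Type:")) {
                    context.write(new Text(line.substring("WARC-Type:".length()).trim()), ONE);
                }
            }
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "warc example sketch");
        job.setJarByClass(ExampleTool.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(WarcRecordMapper.class);
        job.setReducerClass(LongSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic options such as -D mapreduce.job.reduces=10
        // before handing the remaining arguments to run().
        System.exit(ToolRunner.run(new ExampleTool(), args));
    }
}

The actual examples plug in the warcutils input formats and their own mappers at exactly these two points.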
Apache Pig allows you to write data-analysis programs in a data-flow language called Pig Latin. The Pig compiler converts your Pig Latin program into a series of MapReduce jobs. Using Pig allows you to think about your problem as a series of relations instead of as MapReduce jobs. We provide Pig Loaders for the warc and seq files as part of the warcutils package. The Pig Loaders do not expose all the information that is available in the warc files, but you are free to extend them to extract the information relevant to your project.
The warcexamples repository contains two example Pig scripts that demonstrate how you can process the web crawl using Pig. One of them looks at the Content-Type and Content-Length of the records and calculates the average length per type. Running a Pig job is very simple:
$ cd ~/warcexamples/pig
$ pig sizepertype.pig
With Pig it is also possible to run the job on the local machine by adding -x local to the command. Running Pig locally on a single segment is a very fast way to explore the data and develop your algorithm. You do need to make sure that you have the data and the dependencies (jars) available on your local machine.
Whenever bugs are found in our example code, the utilities or the hadoop client code, we fix them and push the fixes to our code repository on github: https://github.com/norvigaward/ and to our maven repository on beehub: http://beehub.nl/surfsara-repo/releases/. This means, however, that the version you have might still include bugs. To make sure you always have the latest code, we recommend that you update regularly and/or make your own packages depend on the most up-to-date version. To retrieve the latest example code from github:
Go to the directory that holds the code (in this case the examples; the process is the same for the other projects):
naward@hadoop-vm:~$ cd git/warcexamples
Use git to fetch the latest code:
naward@hadoop-vm:~/git/warcexamples$ git pull