Norvig Web Data Science Award

show what you can do with 3 billion web pages
by SURFsara and CommonCrawl

To start experimenting you need to know the location and the format of the dataset. Here we provide pointers and the information you need to run your Hadoop programs on the VM and on the cluster.

Fair use policy

First of all: make sure you are familiar with the fair-use policy.

The datasets

We have created two subsets of the Common Crawl dataset hosted at SURFsara: a single file that you can download for use on the VM, and a single segment on the Hadoop cluster. See the location and size of the test sets for details.
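For example, once you have a Kerberos ticket on the cluster (see below), you can inspect the segment with the standard HDFS shell. The path here is only a placeholder; use the actual location listed for the test sets:

$ hdfs dfs -ls /path/to/common-crawl-segment        # list the files in the segment
$ hdfs dfs -du -s -h /path/to/common-crawl-segment  # show the total size of the segment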

The dataset contains four different types of files: SEQ, WARC, WET and WAT files. You can find a description of the file formats on the examples page.
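If you want a quick look at the raw data before writing any code: the WARC, WET and WAT files are typically gzip-compressed text, and the SEQ files are Hadoop SequenceFiles, so standard tools can show their contents. The file names below are placeholders:

$ zcat example.warc.gz | head -n 20              # print the first WARC record headers of a local file
$ hdfs dfs -text /path/to/file.seq | head -n 20  # decode the start of a SequenceFile on the cluster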

Using the Hadoop cluster

Once your program runs correctly on your local machine, it is time to move to a bigger dataset and a bigger machine: SURFsara's Hadoop cluster (called Hathi). There are just a few things you need to pay attention to:

Submitting a MapReduce job

As with the examples we showed before, you should build a jar from your source and run it with yarn jar.
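A minimal sketch of such a submission, assuming your jar is called myjob.jar with main class org.example.MyJob (both placeholders) and that the input path points at the dataset on HDFS:

$ yarn jar myjob.jar org.example.MyJob /path/to/input /user/USERNAME/output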

Submitting a Pig job

You can run a Pig job on the cluster by removing '-x local' from the command line:

$ kinit USERNAME
Password for USERNAME@CUA.SURFSARA.NL:
$ pig myjob.pig
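For comparison, the same script still runs locally on the VM (against the downloaded file) when you keep the '-x local' flag, as in the earlier examples:

$ pig -x local myjob.pig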