Norvig Web Data Science Award

show what you can do with 3 billion web pages
by SURFsara and CommonCrawl

To start experimenting you need to know the location and the format of the dataset. Here we provide pointers and the information you need to run your Hadoop programs on the VM and on the cluster.

Fair use policy

First of all: make sure you are familiar with the fair-use policy.

The datasets

We have created two subsets of the Common Crawl dataset hosted at SURFsara: a single file that you can download for use on the VM, and a single segment on the Hadoop cluster. See the location and size of the test sets for details.
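For example, once you have a Kerberos ticket on the cluster (see below), you can inspect the segment with the standard HDFS shell. The path here is only a placeholder; use the actual location listed for the test sets:

$ hdfs dfs -ls /path/to/common-crawl-segment        # list the files in the segment
$ hdfs dfs -du -s -h /path/to/common-crawl-segment  # show the total size of the segment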

The dataset contains four different types of files: SEQ, WARC, WET and WAT files. You can find a description of the file formats on the examples page.
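If you want a quick look at the raw data before writing any code: the WARC, WET and WAT files are typically gzip-compressed text, and the SEQ files are Hadoop SequenceFiles, so standard tools can show their contents. The file names below are placeholders:

$ zcat example.warc.gz | head -n 20              # print the first WARC record headers of a local file
$ hdfs dfs -text /path/to/file.seq | head -n 20  # decode the start of a SequenceFile on the cluster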

Using the Hadoop cluster

Once your program runs correctly on your local machine, it is time to move to a bigger dataset and a bigger machine: SURFsara's Hadoop cluster (called Hathi). There are just a few things you need to pay attention to:

Submitting a MapReduce job

As with the examples we showed before, you should build a jar from your source and run it with yarn jar.
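A minimal sketch of such a submission, assuming your jar is called myjob.jar with main class org.example.MyJob (both placeholders) and that the input path points at the dataset on HDFS:

$ yarn jar myjob.jar org.example.MyJob /path/to/input /user/USERNAME/output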

Submitting a Pig job

You can run a Pig job on the cluster by removing '-x local' from the command line:

$ kinit USERNAME
Password for USERNAME@CUA.SURFSARA.NL:
$ pig myjob.pig
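For comparison, the same script still runs locally on the VM (against the downloaded file) when you keep the '-x local' flag, as in the earlier examples:

$ pig -x local myjob.pig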