To start experimenting you need to know the location and the format of the dataset. Here we provide pointers and give the information you need to run your Hadoop programs on the VM and cluster.
First of all: make sure you are familiar with the fair-use policy.
We have created two subsets of the Common Crawl dataset hosted at SURFsara: a single file you can download for use on the VM, and a single segment on the Hadoop cluster. Location and size of the test sets:
The dataset contains four different types of files: SEQ, WARC, WET and WAT files. You can find a description of the file formats on the examples page.
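The WARC, WET and WAT files are gzip-compressed; the WET files in particular contain plain-text records, so you can take a quick look at one on the VM before writing any code. The file name below is only a placeholder for whichever test file you downloaded:

$ zcat CC-MAIN-example.warc.wet.gz | head -n 20

The first lines show the WARC headers (such as WARC-Type, WARC-Target-URI and Content-Length) that precede each record. The SEQ files are Hadoop SequenceFiles and cannot be inspected this way.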
Once your program runs correctly on your local machine, it is time to move to a bigger dataset and a bigger machine: SURFsara's Hadoop cluster (called Hathi). There are just a few things you need to pay attention to:
- Before you can run anything on the cluster you need to authenticate yourself with Kerberos: kinit USERNAME. This is the username you have received by email after applying. You only need to do this once per session.
- As in the examples we showed before, you should build a jar from your source and run it with yarn jar (see the example below).
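For example, assuming you have built your job into a jar (the jar name, main class and input/output paths below are placeholders for your own):

$ yarn jar mywordcount.jar com.example.WordCount input output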
You can run a Pig job on the cluster by removing '-x local' from the command line:
$ kinit USERNAME
Password for USERNAME@CUA.SURFSARA.NL:
$ pig myjob.pig
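For comparison, the same script runs on the VM in local mode with:

$ pig -x local myjob.pig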