Norvig Web Data Science Award

show what you can do with 3 billion web pages
by SURFsara and Common Crawl

What's the fair-use policy?

At SURFsara we serve a multitude of users, and it is important that everybody gets a chance to use the service. We try to keep this process as rule-free as possible, because we believe strict regulation goes against the nature of scientific experimentation.

The fair-use policy we enforce here is aimed at giving everybody an equal chance to use the available capacity. This means we will not allow monopolization of the Hadoop cluster.

In order to make this process as smooth as possible, we have the following tips:

  • Do not wait until the last moment to run your experiment. We will not move the deadline or make exceptions!
  • Test your code on the test set on the cluster. This is a much smaller set, so your tests will finish much sooner.
  • Try to use no more than around 50 reducers. Reduce slots do not become available until a job ends, so small jobs might otherwise have to wait for your big job (see the driver sketch after this list).
  • Try to limit your runs on the complete dataset: the more complete runs you do, the more likely you are to monopolize the cluster.
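
To illustrate the tips about the test set and the reducer cap, here is a minimal MapReduce driver sketch. It is not reference code for the award: the class and job names, the output-path argument, and the identity Mapper/Reducer placeholders are our own assumptions; only the test-set path and the "around 50 reducers" figure come from this FAQ.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class FairUseDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "naward-test-run");
        job.setJarByClass(FairUseDriver.class);

        // Identity mapper/reducer stand in for your own implementation.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Keep the reducer count modest so you do not hold all reduce slots.
        job.setNumReduceTasks(50);

        // Develop against the small test set; switch to the full set
        // (see the dataset paths further down in this FAQ) only for final runs.
        FileInputFormat.addInputPath(job,
            new Path("/data/public/common-crawl/crawl-data/CC-TEST-2014-10/"));
        // Output path is passed as the first command-line argument.
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }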

We monitor the cluster continuously. If we notice you are taking an unfair share of resources, we may contact you or kill the offending jobs, but please try not to let it come to that.

We reserve the right to kill your jobs or deny you access at any time, at our discretion. This might be necessary if you do not comply with the fair-use policy, but other circumstances might also justify these actions.

Back to FAQ index


Where can I find the datasets?

In addition to the full Common Crawl set, we provide a test set. The two sets can be found at:

  • Test set on the cluster: /data/public/common-crawl/crawl-data/CC-TEST-2014-10/ (80 GB)
  • Full set: /data/public/common-crawl/CC-MAIN-2014-10/ (48.6 TB)
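
If you want to check that these paths are readable from your account, a small sketch like the one below lists the files under the test set with the standard Hadoop FileSystem API. The class name is ours, and it assumes the cluster configuration is available on your classpath; swap in the CC-MAIN-2014-10 path for the full set.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListTestSet {
      public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and the rest of the cluster config from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Test set path from the list above; use CC-MAIN-2014-10 for the full set.
        Path testSet = new Path("/data/public/common-crawl/crawl-data/CC-TEST-2014-10/");
        for (FileStatus status : fs.listStatus(testSet)) {
          System.out.println(status.getPath() + "\t" + status.getLen());
        }
      }
    }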

Back to FAQ index


What is installed on the Virtual Machine?

On the image we have installed Ubuntu GNU/Linux 14.04. Among others, we have installed the following applications:

Back to FAQ index


What are the username and password of the Virtual Machine?

The user account is called naward and has the password award2014. You need this password for tasks that require administrative rights (such as installing additional software).

Back to FAQ index


Where can I get help with Hadoop?

Back to FAQ index


Where can I get help with the Common Crawl data?

Back to FAQ index


Why does it take so long before my job starts running at SURFsara?

The SURFsara Hadoop cluster is a multi-tenant cluster, which means you are sharing its resources with other users. If other jobs are occupying all of the cluster's processing power, your job will have to wait in the queue until capacity becomes available again.

Back to FAQ index


Where can I find information about jobs on the SURFsara cluster?

Both the namenode and jobtracker have a web interface. The Firefox browser in the VM already contains bookmarks to these pages.

You need to authenticate with kinit to get access to the web interfaces.
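
If the web interfaces keep rejecting you, a quick way to check that the ticket you obtained with kinit is actually being picked up is a sketch along the following lines. This is our own example, not official instructions, and it assumes the kerberized cluster configuration is on your classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class CheckKerberosTicket {
      public static void main(String[] args) throws Exception {
        // Load the cluster configuration so Hadoop knows security is enabled.
        UserGroupInformation.setConfiguration(new Configuration());

        // getCurrentUser() reads the ticket cache created by kinit.
        UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
        System.out.println("User:       " + ugi.getUserName());
        System.out.println("Secure:     " + UserGroupInformation.isSecurityEnabled());
        System.out.println("Has ticket: " + ugi.hasKerberosCredentials());
      }
    }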

Back to FAQ index