show what you can do with 3 billion web pages
by SURFsara and Common Crawl
At SURFsara we serve a multitude of users, and it is important that everybody gets a chance to use the service. We try to keep this process as rule-free as possible, because we believe strict regulation runs counter to the nature of scientific experimentation.
The fair-use policy we enforce here is aimed at giving everybody an equal chance to use the available capacity. This means we will not allow monopolization of the Hadoop cluster.
To make this process as smooth as possible, we have the following tips:
We monitor the cluster continuously. If we notice that you are using an unfair share of resources, we may contact you, and we may also kill the offending jobs. Please try not to let it come to that.
We reserve the right to kill your jobs or deny you access at any time, at our discretion. This may be necessary if you do not comply with the fair-use policy, but other circumstances might also justify these actions.
In addition to the Common Crawl data set, we provide a test set. The two sets can be found at:
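Once you know where the data sets live on HDFS, you can inspect them from the VM. A minimal sketch using the standard Hadoop shell; the path below is only a placeholder, substitute the actual locations given above:

    # List the contents of a data set on HDFS
    # (the path is a placeholder; use the real location from the list above).
    hadoop fs -ls /path/to/common-crawl

    # Show the size of the individual files in the set.
    hadoop fs -du /path/to/common-crawl

It is usually a good idea to develop and debug against the small test set first, and only run on the full Common Crawl set once your job behaves as expected.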
On the image we have installed Ubuntu GNU/Linux 14.04. Among others, we installed the following applications:
The user account is called naward and has the password award2014. You need this password for tasks that require administrative rights (such as installing additional software).
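For example, installing an extra package inside the VM looks roughly like this, assuming the naward account has sudo rights; the package name is only an illustration:

    # Install additional software inside the VM; sudo will prompt for
    # the naward password (award2014). The package shown is just an example.
    sudo apt-get update
    sudo apt-get install htop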
The SURFsara Hadoop cluster is a multi-tenant cluster. This means you are sharing the cluster's resources with other users. If other jobs are occupying all of the cluster's processing power, your job will have to wait in the queue until capacity frees up again.
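To get a rough idea of how busy the cluster is before submitting, you can list the jobs that are currently running. A sketch using the standard Hadoop command line:

    # List jobs currently known to the JobTracker, including their state
    # and the user who submitted them.
    hadoop job -list

    # Include completed and failed jobs as well, for more context.
    hadoop job -list all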
Both the namenode and jobtracker have a web interface. The Firefox browser in the VM already contains bookmarks to these pages.
You need to authenticate with kinit to get access to the web interfaces.
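A minimal sketch of the authentication step, run from a terminal inside the VM; the principal name is an assumption, so use the credentials supplied with your account:

    # Obtain a Kerberos ticket; the principal shown is only an example.
    kinit your_username

    # Check that a valid ticket was granted and when it expires.
    klist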