Norvig Web Data Science Award

What's the fair-use policy?

At SURFsara we serve a multitude of users, and it is important that everybody gets a chance to use the service. We try to keep this process as rule-free as possible because we believe strict regulation is against the nature of scientific experimentation.

The fair-use policy we enforce here is aimed at giving everybody an equal chance to use the available capacity. This means we will not allow anyone to monopolize the Hadoop cluster.

To make this process as smooth as possible, we have the following tips:

Do not wait until the last moment to run your experiment. We will not move the deadline or make exceptions!
Test your code locally first. We have included a test set in the VM.
Once you're done, test your code on the test set on the cluster. This set is much smaller than the full set, so your tests will finish much sooner.
Try to use no more than about 100 reducers. The reduce slots you occupy won't become available again before your job ends, so small jobs might have to wait for your big job.
Try to limit your runs on the complete dataset: the more often you do complete runs, the more likely you are to monopolize the cluster.

We monitor the cluster continuously. We might contact you if we notice that you are taking an unfair amount of resources. We might also kill jobs that take an unfair amount of resources, but please try not to let it get to that.

We reserve the right to kill your jobs or deny you access at any given time, and at our discretion. This might be necessary if you do not comply with the fair-use policy, but other circumstances might also justify these actions.

Where can I find the datasets?

In addition to the Common Crawl set we provide two test sets. The three sets can be found at:

Test set on the VM: /home/participant/data/* (~163MB)
Test set on the cluster: /data/public/common-crawl/award/testset (~6.6GB)
Full set: /data/public/common-crawl/parse-output/* (~25TB)
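
To get a feel for why testing on the smaller sets first matters, the sizes listed above can be compared with a little arithmetic. The sketch below assumes decimal units (1 GB = 1000 MB, 1 TB = 1000 GB) and treats the approximate sizes as exact; the names `SIZES_MB` and `ratio` are made up for illustration.

```python
# Approximate dataset sizes from the list above, converted to megabytes.
# Assumption: decimal units (1 GB = 1000 MB, 1 TB = 1000 GB).
SIZES_MB = {
    "vm_testset": 163,
    "cluster_testset": 6.6 * 1000,
    "full_set": 25 * 1000 * 1000,
}


def ratio(larger, smaller):
    """Return how many times bigger one set is than another."""
    return SIZES_MB[larger] / SIZES_MB[smaller]


if __name__ == "__main__":
    print(f"cluster test set vs VM test set: ~{ratio('cluster_testset', 'vm_testset'):.0f}x")
    print(f"full set vs cluster test set:    ~{ratio('full_set', 'cluster_testset'):.0f}x")
```

Under these assumptions, a run on the full set reads thousands of times more data than a run on the cluster test set.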

Hey! That's not 6 billion pages! What's going on?

You're right. Common Crawl is divided into three major subsets:

Crawl data from 2008 / 2010
Crawl data from 2009 / 2010
Crawl data from 2012

The 2012 set contains 3.8 billion pages in total. At SURFsara we currently have about a third of the 2012 set. We are adding capacity for the remainder of the set, but we don't expect to be able to offer the full 2012 set before January 15th, 2013. See the stats for more detailed info on the Common Crawl 2012 set.

What are the stats on the Common Crawl 2012 set?

Common Crawl's Chris Stephens mentions some stats in a comment on the announcement of the 2012 set. Note that we do not have the full 2012 set available.

General stats
Total # of Web Documents	3.8 billion
Total Uncompressed Content Size	> 100 TB
# of Domains	61 million
# of PDFs	92.2 million
# of Word Docs	6.6 million
# of Excel Docs	1.3 million

Page count breakdown by domain name for the top 20 TLDs (these may contain HTTP 404 results)
TLD	count	relative count
com	2,880,575,573	62.88%
org	324,888,772	7.09%
net	285,633,100	6.24%
de	225,021,051	4.91%
co.uk	157,660,729	3.44%
ru	78,841,251	1.72%
info	76,883,737	1.68%
pl	68,825,576	1.50%
nl	68,461,904	1.49%
fr	62,542,019	1.37%
it	59,027,654	1.29%
com.au	41,032,777	0.90%
edu	36,029,039	0.79%
com.br	35,458,446	0.77%
cz	34,635,725	0.76%
ca	32,767,169	0.72%
es	31,994,812	0.70%
jp	28,502,740	0.62%
ro	26,803,448	0.59%
se	25,399,890	0.55%

We have not verified these statistics.
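
Although we have not verified the numbers themselves, the table can at least be checked for internal consistency. The Python sketch below recomputes each TLD's share from the page counts. It suggests that the relative counts are shares of the summed top 20 page counts rather than of all 3.8 billion documents. The counts and percentages are copied from the table above; the names `TOP_TLDS` and `recomputed_shares` are only illustrative.

```python
# (TLD, page count, listed relative count in %) copied from the table above.
TOP_TLDS = [
    ("com", 2_880_575_573, 62.88),
    ("org", 324_888_772, 7.09),
    ("net", 285_633_100, 6.24),
    ("de", 225_021_051, 4.91),
    ("co.uk", 157_660_729, 3.44),
    ("ru", 78_841_251, 1.72),
    ("info", 76_883_737, 1.68),
    ("pl", 68_825_576, 1.50),
    ("nl", 68_461_904, 1.49),
    ("fr", 62_542_019, 1.37),
    ("it", 59_027_654, 1.29),
    ("com.au", 41_032_777, 0.90),
    ("edu", 36_029_039, 0.79),
    ("com.br", 35_458_446, 0.77),
    ("cz", 34_635_725, 0.76),
    ("ca", 32_767_169, 0.72),
    ("es", 31_994_812, 0.70),
    ("jp", 28_502_740, 0.62),
    ("ro", 26_803_448, 0.59),
    ("se", 25_399_890, 0.55),
]


def recomputed_shares(rows):
    """Return {tld: share in % of the summed page counts of all rows}."""
    total = sum(count for _, count, _ in rows)
    return {tld: 100.0 * count / total for tld, count, _ in rows}


if __name__ == "__main__":
    shares = recomputed_shares(TOP_TLDS)
    print("sum of top 20 page counts:", sum(c for _, c, _ in TOP_TLDS))
    for tld, _, listed in TOP_TLDS:
        print(f"{tld:8s} listed {listed:6.2f}%  recomputed {shares[tld]:6.2f}%")
```

The summed page counts come out higher than the 3.8 billion documents quoted in the general stats, so the two tables are presumably counted differently.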

What is installed on the Virtual Machine?

On the image we have installed Ubuntu GNU/Linux 12.04. Among other things, we installed the following applications:

What are the username and password of the Virtual Machine?

The user account is called participant and has the password hadoop. You need this password for tasks that require administrative rights (such as installing additional software).

What MapReduce examples are included in the Virtual Machine?

The Virtual Machine comes with Eclipse and a modified version of the Common Crawl examples. The modifications are:

Ported to the Hadoop 0.20 API
Included a working Pig Loader

The modified examples are also available in their own repository on GitHub.

What Pig example is included in the Virtual Machine?

The examples in the Virtual Machine also come with an example Pig script that illustrates the use of the Pig Loader. This example does a simple count of content types. It looks like this:

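-- Register the libraries the examples depend on, and the examples jar itself
register /home/participant/git/commoncrawl-examples/lib/*.jar;
register /home/participant/git/commoncrawl-examples/dist/lib/commoncrawl-examples-1.0.1.jar;
-- Load one ARC file from the VM test set using the Pig Loader
a = LOAD '/home/participant/data/1346864466526_10.arc.gz' USING org.commoncrawl.pig.ArcLoader() as (date, length, type, statuscode, ipaddress, url, html);
-- Keep only the content type of each record
words = foreach a generate flatten(type) as types;
-- Group the records by content type and count each group
grpd = group words by types;
cntd = foreach grpd generate group, COUNT(words);
-- Print the (content type, count) pairs
dump cntd;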
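
For readers who are new to Pig Latin, the same group-and-count logic can be sketched in plain Python. This is not how the job runs on Hadoop; it only mirrors what the script computes. The sample records and the `count_content_types` helper are made up for illustration.

```python
from collections import Counter


def count_content_types(records):
    """Count how often each content type occurs.

    Mirrors the Pig script: project the `type` field, group by it,
    and count the records in each group.
    """
    return Counter(record["type"] for record in records)


if __name__ == "__main__":
    # Hypothetical records; real ones would come from an ARC file.
    sample = [
        {"url": "http://example.com/", "type": "text/html"},
        {"url": "http://example.com/a.pdf", "type": "application/pdf"},
        {"url": "http://example.org/", "type": "text/html"},
    ]
    for content_type, count in count_content_types(sample).items():
        print(content_type, count)
```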

Where can I get help with Hadoop?

Read the Hadoop API documentation
Use Search-Hadoop.com, a Hadoop-specific search engine
Become a member of the hadoop-user mailing list
Become a member of the pig-users mailing list
Read the Pig manual
Use Jimmy Lin and Chris Dyer's excellent book, Data-Intensive Text Processing with MapReduce
Follow the MapReduce tutorial from Hadoop's homepage

Where can I get help with the Common Crawl data?

Have a look at the Common Crawl website
Ask on the Common Crawl mailing list

Why does it take so long before my job starts running at SURFsara?

The SURFsara Hadoop cluster is a multi-tenant cluster. This means you are sharing the cluster's resources with other users. If other jobs are occupying all of the cluster's processing power, your job will have to wait in the queue until capacity becomes available again.

Where can I find information about jobs on the SURFsara cluster?

Both the namenode and jobtracker have a web interface. The Firefox browser in the VM already contains bookmarks to these pages.

You need to authenticate with kinit to get access to the web interfaces.

Frequently asked questions

What's the fair-use policy?

Where can I find the datasets?

Hey! That's not 6 billion pages! What's going on?

What are the stats on the Common Crawl 2012 set?

What is installed on the Virtual Machine?

What are the username and password of the Virtual Machine?

What MapReduce examples are included in the Virtual Machine?

What Pig example is included in the Virtual Machine?

Where can I get help with Hadoop?

Where can I get help with the Common Crawl data?

Why does it take so long before my job starts running at SURFsara?

Where can I find information about jobs on the SURFsara cluster?