Norvig Web Data Science Award

show what you can do with 6 billion web pages
by SURFsara and Common Crawl

What's the fair-use policy?

At SURFsara we serve a multitude of users, and it is important that everybody gets a chance to use the service. We try to keep this process as rule-free as possible, because we believe strict regulation goes against the nature of scientific experimentation.

The fair-use policy we enforce here is aimed at giving everybody an equal chance to use the available capacity. This means we will not allow anyone to monopolize the Hadoop cluster.

To make this process as smooth as possible, we have the following tips:

  • Do not wait until the last moment to run your experiment. We will not move the deadline or make exceptions!
  • Test your code locally first. We have included a testset in the VM.
  • Once your code works locally, test it on the testset on the cluster. This set is much smaller than the full set, so your test runs will finish much sooner.
  • Try not to use more than around 100 reducers (see the sketch after this list). Reduce slots do not become available until your job ends, so small jobs might have to wait for your big job.
  • Try to limit your runs on the complete dataset: the more complete runs you do, the more likely you are to monopolize the cluster.
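
If you write your job in Java against the Hadoop 0.20 API, the number of reducers is set on the Job object in the driver. Below is a minimal sketch, not taken from the included examples; the class name and job name are made up, and it relies on Hadoop's default identity mapper and reducer, so it only shows where the cap goes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver sketch: where the reducer cap goes in a 0.20-API job.
public class ReducerCapDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "reducer-cap-example");
    job.setJarByClass(ReducerCapDriver.class);

    // Stay around 100 reducers, per the fair-use tips above.
    job.setNumReduceTasks(100);

    // Input and output paths are passed on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If you use Pig instead, the equivalent is a PARALLEL clause on your grouping operations or the default_parallel setting.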

We monitor the cluster continuously. We might contact you if we notice you are taking an unfair share of resources, and we might also kill jobs that do so. But please try not to let it get to that point.

We reserve the right to kill your jobs or deny you access at any time, at our discretion. This might be necessary if you do not comply with the fair-use policy, but other circumstances might also justify these actions.

Back to FAQ index


Where can I find the datasets?

In addition to the Common Crawl set, we provide two test sets. The three sets can be found at the following locations; a short sketch after the list shows how to switch a job between the cluster testset and the full set.

  • Testset on the VM: /home/participant/data/* (~163MB)
  • Testset on the cluster: /data/public/common-crawl/award/testset (~6.6GB)
  • Full set: /data/public/common-crawl/parse-output/* (~25TB)
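
When you move from testing to a full run, only the input path needs to change. A minimal sketch of that switch, assuming the rest of your job is set up elsewhere, could look like this (the helper class name is made up, and the exact glob under parse-output may need adjusting to the directory layout):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical helper: point a job at the cluster testset while developing,
// and only at the full set for the final runs.
public class InputSelector {
  // Paths as listed above; the glob below parse-output may need adjusting.
  public static final Path TESTSET = new Path("/data/public/common-crawl/award/testset");
  public static final Path FULL_SET = new Path("/data/public/common-crawl/parse-output/*");

  public static void addInput(Job job, boolean fullRun) throws IOException {
    FileInputFormat.addInputPath(job, fullRun ? FULL_SET : TESTSET);
  }
}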

Back to FAQ index


Hey! That's not 6 billion pages! What's going on?

You're right. The Common Crawl corpus is divided into three major subsets:

  • Crawl data from 2008 / 2009
  • Crawl data from 2009 / 2010
  • Crawl data from 2012

The 2012 set contains 3.8 billion pages in total. At SURFsara we currently have about a third of the 2012 set available. We are adding capacity for the remainder, but we do not expect to be able to offer the full 2012 set before January 15th, 2013. See the stats in the next answer for more detailed information on the Common Crawl 2012 set.

Back to FAQ index


What are the stats on the Common Crawl 2012 set?

Common Crawl's Chris Stephens mentions some statistics in a comment on the announcement of the 2012 set. Note that we do not have the full 2012 set available.

General stats
Total # of Web Documents: 3.8 billion
Total Uncompressed Content Size: > 100 TB
# of Domains: 61 million
# of PDFs: 92.2 million
# of Word Docs: 6.6 million
# of Excel Docs: 1.3 million

Domain name page count breakdown of the top 20 TLDs (these counts may include HTTP 404 results)
TLD count relative count
com 2,880,575,573 62.88%
org 324,888,772 7.09%
net 285,633,100 6.24%
de 225,021,051 4.91%
co.uk 157,660,729 3.44%
ru 78,841,251 1.72%
info 76,883,737 1.68%
pl 68,825,576 1.50%
nl 68,461,904 1.49%
fr 62,542,019 1.37%
it 59,027,654 1.29%
com.au 41,032,777 0.90%
edu 36,029,039 0.79%
com.br 35,458,446 0.77%
cz 34,635,725 0.76%
ca 32,767,169 0.72%
es 31,994,812 0.70%
jp 28,502,740 0.62%
ro 26,803,448 0.59%
se 25,399,890 0.55%

We have not verified these statistics.

Back to FAQ index


What is installed on the Virtual Machine?

The image runs Ubuntu GNU/Linux 12.04. Among others, we installed the following applications:

Back to FAQ index


What are the username and password of the Virtual Machine?

The user account is called participant and has the password hadoop. You need this password for tasks that require administrative rights (such as installing additional software).

Back to FAQ index


What MapReduce examples are included in the Virtual Machine?

The Virtual Machine comes with Eclipse and a modified version of the Common Crawl examples. The modifications are:

  • Ported to the 0.20 API
  • Included a working Pig Loader

The modified examples are also available in their own repository on GitHub.
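
To give an idea of what "ported to the 0.20 API" means: the new API uses the Mapper and Reducer base classes from org.apache.hadoop.mapreduce together with a Context object, instead of the interfaces in org.apache.hadoop.mapred. Below is a minimal, self-contained mapper/reducer pair in that style; it is not taken from the repository and simply counts tokens in plain text, roughly the same kind of count the Pig example in the next answer performs for content types.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Minimal 0.20-API sketch (not from the repository): the new API extends the
// Mapper/Reducer base classes and writes output through a Context object.
public class TokenCount {

  public static class TokenMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text token = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (token, 1) for every whitespace-separated token on the line.
      for (String t : value.toString().split("\\s+")) {
        if (!t.isEmpty()) {
          token.set(t);
          context.write(token, ONE);
        }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the counts emitted for each token.
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      context.write(key, new LongWritable(sum));
    }
  }
}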

Back to FAQ index


What Pig example is included in the Virtual Machine?

The examples in the Virtual Machine also come with an example Pig script, to illustrate the use of the Pig Loader. It does a simple count of content types and looks like this:

-- Register the example jars so Pig can find the ArcLoader.
register /home/participant/git/commoncrawl-examples/lib/*.jar;
register /home/participant/git/commoncrawl-examples/dist/lib/commoncrawl-examples-1.0.1.jar;
-- Load one ARC file from the local testset using the Pig Loader.
a = LOAD '/home/participant/data/1346864466526_10.arc.gz' USING org.commoncrawl.pig.ArcLoader() as (date, length, type, statuscode, ipaddress, url, html);
-- Group the records by content type and count each group.
words = foreach a generate flatten(type) as types;
grpd = group words by types;
cntd = foreach grpd generate group, COUNT(words);
dump cntd;

Back to FAQ index


Where can I get help with Hadoop?

Back to FAQ index


Where can I get help with the Common Crawl data?

Back to FAQ index


Why does it take so long before my job starts running at SURFsara?

The SURFsara Hadoop cluster is a multi-tenant cluster, which means you are sharing its resources with other users. If other jobs are occupying all of the cluster's processing power, your job will have to wait in the queue until capacity becomes available again.

Back to FAQ index


Where can I find information about jobs on the SURFsara cluster?

Both the namenode and jobtracker have a web interface. The Firefox browser in the VM already contains bookmarks to these pages.

You need to authenticate with kinit to get access to the web interfaces.

Back to FAQ index