show what you can do with 6 billion web pages
by SURFsara and Common Crawl
At SURFsara we serve a multitude of users, and it is important that everybody gets a chance to use the service. We try to keep this process as rule-free as possible, because we believe strict regulation goes against the nature of scientific experimentation.
The fair-use policy we enforce here is aimed at giving everybody an equal chance to use the available capacity. This means we will not allow monopolization of the Hadoop cluster.
In order to make this process as smooth as possible, we have the following tips:
We monitor the cluster continuously. We might contact you if we notice that you are taking an unfair share of resources, and we might also kill jobs that take an unfair share of resources. Please try not to let it get to that.
We reserve the right to kill your jobs or deny you access at any given time, and at our discretion. This might be necessary if you do not comply with the fair-use policy, but other circumstances might also justify these actions.
In addition to the Common Crawl set, we provide two test sets. The three sets can be found at:
The Common Crawl corpus is divided into three major subsets:
The 2012 set contains 3.8 billion pages in total. At SURFsara we currently have about a third of the 2012 set available. We are adding capacity for the remainder of the set, but we do not expect to be able to offer the full 2012 set before January 15th, 2013. See the stats for more detailed information on the Common Crawl 2012 set.
Common Crawl's Chris Stephens mentions some statistics in a comment on the announcement of the 2012 set. Note that we do not have the full 2012 set available.
Statistic | Value |
---|---|
Total # of Web Documents | 3.8 billion |
Total Uncompressed Content Size | > 100 TB |
# of Domains | 61 million |
# of PDFs | 92.2 million |
# of Word Docs | 6.6 million |
# of Excel Docs | 1.3 million |
TLD | count | relative count |
---|---|---|
com | 2,880,575,573 | 62.88% |
org | 324,888,772 | 7.09% |
net | 285,633,100 | 6.24% |
de | 225,021,051 | 4.91% |
co.uk | 157,660,729 | 3.44% |
ru | 78,841,251 | 1.72% |
info | 76,883,737 | 1.68% |
pl | 68,825,576 | 1.50% |
nl | 68,461,904 | 1.49% |
fr | 62,542,019 | 1.37% |
it | 59,027,654 | 1.29% |
com.au | 41,032,777 | 0.90% |
edu | 36,029,039 | 0.79% |
com.br | 35,458,446 | 0.77% |
cz | 34,635,725 | 0.76% |
ca | 32,767,169 | 0.72% |
es | 31,994,812 | 0.70% |
jp | 28,502,740 | 0.62% |
ro | 26,803,448 | 0.59% |
se | 25,399,890 | 0.55% |
We have not verified these statistics.
On the image we have installed Ubuntu GNU/Linux 12.04. Among other applications, we installed the following:
The user account is called participant and has the password hadoop. You need this password for tasks that require administrative rights (such as installing additional software).
The Virtual Machine comes with Eclipse and a modified version of the Common Crawl examples. The modifications are:
The modified examples are also available in their own repository on GitHub.
The examples in the Virtual Machine also come with an example Pig script to illustrate the use of the Pig loader. This example does a simple count of content types. It looks like this:
register /home/participant/git/commoncrawl-examples/lib/*.jar;
register /home/participant/git/commoncrawl-examples/dist/lib/commoncrawl-examples-1.0.1.jar;
a = LOAD '/home/participant/data/1346864466526_10.arc.gz' USING org.commoncrawl.pig.ArcLoader() as (date, length, type, statuscode, ipaddress, url, html);
words = foreach a generate flatten(type) as types;
grpd = group words by types;
cntd = foreach grpd generate group, COUNT(words);
dump cntd;
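Building on the same loader, the following sketch shows how you could produce a rough per-TLD page count like the table above. It is a minimal, illustrative sketch rather than part of the official examples: it assumes the same ArcLoader schema and sample ARC file as the script above, and the REGEX_EXTRACT expressions are our own simplification (they treat only the last label of the host name as the TLD, so co.uk would be counted as uk).

register /home/participant/git/commoncrawl-examples/lib/*.jar;
register /home/participant/git/commoncrawl-examples/dist/lib/commoncrawl-examples-1.0.1.jar;
-- load the sample ARC file with the same schema as in the content-type example
a = LOAD '/home/participant/data/1346864466526_10.arc.gz' USING org.commoncrawl.pig.ArcLoader() as (date, length, type, statuscode, ipaddress, url, html);
-- pull the host name out of the URL, then take the label after the last dot as a rough TLD
hosts = foreach a generate REGEX_EXTRACT(url, '^https?://([^/:]+)', 1) as host;
tlds = foreach hosts generate REGEX_EXTRACT(host, '\\.([a-z]+)$', 1) as tld;
grpd = group tlds by tld;
cntd = foreach grpd generate group, COUNT(tlds);
dump cntd;

To run this against one of the sets on the cluster instead of the local sample file, only the LOAD path should need to change.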
The SURFsara Hadoop cluster is a multi-tenant cluster. This means you are sharing the cluster's resources with other users. If other jobs are occupying all of the cluster's processing capacity, your job will have to wait in the queue until capacity frees up again.
Both the namenode and jobtracker have a web interface. The Firefox browser in the VM already contains bookmarks to these pages.
You need to authenticate with kinit to get access to the web interfaces.