Norvig Web Data Science Award

show what you can do with 3 billion web pages
by SURFsara and CommonCrawl

We provide a Virtual Machine that you can download and run on your own laptop or PC. This environment includes all the tools you need for hacking on the Common Crawl dataset.


Here we give you a feel for how Hadoop is used to process the Common Crawl dataset, using a few MapReduce examples and a Pig example.
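
To make this concrete, below is a minimal sketch of a Hadoop MapReduce job in the style of the classic word count. It assumes the crawl text is already available as plain-text input; the class name, job name, and paths are illustrative placeholders, not part of the provided examples.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit each token in the input line with a count of 1.
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the partial counts emitted for each word.
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // combiner reduces shuffle volume
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. a small sample of crawl text
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The same map/reduce structure carries over to more interesting analyses: swap the tokenizer in the mapper for whatever per-page extraction you want, and the reducer for your aggregation.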


Now that you have run a few examples, it's time to start hacking on your own ideas!


Done with testing and ready for the real work? Find out here how to run your job on the complete dataset!
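
As a rough sketch of what such a run might look like, a job like the example above would be submitted with the standard hadoop jar command. The jar name and HDFS paths below are hypothetical; substitute the actual location of the dataset on the cluster.

```sh
# Submit the packaged job to the cluster; both paths are placeholders.
hadoop jar wordcount.jar WordCount \
    /data/common-crawl/text \
    /user/yourname/wordcount-output
```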