Norvig Web Data Science Award

show what you can do with 3 billion web pages
by SURFsara and CommonCrawl

Now that you have tested your code and it runs without problems on the test subset, it is time to generate your final results by running your application on the cluster against the full dataset.

Submit job on full set

Before submitting jobs that run on the full dataset, make sure you understand the guidelines of the fair-use policy. Because the full dataset is so much larger, this is even more important than it was for the test subset.

Submitting a job that runs on the full set only requires changing the input path of your job: replace /data/public/common-crawl/crawl-data/CC-TEST-2014-10/ with /data/public/common-crawl/crawl-data/CC-MAIN-2014-10/. See the FAQ entry "Where can I find the datasets?".
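One way to avoid editing the path by hand is to keep both paths in one place and choose between them with a flag. The following is only a minimal sketch in Python: the two paths come from the text above, but the names TEST_INPUT, FULL_INPUT, resolve_input_path, parse_args and the --full flag are hypothetical and not part of any SURFsara tooling. Adapt the idea to the language and job driver you actually use.

```python
import argparse

# Input paths as given in the instructions above.
TEST_INPUT = "/data/public/common-crawl/crawl-data/CC-TEST-2014-10/"
FULL_INPUT = "/data/public/common-crawl/crawl-data/CC-MAIN-2014-10/"


def resolve_input_path(full):
    """Return the full-dataset path when `full` is true, else the test subset."""
    return FULL_INPUT if full else TEST_INPUT


def parse_args(argv=None):
    """Parse a hypothetical --full flag; the default stays on the test subset."""
    parser = argparse.ArgumentParser(description="Pick the job input path.")
    parser.add_argument(
        "--full",
        action="store_true",
        help="run on the full dataset (CC-MAIN-2014-10)",
    )
    return parser.parse_args(argv)
```

Calling resolve_input_path(parse_args(["--full"]).full) yields the CC-MAIN-2014-10 path, while omitting the flag keeps the job on the test subset. Defaulting to the test subset means a full-dataset run only happens when you ask for it explicitly.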

As with the test subset, you need to authenticate before submitting. You can do this by opening a terminal and running kinit USERNAME.

Write your report

After you have finished the above, don't forget to submit your results before the deadline!