Now that you have tested your code and it runs without problems on the test subset, it is time to generate your final results by running your application on the cluster against the full dataset.
Before submitting jobs that run on the full dataset, make sure you understand the guidelines of the fair-use policy. Because the full dataset is much larger than the test subset, following these guidelines matters even more here.
Submitting a job that runs on the full set is simply a matter of changing the input path of your job. Just change the input path from /data/public/common-crawl/crawl-data/CC-TEST-2014-10/ to /data/public/common-crawl/crawl-data/CC-MAIN-2014-10/. See the FAQ entry "Where can I find the datasets?".
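One way to keep the switch between the two datasets explicit is a small shell helper, sketched below. The two paths are the ones given above; the names TEST_INPUT, FULL_INPUT and input_path are hypothetical and chosen only for illustration, and how the resulting path is passed to your job depends on your own application.

```shell
# Input paths as given in the text above.
TEST_INPUT="/data/public/common-crawl/crawl-data/CC-TEST-2014-10/"
FULL_INPUT="/data/public/common-crawl/crawl-data/CC-MAIN-2014-10/"

# Hypothetical helper: print the input path for the named dataset.
# Accepts "test" or "full"; any other name is an error.
input_path() {
  case "$1" in
    test) printf '%s\n' "$TEST_INPUT" ;;
    full) printf '%s\n' "$FULL_INPUT" ;;
    *)    echo "unknown dataset: $1" >&2; return 1 ;;
  esac
}
```

For example, INPUT="$(input_path full)" stores the full-dataset path in INPUT, which you can then use wherever your job expects its input path.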
As with the test subset, you need to authenticate before submitting. You can do this by opening a terminal and running kinit USERNAME.
After you have finished the above, don't forget to submit your results before the deadline!