SURFsara and Common Crawl share the vision that studying the web should be possible for everybody. This is based on the belief that the data, tools, and knowledge needed for this purpose should be open. As part of this vision, the Norvig Web Data Science Award aims to promote this knowledge, these tools, and the data.

How many pages in the Common Crawl data are spam?

What are the most controversial pages in Common Crawl?

To what extent does the deep web appear in Common Crawl?

How wide are networks of linked pages discussing a certain event?

What kind of pages link to social networks?

The Norvig Web Data Science Award is an award for students and researchers studying at or employed by a research institute or university in the Netherlands. It is a challenge in which participants show what they can do with the Common Crawl dataset - a snapshot of a large part of the web - using SURFsara’s Hadoop service to provide big data compute power.

Review process

Submissions of results will be reviewed by our jury, and participants will be notified of the results before February 22, 2013.


The winner gets a tablet (type TBA) and 1500 Euro to spend on travel, accommodation, and conference registration for one person to attend SIGIR 2013 in Dublin, Ireland. The winner is also expected to give a lightning talk at Hadoop Summit Amsterdam, and gets a free access pass for the whole event. If applicants entered as a group and the group produced the winning entry, the prize will be awarded to whomever the group chooses, but each prize is awarded to a single person only.

Award Ceremony

The award ceremony will be held on March 18, 2013, at the University of Twente.

The name of the award

The award is named after Peter Norvig, Google’s director of research with a resume too impressive to summarize. Peter is on the advisory board of Common Crawl, and is part of the jury for this award.

Lisa Green is the Director of the Common Crawl Foundation, where she oversees the foundation’s mission of building, maintaining, and openly disseminating a comprehensive crawl of the web. Common Crawl’s 130TB corpus of over 8 billion web pages enables innovation in education, research, and business. Prior to Common Crawl, she was the Chief of Staff at Creative Commons. Lisa holds a PhD in physical chemistry from the University of California, Berkeley, lives in San Francisco, and is passionate about open systems and big data.