Norvig Web Data Science Award

show what you can do with 3 billion web pages
by SURFsara and Common Crawl

SURFsara and Common Crawl share the vision that studying the web should be possible for everybody. This vision rests on the belief that the data, tools, and knowledge needed for this purpose should be open. As part of this vision, the Norvig Web Data Science Award aims to promote this knowledge, these tools, and the data.

Timeline for submissions

The deadline for the 2014 Norvig Web Data Science Award is 11:59pm, August 31, 2014. The award is open for applications starting May 2014. Make sure to carefully read all the instructions below, including the entry criteria and proposal format.

Examples of questions you might answer by looking at the Common Crawl dataset

How many pages in the Common Crawl dataset are spam?

What are the most controversial pages in Common Crawl?

To what extent does the deep web appear in Common Crawl?

How wide are networks of linked pages discussing a certain event?

What kind of pages link to social networks?

The Norvig Web Data Science Award is an award for students and researchers studying at or employed by a research institute or university in the Netherlands. It is a challenge in which participants show what they can do with the Common Crawl dataset - a snapshot of a large part of the web - using SURFsara’s Hadoop service to provide big data compute power.

Review process

Submissions of results will be reviewed by our jury after the deadline. Participants will be notified of the results by September 30, 2014. The winners will be announced at an award ceremony in October 2014. The exact date and location will be announced shortly.

The name of the award

The award is named after Peter Norvig, Google’s director of research with a resume too impressive to summarize. Peter is on the advisory board of Common Crawl.

Peter Norvig

Peter Norvig is a Fellow of the American Association for Artificial Intelligence and the Association for Computing Machinery. At Google Inc he was the Director of Search Quality, responsible for the core web search algorithms, from 2002 to 2005, and has been a Director of Research since 2005.

Previously he was the head of the Computational Sciences Division at NASA Ames Research Center, making him NASA’s senior computer scientist. He received the NASA Exceptional Achievement Award in 2001. He has taught at the University of Southern California and the University of California at Berkeley, from which he received a Ph.D. in 1986 and the distinguished alumni award in 2006. He was co-teacher of an Artificial Intelligence class that signed up 160,000 students, helping to kick off the current round of massive open online classes. He has over fifty publications in Computer Science, concentrating on Artificial Intelligence, Natural Language Processing and Software Engineering, including the books Artificial Intelligence: A Modern Approach (the leading textbook in the field), Paradigms of AI Programming: Case Studies in Common Lisp, Verbmobil: A Translation System for Face-to-Face Dialog, and Intelligent Help Systems for UNIX. He is also the author of the Gettysburg PowerPoint Presentation and the world’s longest palindromic sentence.


Jimmy Lin

I'm an associate professor in the iSchool at the University of Maryland, with appointments in the Institute for Advanced Computer Studies (UMIACS) and the Department of Computer Science. I joined the faculty in August 2004, shortly after completing my Ph.D. in Electrical Engineering and Computer Science at MIT, and was promoted to associate professor in March 2009.

I work on "big data", with a particular focus on large-scale distributed algorithms for text processing. My research lies at the intersection of natural language processing (NLP) and information retrieval (IR). I'm a member of both the Computational Linguistics and Information Processing Lab (CLIP) and the Human-Computer Interaction Lab (HCIL).


Arjen P. de Vries

I am the group leader of the Centrum Wiskunde & Informatica (CWI) research group Interactive Information Access (INS2). The main topics of my research include structured document retrieval and entity ranking, multimedia information retrieval, the application of information retrieval theory to recommendation systems and social media, nearest neighbour search in high dimensional spaces, the integration of information retrieval and database technology, and the evaluation methodology needed in these novel information retrieval application areas. I am very interested in the newly proposed dataspaces abstraction.

Since September 2008 I have been Full Professor of Multimedia Dataspaces at Delft University of Technology. Finally, I am co-founder of the recent CWI spin-off Spinque.


Evert Lammerts

Evert Lammerts is co-founder and CEO of Lucipher, a company that is building the first end-to-end encrypted and verifiably secure data platform in the cloud. Before starting down this exciting road, he built Hadoop-based services at SURFsara from the ground up. He founded the Netherlands Hadoop User Group, has lectured on Hadoop-related subjects at universities and conferences, initiated the first Peter Norvig Web Data Science Award, and chaired the content selection of the Operating Hadoop track at the Hadoop Summit Europe in 2013.