Blekko donates 81 terabytes of data to Common Crawl
The search engine Blekko has contributed 81 terabytes of data to the Common Crawl Foundation. The data consists of ranking metadata gathered from crawling web sites between February and November 2012, and will be used by the Common Crawl Foundation to improve the quality of its crawling by allowing it to more easily avoid "webspam, porn and the influence of excessive SEO".
The Common Crawl Foundation was founded by Gil Elbaz to produce and maintain an open repository of web crawl data. The idea was that, by allowing widespread access to this data, the web would become more democratic. "As the largest and most diverse collection of information in human history, the web grants us tremendous insight if we can only understand it better," says the Foundation.
To access the Foundation's data, users can run an Amazon Machine Image against the Common Crawl data; instructions on the process are available on the web site. The data itself is stored as JSON in Hadoop SequenceFiles on Amazon S3; as of October 2012, there were around six billion web documents stored. Other statistics are available.
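Because each crawl record is plain JSON once it has been read out of a SequenceFile, it can be decoded with any standard JSON library. A minimal sketch in Python, assuming hypothetical field names (`url`, `mime_type`, `length`) purely for illustration; the actual schema of Common Crawl's records may differ:

```python
import json

# A hypothetical crawl-metadata record; the real field names in
# Common Crawl's SequenceFiles may differ from those shown here.
sample_record = '{"url": "http://example.com/", "mime_type": "text/html", "length": 12345}'

def parse_record(raw):
    """Decode one JSON crawl record into a Python dict."""
    return json.loads(raw)

record = parse_record(sample_record)
print(record["url"])   # the document's URL
print(record["mime_type"])
```

In practice a Hadoop job running on the Amazon Machine Image would apply a function like `parse_record` to each value in the SequenceFile rather than to a single string.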
Blekko's CTO, Greg Lindahl, says "we’re helping Common Crawl because Common Crawl is taking strides towards our shared vision of an open and transparent Internet". In total, the donated data includes information on 140 million web sites and 22 billion web pages.