Apache Nutch 2.0 indexes at web scale
The Apache Nutch developers have announced that version 2.0 of the network crawling and indexing search framework is now available. Built on top of other Apache projects including Solr, Tika, Hadoop and Gora, Nutch has been designed to crawl "at web scale" to allow organisations to create searchable indexes of their web-published content. Nutch adds web-specific functionality to Solr with a link-graph database and uses Tika to parse web pages and a number of other document formats.
Nutch 2.0 is an independent branch of Nutch development; in September 2011, the development team decided to keep the 1.x series as the mainstream line, while the 2.x series would focus on large-scale web crawling. Nutch 2.0's large-scale crawling capabilities have been enabled by adding a storage abstraction layer, which allows it to plug into Apache's Accumulo, Avro, Cassandra, HBase or HDFS big data storage platforms, as well as SQL-based storage systems. The work on that abstraction led to the creation of Apache Gora, a framework for in-memory data models and big data persistence. Nutch is easily customisable through a plugin architecture that supports modules for tasks such as document parsing and ranking.
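The idea behind a storage abstraction layer can be pictured with a small sketch. The Python below is purely illustrative and is not the Apache Gora or Nutch API: the names `PageStore`, `InMemoryPageStore` and `Crawler` are hypothetical, and an in-memory dictionary stands in for a real backend such as HBase or Cassandra. It only shows the general pattern of crawl code that depends on an interface rather than on one storage system.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class PageStore(ABC):
    """Hypothetical storage interface (an assumption, not Gora's API)."""

    @abstractmethod
    def put(self, url: str, content: str) -> None:
        """Persist the fetched content for a URL."""

    @abstractmethod
    def get(self, url: str) -> Optional[str]:
        """Return the stored content for a URL, or None if absent."""


class InMemoryPageStore(PageStore):
    """Stand-in backend that keeps pages in a plain dictionary."""

    def __init__(self) -> None:
        self._pages: Dict[str, str] = {}

    def put(self, url: str, content: str) -> None:
        self._pages[url] = content

    def get(self, url: str) -> Optional[str]:
        return self._pages.get(url)


class Crawler:
    """Crawl logic that only knows about the PageStore interface."""

    def __init__(self, store: PageStore) -> None:
        self.store = store

    def record(self, url: str, content: str) -> None:
        # A different backend can be swapped in without touching this code.
        self.store.put(url, content)


store = InMemoryPageStore()
crawler = Crawler(store)
crawler.record("http://example.org/", "<html>hello</html>")
print(store.get("http://example.org/"))
```

In this sketch, moving to another backend would mean writing one more `PageStore` subclass; the `Crawler` class would stay unchanged.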
As an example of Nutch's applications, Apache points to Kalooga, a company which uses Nutch 2.0 in production – with HBase as a backend on a 34-node Hadoop cluster – to provide a visual relevance service for online publishers. Mathijs Homminga, CTO of Kalooga, said: "The fact that Nutch is implemented on top of Hadoop is essential for us since it allows us to be scalable in storage and processing". Although Nutch has been enhanced so it can scale up, the developers say it still addresses many other use cases, all the way down to a "small crawl on a single machine".
Nutch 2.0 "shadows the latest mainstream release (1.5.x)" and both are available to download from a range of mirror sites. A detailed list of changes is available in the CHANGES file. Nutch 1.5.1, the latest release in the 1.5.x series, has also been released, with five fixes for flaws in its predecessor, 1.5, which was released in April 2012. Like all Apache projects, Apache Nutch is licensed under the Apache 2.0 licence.