Apache to Drill for big data in Hadoop
A new Apache Incubator proposal should see the Drill project offering a new open source way to interactively analyse large-scale datasets on distributed systems. Drill is inspired by Google's Dremel but is designed to be more flexible in terms of the query languages it supports. Dremel has been in use at Google since 2006 and is now the engine that powers Google's BigQuery analytics service.
The project is being led at Apache by developers from MapR, where the early Drill development took place. Drawn To Scale and Concurrent are also contributing, and MapR will contribute requirements and design documentation to the project. Hadoop is good for batch queries, but faster, interactive queries over huge data sets would allow those data sets to be explored more effectively. Drill, like Google's Dremel, does not replace MapReduce or Hadoop systems. It works alongside them, offering a system which can analyse the output of the batch processing system and its pipelines, or be used to rapidly prototype larger-scale computations.
Drill consists of a query language layer with a parser and execution planner, a low-latency execution engine for executing the plan, nested data formats for data storage, and a scalable data source layer. The query language layer will focus on Drill's own query language, DrQL, and the data source layer will initially use Hadoop as its source. Overall, the project will integrate closely with Hadoop, storing its data in Hadoop and supporting the Hadoop FileSystem, HBase and Hadoop data formats. Apache's Hive project is also being considered as the basis for DrQL.
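To make the layering concrete, the following is a minimal sketch of how a query might pass through a parser, a planner and an execution engine reading nested records from a pluggable data source. It is purely illustrative: the toy query grammar is not DrQL, whose syntax has not been published, and every name in it (`Plan`, `parse`, `execute`, `in_memory_source`, the `logs` source) is hypothetical and not taken from the Drill proposal.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple

Record = Dict[str, Any]
# Data source layer: a source is any callable that yields nested records.
Source = Callable[[], Iterator[Record]]


def in_memory_source(records: List[Record]) -> Source:
    """Stand-in for a real source such as files in Hadoop (assumption)."""
    return lambda: iter(records)


def get_path(record: Any, path: str) -> Any:
    """Follow a dotted path such as 'user.country' into a nested record."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


@dataclass(frozen=True)
class Plan:
    """Output of the query language layer: what to read, filter and return."""
    columns: Tuple[str, ...]
    source: str
    filter_path: Optional[str] = None
    filter_value: Optional[str] = None


def parse(query: str) -> Plan:
    """Parse 'SELECT a.b, c FROM name [WHERE x.y = value]' (toy grammar, not DrQL)."""
    tokens = query.replace(",", " ").split()
    upper = [t.upper() for t in tokens]
    if not upper or upper[0] != "SELECT" or "FROM" not in upper:
        raise ValueError("expected SELECT ... FROM ...")
    from_idx = upper.index("FROM")
    columns = tuple(tokens[1:from_idx])
    source = tokens[from_idx + 1]
    if "WHERE" in upper:
        where_idx = upper.index("WHERE")
        path, op, value = tokens[where_idx + 1 : where_idx + 4]
        if op != "=":
            raise ValueError("only '=' filters are supported in this sketch")
        return Plan(columns, source, path, value)
    return Plan(columns, source)


def execute(plan: Plan, sources: Dict[str, Source]) -> Iterator[Record]:
    """Execution engine: stream records, apply the filter, project columns."""
    for record in sources[plan.source]():
        if plan.filter_path is not None:
            if str(get_path(record, plan.filter_path)) != plan.filter_value:
                continue
        yield {column: get_path(record, column) for column in plan.columns}


if __name__ == "__main__":
    sources = {
        "logs": in_memory_source([
            {"user": {"country": "NL"}, "bytes": 120},
            {"user": {"country": "US"}, "bytes": 300},
            {"user": {"country": "NL"}, "bytes": 80},
        ])
    }
    plan = parse("SELECT user.country, bytes FROM logs WHERE user.country = NL")
    for row in execute(plan, sources):
        print(row)
```

Because the plan is kept separate from both the query text and the data source, a different query language front end or a different storage back end could in principle be swapped in without touching the execution step, which is the kind of flexibility the proposal describes.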
The developers hope that by developing in the open at Apache, they will be able to create and establish Drill's own APIs and ensure a robust, flexible architecture which will support a broad range of data sources, formats and query languages. The project has been accepted into the Incubator and so far has an empty Subversion repository.