Cloudera releases third version of integrated Hadoop distribution
Hadoop specialist Cloudera has announced the general availability of CDH3 (Cloudera's Distribution of Hadoop, version 3). CDH3 is a data management platform based on Apache's Hadoop project. Hadoop itself is a software framework designed for data-intensive distributed applications. It was inspired by Google's publications on its MapReduce and Google File System technology. Under the umbrella of the Apache Software Foundation, an ecosystem of projects has since developed around it, using the framework for massive-scale data processing.
Cloudera is a venture-capital-backed company which focuses on selling support and services around Hadoop, and CDH is the company's collation and curation of the Hadoop technology and ecosystem. CDH2 included Hadoop, Hive for performing SQL-like queries on large data sets, and the Pig dataflow language and compiler. CDH3 now also includes projects such as the HBase database, the Apache ZooKeeper distributed coordination service, Apache Whirr cloud support, the Hue browser-based front end for Hadoop, the Oozie workflow engine, Sqoop database integration and Flume data collection.
CDH3 has been in beta testing with Cloudera's enterprise clients, and according to Mike Olson, Cloudera CEO, that includes "some of the most demanding production environments in the world". Despite the enterprise-level testing, Cloudera is emphatic that CDH3 is open source: "Make no mistake, this is a pure open source software stack, 100 per cent Apache licenced", said Olson.
Performance has been enhanced in CDH3 over previous versions, partly due to the more efficient kernels in newer Linux distributions. According to Cloudera, small MapReduce jobs run three times faster and file system I/O is 20 per cent faster. The company also notes a twofold performance improvement in HBase queries. CDH3 also includes hundreds of bug fixes which have been backported to the various packages. A new ODBC driver is included to improve BI (Business Intelligence) client integration, and support for incremental import/export from relational databases has been incorporated. CDH3 also extends the Hadoop authentication and security model to all the other components, giving the stack more consistent security.
CDH3 is available for Red Hat, CentOS, SUSE, and Ubuntu Linux distributions and works with 64-bit and 32-bit Java. A detailed list of new features in CDH3 is available, along with a list of changes that introduce incompatibilities with previous versions. Hadoop version and patching information is also available. The previous version's support for Debian Lenny and Ubuntu Hardy (8.04 LTS), Jaunty (9.04) and Karmic (9.10) has been dropped in CDH3, but RHEL 6.0, SLES 11 and Ubuntu Lucid (10.04) and Maverick (10.10) have been added to the supported host platform list. Users can download CDH3 as a source tarball, as installable rpm or deb packages, or as a virtual machine image for installation in the cloud.