Yahoo! commits to Apache Hadoop, drops Yahoo! Hadoop
Yahoo! has announced that it is dropping its own distribution of Hadoop and plans to work more closely with the Apache Hadoop community. The Yahoo! distribution had been a vehicle for the company to experiment with and release its own work on the distributed computing and storage framework, but this appears to have been to the detriment of Apache's Hadoop. "Unfortunately, Apache is no longer the obvious place to go for Hadoop releases", said Eric Baldeschwieler, Yahoo!'s VP of Software Engineering, adding that Yahoo! has always been committed to open sourcing its work. After reviewing its options, Yahoo! has decided to focus on working with the Apache Hadoop community and to be prepared to compromise on how it achieves its development goals.
The challenge for the company now is how to incorporate "several person-years worth of work" into Apache. Yahoo! has two branches: a "sustaining" stable branch, which it runs on its 40,000 nodes internally, and a "future" development branch. It has begun merging code from the "sustaining" branch, which Baldeschwieler says is "our most stable and high performance release of Hadoop ever", into a branch at Apache. Once the community has approved that branch for an Apache release, the company plans to move on to integrating its "future" branch, which features the ability to handle more storage per cluster, a new metrics framework and optimisation for small jobs. Once that is complete, Yahoo! hopes to return to a transparent, regular development cycle and actively synchronise its work with other Hadoop contributors. Baldeschwieler concluded by saying "Our goal is to make Apache Hadoop THE open source platform for big data".