09 November 2012, 12:38

Facebook open sources its MapReduce successor


Facebook has open sourced Corona, its scheduling component for Hadoop, which the company calls "the next version of Map-Reduce". Facebook uses its own fork of Apache Hadoop, which is optimised for the massive scale of its operations.

The current Hadoop implementation of the MapReduce technique uses a single job tracker, which causes scaling issues for very large data sets. The Apache Hadoop developers have been creating their own next-generation MapReduce, called YARN, which Facebook engineers looked at but discounted because of the highly customised nature of the company's deployment of Hadoop and HDFS.

Corona, like YARN, spawns multiple job trackers (one for each job, in Corona's case). This has several advantages over traditional MapReduce implementations, according to Facebook. The improved job handling leads to better scalability and lower latency, especially when working with large data sets. The company also claims that Corona improves cluster utilisation with a "generally fair scheduler" and lower scheduling overhead.
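
To make the difference concrete, here is a toy Python model of the bookkeeping load. It is not Corona's code: the job names, task counts and the function busiest_component_load are hypothetical, and the model ignores scheduling and resource allocation entirely. It only counts how many task-status updates land on the busiest tracker when all jobs share one tracker, compared with one tracker per job.

```python
from collections import Counter


def busiest_component_load(jobs, per_job_trackers):
    """Return the number of task-status updates handled by the busiest tracker.

    jobs maps a job name to its task count (hypothetical toy data).
    """
    load = Counter()
    for job, task_count in jobs.items():
        # With one tracker per job, each job's bookkeeping lands on its own
        # tracker; otherwise every task reports to the single shared tracker.
        tracker = f"tracker-{job}" if per_job_trackers else "central-tracker"
        load[tracker] += task_count
    return max(load.values())


# Hypothetical workload: three jobs of different sizes.
jobs = {"ads": 4000, "logs": 2500, "search": 1500}

single = busiest_component_load(jobs, per_job_trackers=False)
per_job = busiest_component_load(jobs, per_job_trackers=True)
print(single, per_job)
```

In this sketch the shared tracker's load grows with the total number of tasks in the cluster, while a per-job tracker's load is bounded by the size of its own job.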

Facebook's situation is uncommon, with several petabytes of data being added to its clusters every few days, and Corona is explicitly tailored to its infrastructure, so anyone who wants to run Corona will have to use the company's version of Hadoop. For users who want to give the software a try, there are instructions available on how to set up a Corona node on a single machine.

Facebook's version of Hadoop, including Corona, is licensed under the Apache License 2.0, and the source code for Corona is available from GitHub.
