Building graphs with Hadoop
Faced with a mass of unstructured data, the first step in analysing it should be to organise it, and the first step of that process should be working out how it should be organised. For data made up of entities and the relationships between them, a graph is a natural choice, but the data then has to be fed into that graph, which can take a long time and may be inefficient. That's why Intel has announced the release of the open source GraphBuilder library, a tool meant to help scientists and developers working with large amounts of data build applications that make sense of it.
The library plugs into Apache Hadoop and is designed to create graphs from big data sets, which can then be used in applications. GraphBuilder is written in Java using the MapReduce parallel programming model and takes care of many of the complexities of graph construction. According to the developers, this makes it easier for scientists and developers who do not necessarily have skills in distributed systems engineering to make use of large data sets in their Hadoop applications. They can focus on writing the code that breaks the data up into meaningful nodes and useful edge information. That code is then run across the distributed architecture, where the library also performs a wide range of other useful processes to optimise the data for later analysis.
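To give a feel for the division of labour described above, here is a minimal sketch of map-and-reduce style graph construction in plain Java. It does not use GraphBuilder or Hadoop, and it is not GraphBuilder's API: the class name EdgeListSketch, the "source target" input format, and the parse and build methods are all assumptions made for illustration. A parallel stream stands in for the cluster.

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical illustration only; not part of GraphBuilder or Hadoop.
public class EdgeListSketch {

    // "Map" step: turn one raw record into a (source, target) edge.
    // Assumes each record is two whitespace-separated vertex names.
    public static Map.Entry<String, String> parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2 || parts[0].equals(parts[1])) {
            return null; // drop malformed records and self-loops
        }
        return Map.entry(parts[0], parts[1]);
    }

    // "Reduce" step: group edges by source vertex.
    // Collecting targets into a set removes duplicate edges.
    public static Map<String, TreeSet<String>> build(List<String> lines) {
        return lines.parallelStream()
                .map(EdgeListSketch::parse)
                .filter(Objects::nonNull)
                .collect(Collectors.groupingBy(
                        e -> e.getKey(),
                        TreeMap::new,
                        Collectors.mapping(e -> e.getValue(),
                                Collectors.toCollection(TreeSet::new))));
    }

    public static void main(String[] args) {
        List<String> raw = List.of("a b", "a c", "b c", "a b", "broken", "c c");
        System.out.println(build(raw)); // prints {a=[b, c], b=[c]}
    }
}
```

In this sketch the only domain-specific part is parse, which decides what counts as a node and an edge. Grouping, de-duplication and parallel execution are handled by generic machinery, which is roughly the split the developers describe.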
In a whitepaper released by the Intel engineers who designed GraphBuilder, examples such as internet connectivity models, social network relationships, and genetic data are given as fields which could make use of this kind of processing. GraphBuilder should make it easier to analyse data like this with machine learning algorithms, which can be applied to the generated graphs. A team from the University of Washington has created a tool called GraphLab to do this type of analysis, but they discovered that constructing large enough graphs to feed into their tool was still a weak spot. The team was "constantly writing scripts to construct different graphs from various unstructured data sources", according to the Intel Labs blog post that introduces GraphBuilder.
These scripts were run on a single machine and took a long time to execute. Since graph building is an activity that benefits from parallelisation, the developers decided to create a library for Hadoop, which they identified as being a very good fit for graph construction. Besides the actual graph construction, GraphBuilder also takes care of cleaning up, compressing, partitioning and serialising the resulting graphs. According to Intel, "GraphBuilder makes it possible for a Java programmer to build an internet-scale graph for PageRank in about 100 lines of code".
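For readers unfamiliar with PageRank, the following is a small, single-machine sketch of the algorithm that such a graph would feed. It is not Intel's code and does not use GraphBuilder or GraphLab. The class name PageRankSketch, the damping factor of 0.85 and the fixed iteration count are assumptions chosen for illustration, and the rank of vertices without outgoing links is spread evenly over all vertices, which is one common convention among several.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of GraphBuilder or GraphLab.
public class PageRankSketch {

    // Power iteration over an adjacency list.
    // Assumes every vertex, including link targets, appears as a key in "out".
    public static Map<String, Double> rank(Map<String, List<String>> out,
                                           double damping, int iterations) {
        int n = out.size();
        Map<String, Double> rank = new TreeMap<>();
        for (String v : out.keySet()) {
            rank.put(v, 1.0 / n); // start from a uniform distribution
        }
        for (int i = 0; i < iterations; i++) {
            double dangling = 0.0; // rank held by vertices with no out-links
            Map<String, Double> next = new TreeMap<>();
            for (String v : out.keySet()) {
                next.put(v, 0.0);
            }
            for (Map.Entry<String, List<String>> e : out.entrySet()) {
                double r = rank.get(e.getKey());
                List<String> targets = e.getValue();
                if (targets.isEmpty()) {
                    dangling += r;
                    continue;
                }
                double share = r / targets.size();
                for (String t : targets) {
                    next.merge(t, share, Double::sum);
                }
            }
            // Random-jump term plus an even share of the dangling rank.
            double base = (1.0 - damping) / n + damping * dangling / n;
            for (Map.Entry<String, Double> e : next.entrySet()) {
                e.setValue(base + damping * e.getValue());
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("a", List.of("b", "c"));
        graph.put("b", List.of("c"));
        graph.put("c", List.of("a"));
        System.out.println(rank(graph, 0.85, 50));
    }
}
```

On a three-vertex toy graph this runs instantly. The difficulty the article describes lies in producing the adjacency list itself when the input is an internet-scale collection of unstructured data.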
Source code for GraphBuilder is available from Intel's 01.org site. The library is licensed under the Apache License 2.0 and is currently in beta.