The H Speed Guide to NoSQL

by Dj Walker-Morgan

What is NoSQL? Why does it seem that every day another company starts using a NoSQL database? Will NoSQL replace SQL? The H Speed Guide to NoSQL answers those questions...

The rise of the NoSQL movement has brought debate back to the database space, as the traditional relational model's applicability to all problems has been questioned, not just in theory, but in practical code. At the heart of NoSQL is not a rejection of SQL itself; some have said that NoSQL, rather than standing in opposition to SQL as "No SQL", really stands for "Not Only SQL". It represents, rather, a deeper desire to explore database models which have, in the past and for various reasons, been left to languish in obscurity.

But this is not the first time SQL-based relational databases have been challenged; in the nineties, Object Oriented Databases turned up and offered what appeared to be a way to relieve developers of converting their data objects into rows and columns. In practice, though, the Object Oriented Database implementations were often complex and fragile, leaving disillusioned developers to look for better ways again. What emerged were better Object Relational Mappers, which took care of more of the work of converting data objects to a relational form whilst keeping the interoperable SQL database at their core.

The difference with the NoSQL movement is that much of the innovation has come from practical developments often leveraging the power of open source development, rather than theoretical positioning. NoSQL is not, though, a panacea for all database needs.

There are some common properties of NoSQL databases; they tend to do away with the rigid schemas of relational databases and use more flexible techniques for defining what data is being held in the database or leave it up to the applications using the database. NoSQL database services also tend to use open protocols for client communication. NoSQL databases are often, but not necessarily, built for scale too, offering the ability to manage large data sets over a cluster of commodity hardware, rather than a very high specification single server.

If we look at the various kinds of NoSQL, we find four major breeds: the document stores, the key/value stores, the column oriented databases and the graph stores.

Document stores and CouchDB

Document stores retain documents of any length and allow for retrieval based on the document content. So, for example, a document may consist of text such as:

    {
        "FirstName": "Wallace",
        "Address": "62 West Wallaby Street",
        "Interests": ["cheese", "crackers", "moon landings"]
    }

A query on the database for documents where FirstName is "Wallace" would return that document. This is unstructured information, and the documents can vary in what fields they contain. XML databases are document oriented databases which contain semi-structured data. CouchDB, for example, is a document oriented database which stores JSON documents in the format you see above. To query CouchDB, you define JavaScript based views, each of which uses a simple map function to "emit" the fields needed for the view. Views can also include a reduce function, which is handed the results of the map as an ordered list and allows the results to be further filtered or evaluated.
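The map and reduce steps described above can be sketched in plain Java. This is only an in-memory analogy under stated assumptions, not CouchDB's API: real CouchDB views are written in JavaScript and run inside the server, and every name here (`DocumentViewSketch`, `MapFunction`, `buildView`, `countPerKey`, `BY_INTEREST`, and the second sample document) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

/** Hypothetical in-memory model of a document store with a map/reduce view. */
final class DocumentViewSketch {

    /** Map step: inspects one document and emits zero or more key/value pairs. */
    interface MapFunction {
        void apply(Map<String, Object> doc, BiConsumer<String, Object> emit);
    }

    /** Runs the map function over every document; rows come back ordered by key. */
    static TreeMap<String, List<Object>> buildView(
            List<Map<String, Object>> docs, MapFunction mapFn) {
        TreeMap<String, List<Object>> rows = new TreeMap<>();
        for (Map<String, Object> doc : docs) {
            mapFn.apply(doc, (key, value) ->
                    rows.computeIfAbsent(key, k -> new ArrayList<>()).add(value));
        }
        return rows;
    }

    /** Reduce step: collapses the emitted values for each key, here by counting them. */
    static TreeMap<String, Integer> countPerKey(TreeMap<String, List<Object>> rows) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        rows.forEach((key, values) -> counts.put(key, values.size()));
        return counts;
    }

    /** Two documents with different fields, as a schema-free store allows. */
    static List<Map<String, Object>> sampleDocs() {
        Map<String, Object> wallace = Map.of(
                "FirstName", "Wallace",
                "Address", "62 West Wallaby Street",
                "Interests", List.of("cheese", "crackers", "moon landings"));
        Map<String, Object> gromit = Map.of(
                "FirstName", "Gromit",
                "Interests", List.of("knitting", "cheese"));
        return List.of(wallace, gromit);
    }

    /** Emits one row per interest, keyed by interest, with the first name as value. */
    static final MapFunction BY_INTEREST = (doc, emit) -> {
        Object interests = doc.get("Interests");
        if (interests instanceof List<?>) {
            for (Object interest : (List<?>) interests) {
                emit.accept(String.valueOf(interest), doc.get("FirstName"));
            }
        }
    };

    public static void main(String[] args) {
        TreeMap<String, List<Object>> view = buildView(sampleDocs(), BY_INTEREST);
        System.out.println(view.get("cheese"));   // people interested in cheese
        System.out.println(countPerKey(view));    // number of people per interest
    }
}
```

A view keyed on FirstName would follow the same pattern, with a map function that emits `doc.get("FirstName")` as the key; looking up "Wallace" in that view then corresponds to the query in the text. Note that the second document has no Address field, and the map function simply works with whatever fields are present.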

Views themselves are stored as CouchDB documents within the server. The views are regenerated on access or on demand. The idea is to reduce load on the server's resources by having results ready to hand. CouchDB is scaled up by being able to replicate databases between CouchDB servers efficiently. The replication also works with occasionally connected nodes. Ubuntu One, for example, uses CouchDB to synchronise between client PCs and its cloud storage, and the BBC uses CouchDB as its internal NoSQL database, providing storage for its many web services.

To read more about CouchDB check out the CouchDB book or visit the CouchDB site at Apache.org.

Graph databases and Neo4j

One of the more resource intensive and problematic things to save in a SQL database is information on relationships between things. Traversing a relationship can involve multiple queries. Graph databases focus on representing the graph itself in the database. Neo4j is an example of this kind of database. Rather than creating records, you create nodes which you then associate with each other by saying what relationship they have. For example, this snippet of Java code:

    Node firstNode = graphDb.createNode();
    Node secondNode = graphDb.createNode();
    Relationship relationship = firstNode.createRelationshipTo(secondNode,
            MyRelationshipTypes.OWNS);

creates two nodes and says that a relationship, OWNS (defined elsewhere as a simple enum), exists between the first node and the second node. Data about these nodes and relationships is stored as properties of each, so:

    firstNode.setProperty("name", "Wallace");
    secondNode.setProperty("name", "Gromit");
    relationship.setProperty("licence", "dog");

says the first node (Wallace) owns (with a dog licence) the second node (Gromit). From this simple model it is possible to create easily traversable graphs of relationships. FlockDB is another graph database, created by Twitter developers to hold their specialised graph of who follows whom on the microblogging service. Graph databases can offer a much simpler way of handling information which is primarily concerned with the interconnectedness of things, such as social networks. Although it may be possible to store unlinked data in a graph database, it is not appropriate to do so. More information on Neo4j can be found on the Neo4j website.
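To illustrate why a model of nodes, typed relationships and properties is easy to traverse, here is a minimal in-memory sketch in plain Java. It deliberately does not use the Neo4j API; the class and method names (`GraphSketch`, `Node`, `Edge`, `RelType`, `relateTo`, `reachable`) are hypothetical, and a real graph database adds persistence, indexing and transactions on top of this idea.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory property graph; not the Neo4j API. */
final class GraphSketch {

    enum RelType { OWNS, KNOWS }

    static final class Node {
        final Map<String, Object> properties = new HashMap<>();
        final List<Edge> outgoing = new ArrayList<>();

        Node(String name) { properties.put("name", name); }

        /** Links this node to another and returns the edge so properties can be set on it. */
        Edge relateTo(Node target, RelType type) {
            Edge edge = new Edge(type, target);
            outgoing.add(edge);
            return edge;
        }
    }

    static final class Edge {
        final RelType type;
        final Node target;
        final Map<String, Object> properties = new HashMap<>();

        Edge(RelType type, Node target) { this.type = type; this.target = target; }
    }

    /** Breadth-first walk along edges of one type; returns reachable node names in visit order. */
    static List<String> reachable(Node start, RelType type) {
        Set<Node> seen = new LinkedHashSet<>();
        Deque<Node> queue = new ArrayDeque<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Edge edge : current.outgoing) {
                // Follow only the requested relationship type, and skip nodes already visited.
                if (edge.type == type && seen.add(edge.target)) {
                    queue.add(edge.target);
                }
            }
        }
        seen.remove(start);
        List<String> names = new ArrayList<>();
        for (Node node : seen) {
            names.add((String) node.properties.get("name"));
        }
        return names;
    }

    public static void main(String[] args) {
        Node wallace = new Node("Wallace");
        Node gromit = new Node("Gromit");
        wallace.relateTo(gromit, RelType.OWNS).properties.put("licence", "dog");
        System.out.println(reachable(wallace, RelType.OWNS));
    }
}
```

Because each node holds direct references to its relationships, following a chain of any length is a loop over in-memory links rather than a series of separate queries, which is the contrast with the relational approach drawn earlier in this section.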

Next: Key/Values, Columns and what is NoSQL

