Key/Value stores and Redis
The key/value database is notionally even simpler: a key of some description points to a value, which at its simplest could be an arbitrary string. Key/value stores have, for a long time, been embedded within applications. Unix stalwarts such as dbm, gdbm and Berkeley DB are key/value stores. What changes with NoSQL key/value stores is that, usually but not always, they make the key/value store a standalone database service which is accessible with web techniques such as REST.
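To make the embedded flavour concrete, here is a minimal sketch using Python's standard `dbm` module, which wraps whichever dbm-style library is available on the system. The key names and the `demo_embedded_store` helper are invented for illustration.

```python
import dbm
import os
import tempfile


def demo_embedded_store(path):
    """Write and read string values in an embedded key/value file."""
    # "c" opens the database for reading and writing, creating it if needed.
    with dbm.open(path, "c") as db:
        db["user:1:name"] = "Ada"
        db["user:1:email"] = "ada@example.org"

    # Reopen read-only: the data lives in a local file, not in a server.
    with dbm.open(path, "r") as db:
        # dbm stores keys and values as bytes, so decode on the way out.
        return db["user:1:name"].decode("utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_embedded_store(os.path.join(tmp, "example_db")))
```

The application opens a file directly; there is no separate database process to talk to, which is the part that the standalone NoSQL key/value services change.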
Within the key/value stores there are two sub-classes: the in-memory variants, which retain their data in memory for performance, and the on-disk versions, which save their data directly to disk. The in-memory variants are useful as distributed cache mechanisms, while the on-disk versions are used for data storage.
Redis is one example of an in-memory key/value store with disk persistence, and it goes beyond storing simple string values by allowing lists, sets and sorted sets to be stored. For overall performance, Redis keeps the whole data set in memory and persists it to disk using one of two techniques: periodic snapshots or an append-only log of write commands. For each of the data types Redis supports, there is a set of specialised commands which are exposed in APIs for a wide range of languages. Redis is used by Engine Yard, GitHub and Craigslist for their data storage needs. For more on Redis, consult the Redis project site.
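The idea of typed values held in memory and periodically written to disk can be sketched in a few lines of plain Python. This is not Redis or its client API: `TinyStore` is a hypothetical class, its method names merely echo the Redis commands RPUSH, SADD, ZADD and ZRANGE, and the pickle-based snapshot is an assumption chosen for brevity.

```python
import os
import pickle
import tempfile


class TinyStore:
    """Hypothetical in-memory key/value store with a few typed values."""

    def __init__(self):
        self.data = {}  # the whole data set lives in this dict

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def rpush(self, key, item):
        # Append to the list stored at key, creating it on first use.
        self.data.setdefault(key, []).append(item)

    def sadd(self, key, member):
        # Add a member to the set stored at key; duplicates are ignored.
        self.data.setdefault(key, set()).add(member)

    def zadd(self, key, score, member):
        # A sorted set is modelled as a member -> score mapping.
        self.data.setdefault(key, {})[member] = score

    def zrange(self, key):
        # Return members ordered by score, then by member name.
        scores = self.data.get(key, {})
        return sorted(scores, key=lambda m: (scores[m], m))

    def save(self, path):
        # Snapshot: write the entire in-memory data set to disk in one go.
        with open(path, "wb") as f:
            pickle.dump(self.data, f)

    @classmethod
    def load(cls, path):
        # Restore a snapshot; only unpickle files you wrote yourself.
        store = cls()
        with open(path, "rb") as f:
            store.data = pickle.load(f)
        return store


if __name__ == "__main__":
    store = TinyStore()
    store.zadd("scores", 20, "bob")
    store.zadd("scores", 10, "alice")
    with tempfile.TemporaryDirectory() as tmp:
        snapshot = os.path.join(tmp, "snapshot.bin")
        store.save(snapshot)
        print(TinyStore.load(snapshot).zrange("scores"))
```

All reads and writes touch only the dictionary in memory; the disk is involved only when a snapshot is taken, which is where the performance benefit comes from, at the cost of losing whatever changed since the last snapshot if the process dies.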
Column-oriented databases and Cassandra
A typical SQL database is concerned with rows, each row representing a record of fields. This view is comfortable for humans as it maps easily to how we tend to record things in, say, a card index. But consider a card index with a sales figure per employee on each card. To get total sales, you would have to step through each card, retrieve it, add the sales figure and move on to the next card. SQL databases can be optimised for this case, but the bigger the volume of data, the longer it can take. Now, consider if, when you were writing the sales figure for each employee down, you also added it to a list of sales figures. When it came to calculating total sales, just adding up the numbers on that list would produce the required total. This much faster process is, in essence, what a column-oriented database is about. It stores its data such that it can be rapidly aggregated with less I/O activity.
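The card index analogy can be sketched in plain Python. The employee records below are invented sample data, and the two layouts are only an illustration of the idea, not how any particular database stores its data on disk.

```python
# Row-oriented: one record per employee, like one card per employee.
rows = [
    {"name": "Alice", "region": "North", "sales": 1200},
    {"name": "Bob", "region": "South", "sales": 850},
    {"name": "Carol", "region": "North", "sales": 430},
]


def total_sales_rows(records):
    # Every whole record has to be visited just to read one field from it.
    return sum(record["sales"] for record in records)


def to_columns(records):
    # Column-oriented: one list per field, kept in the same record order.
    cols = {}
    for record in records:
        for field, value in record.items():
            cols.setdefault(field, []).append(value)
    return cols


def total_sales_columns(cols):
    # Only the "sales" list is touched; names and regions are never read.
    return sum(cols["sales"])


if __name__ == "__main__":
    columns = to_columns(rows)
    print(total_sales_rows(rows), total_sales_columns(columns))
```

Both functions return the same total. The difference is in how much data has to be read to get there, which on disk translates into the reduced I/O described above.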
The column-oriented database has typically been used in data mining and analytics applications, where the storage method is optimal for the common operations performed on the data. Column-oriented databases tend, by their nature, to be hybrids of classic relational databases and column-oriented technology.
Take the Apache Cassandra database, a blend of column orientation and key/value store that offers a decentralised and "eventually consistent" distributed database. It was originally developed by Facebook, is now developed as an Apache project and is used by Digg, Twitter, Reddit and others. Cassandra's column values have a name, a value and a time stamp. These columns can be grouped as a ColumnFamily, which is analogous to a relational database's table. The columns can also be tagged as SuperColumns, which can be retrieved in time stamp order rather than key order.
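A rough sketch of that data model in Python follows. It is not Cassandra's API: the `Column` and `ColumnFamily` classes are hypothetical, and the rule of keeping the write with the newest time stamp (with the value as an arbitrary tie-break) is a simplification of how time stamps let replicas reconcile conflicting writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """A column as described above: a name, a value and a time stamp."""
    name: str
    value: str
    timestamp: int


class ColumnFamily:
    """Hypothetical stand-in for a column family: row key -> named columns."""

    def __init__(self, name):
        self.name = name
        self.rows = {}  # row key -> {column name -> Column}

    def insert(self, row_key, column):
        row = self.rows.setdefault(row_key, {})
        current = row.get(column.name)
        # Keep the newer write. Comparing (timestamp, value) is an assumption
        # of this sketch that makes the outcome independent of arrival order.
        if current is None or (column.timestamp, column.value) > (
            current.timestamp,
            current.value,
        ):
            row[column.name] = column

    def get(self, row_key, column_name):
        column = self.rows.get(row_key, {}).get(column_name)
        return None if column is None else column.value


if __name__ == "__main__":
    old = Column("status", "at lunch", timestamp=100)
    new = Column("status", "back at desk", timestamp=200)

    # Two replicas receive the same writes in a different order.
    replica_a = ColumnFamily("Users")
    replica_b = ColumnFamily("Users")
    for column in (old, new):
        replica_a.insert("alice", column)
    for column in (new, old):
        replica_b.insert("alice", column)

    print(replica_a.get("alice", "status"), replica_b.get("alice", "status"))
```

Because each column carries its own time stamp, both replicas settle on the same value without coordinating with each other, which is the intuition behind "eventually consistent".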
The Cassandra hybrid model, in combination with its scalability, makes it a useful alternative for situations where there is a large amount of active, near real time data being stored; hence its use behind social networking sites, which are an exemplar of that situation. To find out more about Cassandra, consult the Apache project page.
So what is a NoSQL database?
As you can see from the four examples, NoSQL databases are focussed on particular classes of problems, from being more flexible about stored data (document stores) to targeting use cases like relationships (graph databases) and aggregating data (column databases), or just simplifying the idea of a database down to something that stores a value (key/value stores).
The SQL-based relational database model still has many years ahead of it for traditional data storage applications, but, partly because of their open source development model and partly because of their regular deployment in production services, NoSQL databases offer rapidly improving alternatives or complements to those SQL-based silos of information. The SQL databases still retain their advantages in terms of delivering ACID and relational capabilities, which makes them still the best choice for many traditional enterprise uses. SQL database technology isn't standing still either; Ingres, for example, is putting the VectorWise technology into its SQL database, to give it the edge against the column stores for data analysis.
What NoSQL databases represent is a new disruptive force in the data storage and retrieval business, applying techniques which have often been ignored because they didn't fit with how standard SQL managed things. Now we are moving to a time of a more mixed and much richer data storage and retrieval ecosystem.
- NoSQL Databases, an index of NoSQL databases.
- Apache CouchDB 0.11.0 loses the alpha/beta tag, a report from The H.
- VMware hires key Redis developer, a report from The H.
- Apache Cassandra 0.6 released, a report from The H.