The H Half Hour: 10Gen CTO Eliot Horowitz
by Dj Walker-Morgan
MongoDB is one of the most visible NoSQL databases out there and 10Gen's CTO is apparently one of the most hands-on coding CTOs out there. So when he was in London recently, The H just had to have a chat with Eliot Horowitz about his technical philosophy of what MongoDB is, where it's going and how being an active developer informs his decision-making process:
The H: What do you think of when you think of MongoDB in the enterprise?
EH: The way I view the world from a high level is: what makes MongoDB interesting to people who are used to using Oracle? Besides the price issue and Oracle's pricing model, the fundamental difference is the document model. You've got the document model versus the relational model. The document model lets you develop more easily. It lets you be more agile, it lets you scale better and it lets you have better performance in many cases. For a large range of applications, that data model and its consequences are incredibly compelling. And so developers in the enterprise, who are no different from other developers, want to be more productive and be able to scale more easily. The fact that running it is probably a lot cheaper than Oracle for similar applications certainly doesn't hurt.
The H: Some relational databases are taking steps into the schema-less document model space. How do you view that?
EH: Whether or not there's a schema is, in some ways, orthogonal in my mind. I think at some point Mongo could have a schema. An obvious step to me would be a server-side document validator: any document inserted has to meet it; if not, it's rejected. Is that a schema? It's kind of like a schema, it's pretty close to a schema. Does that betray any Mongo fundamentals? I think absolutely not. It is a kind of odd position.
Because it's a document model it's easier not to have a schema, but in some cases one can be very helpful, especially in enterprises – where you may have an ops team not sitting next to the dev team, or five dev teams using the same data – a schema can be quite useful. And if you make the tooling run simply with different versions of the schema, it could actually be quite elegant. So I don't think schema-less is what it's about. I can imagine a relational database with a lighter schema. PostgreSQL is doing interesting things in that space. I really think it's more about the overall data model rather than schema versus schema-less.
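The server-side validator Horowitz describes can be sketched independently of any particular database. The following is a minimal, hypothetical illustration in Python – the schema format, `validate` function and `Collection` class are invented for this sketch and are not MongoDB's API:

```python
# A minimal sketch of server-side document validation: each collection
# carries a schema, and inserts that don't satisfy it are rejected.
# The schema format here is hypothetical, not MongoDB's.

def validate(document, schema):
    """Return True if every required field exists with the right type."""
    for field, expected_type in schema.items():
        if field not in document:
            return False
        if not isinstance(document[field], expected_type):
            return False
    return True

class Collection:
    def __init__(self, validator=None):
        self.validator = validator
        self.docs = []

    def insert(self, document):
        # Reject any document that fails the collection's validator.
        if self.validator and not validate(document, self.validator):
            raise ValueError("document failed validation")
        self.docs.append(document)

users = Collection(validator={"name": str, "age": int})
users.insert({"name": "Ada", "age": 36})        # accepted
# users.insert({"name": "Bob", "age": "old"})   # would be rejected
```

Versioning the validator per collection, as the interview suggests, would then let old and new document shapes coexist while an application migrates.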
The H: We've noted experiments with relational databases and documents which decant "hot" elements of documents into columns of the database. Does that answer any issues?
EH: The document model matters because the query language understands hierarchy, and not just hierarchy but one-to-many relationships embedded in a document; that's where those things tend to break down. For any relational database, one row is going to equal one or zero entries in an index, and it's a pretty big shift to thinking that one entry in the database could equal zero to n entries in an index – and what does that mean for the query engine, and how will the semantics mesh up? That's why I think it won't fit well into a relational model; if someone figures out how to do it, it'll be interesting – that's the fundamental challenge. You can start with JSON and do queries on it, but having the database natively understand hierarchy and making the query language very expressive is challenging for relational databases.
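The "one entry in the database could equal zero to n entries in an index" point is what MongoDB calls a multikey index. A toy illustration of the idea (the `build_index` helper and document shapes here are invented for this sketch, not MongoDB internals):

```python
# Toy illustration of "one document -> zero to n index entries":
# indexing a field that holds an array produces one index entry per
# array element, unlike a relational row, which yields at most one.

from collections import defaultdict

def build_index(documents, field):
    """Map each value of `field` to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(documents):
        value = doc.get(field)
        if value is None:
            continue                      # zero index entries
        if isinstance(value, list):
            for element in value:         # n index entries
                index[element].add(doc_id)
        else:
            index[value].add(doc_id)      # exactly one index entry
    return index

docs = [
    {"title": "post 1", "tags": ["mongodb", "nosql"]},
    {"title": "post 2", "tags": ["nosql"]},
    {"title": "post 3"},                  # no tags: nothing indexed
]
idx = build_index(docs, "tags")
# idx["nosql"] now points at documents 0 and 1
```

A relational query planner built around the row-equals-one-entry assumption has to be rethought once a single document can fan out like this, which is the challenge Horowitz describes.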
That's the real benefit of MongoDB as a document engine; the native understanding of that hierarchy and having a query language that intuitively understands it. We're investing a lot in that scenario too; you want to make it easy to store hierarchy, to really understand what those structures mean and how the tree can be modified without affecting other parts of the tree. And that's what we are focusing on to make it really nice.
The H: So what in particular are you working on?
EH: Right now we're doing a whole bunch of work on both the query side and update side. We're working on those aggressively, doing a lot of refactoring, because the one thing we want to do is add a lot more update operators more easily. We've got fifteen now, we think we need a hundred. Adding each one now takes a mid-level developer about a week. Our design goal is to get it so a junior engineer can do it in a day. This will open up the opportunity to do lots more update operators that are more complex.
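The design goal of making each new update operator cheap to add suggests a registry-style dispatch, where a new operator is one small registered function rather than a change to the engine. A hypothetical sketch (MongoDB's real operators are implemented in its C++ server; these names and this structure are only illustrative):

```python
# Sketch of a pluggable update-operator registry: adding an operator
# means registering one small function. Hypothetical design, not
# MongoDB's actual implementation.

OPERATORS = {}

def operator(name):
    """Decorator that registers an update operator under `name`."""
    def register(fn):
        OPERATORS[name] = fn
        return fn
    return register

@operator("$set")
def op_set(doc, field, value):
    doc[field] = value

@operator("$inc")
def op_inc(doc, field, value):
    doc[field] = doc.get(field, 0) + value

def apply_update(doc, update):
    """Apply an update spec like {"$inc": {"views": 1}} to a document."""
    for op_name, changes in update.items():
        fn = OPERATORS[op_name]
        for field, value in changes.items():
            fn(doc, field, value)
    return doc

post = {"title": "hello"}
apply_update(post, {"$set": {"author": "eliot"}, "$inc": {"views": 1}})
# post is now {"title": "hello", "author": "eliot", "views": 1}
```

Under a design like this, going from fifteen operators to a hundred is mostly a matter of writing a hundred small functions, which is roughly the junior-engineer-in-a-day goal described above.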
The H: What about the query side of things?
EH: The query side is a little more difficult. The really challenging part is what I've been working on, which is how these operators interact with the query optimiser – so that if I add a new operator and implement particular methods, the query optimiser will know how to understand it. Then I can add query operators more easily, thereby making the query optimiser a lot smarter. The query optimiser is a little too simple and we need to do more intelligent things like index intersections – we can make it a lot smarter.
We have an interesting optimiser and we won't be changing the fundamentals of it; we don't analyse the data statistics, we analyse the execution. Some of this is personal experience, some of this is research, but statistical optimisers can get people into trouble when the statistics or the query plan change suddenly, so we take the flipside – let's say there are three possible query plans for a given query, we will actually try all three for a little while, see who is going to win, choose that one and remember it. It's all runtime statistics that we feed back into the system. We think it's a better model, a little complicated to do, but we wrote it four years ago and it's done surprisingly well. Now it's time for version two.
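The execution-based approach described – race the candidate plans briefly, keep the winner, and cache the choice for that query shape – can be sketched as follows. This is a simplified illustration under invented names; MongoDB's actual plan selection and plan cache are considerably more involved:

```python
# Simplified sketch of an execution-based (rather than statistics-based)
# query optimiser: run every candidate plan for a short trial, keep
# whichever finishes fastest, and cache that choice for next time.

import time

plan_cache = {}

def choose_plan(query_shape, candidate_plans, trial_runs=3):
    """Race the candidates and remember the winner per query shape."""
    if query_shape in plan_cache:
        return plan_cache[query_shape]          # cached earlier winner
    timings = {}
    for name, plan in candidate_plans.items():
        start = time.perf_counter()
        for _ in range(trial_runs):
            plan()                              # execute a trial run
        timings[name] = time.perf_counter() - start
    winner = min(timings, key=timings.get)
    plan_cache[query_shape] = winner            # runtime stats feed back
    return winner

# Two stand-in "plans" with very different costs:
fast_plan = lambda: sum(range(100))
slow_plan = lambda: sum(range(100_000))

best = choose_plan("find {a: ?}", {"fast": fast_plan, "slow": slow_plan})
# best == "fast", and later calls for the same shape hit the cache
```

The appeal of this model, as the interview notes, is that the decision tracks observed behaviour rather than data statistics, so it cannot be wrong-footed by stale or shifting statistics – though a real system also needs a way to evict a cached plan when it stops winning.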