Bytepawn Marton Trencseni on Software, Systems and other Ideas.

The Confused World of "NoSQL"

2009/11/28

Non-relational datastores are usually thrown together under the umbrella term "NoSQL", which recently got its own Wikipedia entry. Like the Wikipedia entry, the world of "NoSQL" is changing quickly. Here I will distinguish the different use-cases and motivations for using and building such systems.



Tower of Babel

"NoSQL" and scalability. The original inspiration for many open-source projects is that large players like Google and Amazon chose not to use MySQL or Oracle in certain cases and developed in-house systems for scalability reasons, meaning storage, availability and performance scalability (see my Readings in Distributed Systems for more on these). There are a number of open-source projects which follow in these footsteps (like our own Keyspace datastore and our PaxosLease algorithm), trying to build truly scalable software. These systems are "NoSQL" because distributing complex relational operations across nodes is complicated, and because they are meant for performance-intensive applications anyway, where key-value datastores are a good basis for optimization in the application layer. Unfortunately, the space of truly distributed open-source software is contaminated by virtually every non-relational project claiming to be distributed and scalable.
Systems in this category: Keyspace, Apache Hadoop, Cassandra
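
To make "optimization in the application layer" a bit more concrete, here is a minimal Python sketch of what a JOIN turns into on top of a key-value store. Everything in it is an assumption made for illustration: the ShardedKV class, the "user:" / "order:" key naming and the hash-based node selection are hypothetical and are not the API of Keyspace, Hadoop or Cassandra. Each "node" is just an in-memory dict.

```python
import hashlib


class ShardedKV:
    """Toy key-value store (hypothetical): each 'node' is a plain dict."""

    def __init__(self, num_nodes=3):
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        # Hash the key so the same key always maps to the same node.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def set(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        # Returns None when the key is missing.
        return self._node_for(key).get(key)


def orders_with_user_names(store, order_ids):
    """Application-layer 'join': one get per order, one get per referenced user."""
    result = []
    for order_id in order_ids:
        order = store.get("order:" + order_id)
        if order is None:
            continue  # unknown order, skip it
        user = store.get("user:" + order["user_id"])
        result.append((order_id, order["item"], user["name"] if user else None))
    return result


store = ShardedKV()
store.set("user:1", {"name": "alice"})
store.set("user:2", {"name": "bob"})
store.set("order:100", {"user_id": "1", "item": "book"})
store.set("order:101", {"user_id": "2", "item": "lamp"})

joined = orders_with_user_names(store, ["100", "101", "999"])
print(joined)
```

The datastore itself only ever sees single-key gets and sets, which are easy to spread across nodes. The relational logic lives in the application, where it can be cached, batched or denormalized as the workload requires.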

"NoSQL" and SQL. Some have claimed that "NoSQL" is not so much about getting rid of SQL as it is about building something scalable, but this is not true in general. The primary example here is CouchDB, which is getting a lot of attention from web developers who like its simple, REST-based data model and query facilities. CouchDB is written in Erlang and is fairly slow, so it cannot be taken seriously as a scalable system in the sense above, and most CouchDB users are not scaling their installation. CouchDB is a lightweight solution for web apps, similar to Microsoft Access / Lotus Notes in the business world. Many web developers are non-advanced users of SQL anyway, not using "advanced" features like foreign keys, JOINs, inner queries or stored procedures. These users choose CouchDB (over an ORM layer) to get rid of ugly embedded SQL code which doesn't do much for them anyway. I believe there is a portion of "NoSQL" users who really are trying to get rid of SQL.
Systems in this category: Redis, CouchDB

"NoSQL" and UNIX. Many proponents like "NoSQL" datastores because they adhere to the UNIX philosophy of building small, lightweight tools for a specific purpose but of general applicability: "do one thing well". Uncontroversially, relational databases do not adhere to the UNIX philosophy: they are large, multi-million line code bases of non-trivial complexity and runtime characteristics. (On the other hand, they have stood the test of time and become the most successful and lucrative data model in computer science.) This is one of the reasons why projects like Memcached are put under the "NoSQL" umbrella, even though Memcached is not a persistent datastore and as such is not competition for a relational database. Also noteworthy are BerkeleyDB and Tokyo, which are low-level key-value storage engines that can be used to build more complex datastores (MySQL used to have a BDB engine, Keyspace 1.x uses BDB).
Systems in this category: Memcached, BerkeleyDB, Tokyo

I believe it would be beneficial to separate these use-cases and treat them differently (e.g. call one NoSQL and the other DDS for Distributed Data Store). While some of them concentrate on implementing good distributed algorithms, others concentrate on making web developers' lives easier. Both are valid use-cases and noble goals. Clearly, the former may have less trendy API facilities, while the latter may be easier to set up and administer but slower. This is not a put-down, though; it resembles comparing PHP to C++: one of them gives you easy entry and rapid development cycles, while the other gives you raw speed at the expense of compiled code and pointers. But treating them under the same umbrella, as many presentations do, is confusing. It gives the impression of "Pick ONE" or "ONE will survive", which goes against the basic tenet of specialization and UNIX, the starting point of "NoSQL" datastores. Also, people looking to scale to terabytes of data may not be interested in CouchDB, while people looking to replace MySQL for their simple in-house web apps may not be interested in Hadoop.


- Marton Trencseni

