SQL Databases Are An Overapplied Solution (And What To Use Instead)

databases

Wed Jul 08 11:33:39 -0700 2009

Relational databases are a great solution for storing a certain type of data that you’ll need to access in a certain way:

  • Small records with complex, well-defined, highly normalized relationships.
  • The type of queries you will be running on the data is largely unknown up front.
  • The data is long-lived: it will be written once, updated infrequently relative to the number of reads/queries, and deleted either never, or many years in the future.
  • The database does not need round-the-clock availability (middle-of-the-night maintenance windows are no problem).
  • Queries do not need to be particularly fast; they just need to return correct results eventually.
  • Data integrity is 100% paramount, trumping all other concerns, such as performance and scalability.

Because SQL databases proved to be such an excellent tool within this narrow but often useful role, developers found themselves reaching for the SQL database for any job at all. It’s the default, the catch-all for storing nearly anything the application developer can think of - and even some things that aren’t data storage at all. When all you have is a hammer, etc.

Overapplication of the solution has given SQL databases a bad rap. Instead, let’s shift our thinking: use SQL databases for what they are good for, and select a different tool for other kinds of jobs.

Here are a few scenarios, and the tool you might use for each of them instead of a SQL database.

Document Records

A document record is a bundle of loosely defined data, typically represented as a hash (potentially with keys that point to other nested hashes and arrays). There are many examples of data that, in a SQL database, would be split across many tables and joined together at retrieval time, but that could be better represented as a single document. (The Peepcode CouchDB screencast preview illustrates this in a visual way.)

Some examples of document records include: a post on a blog; a user profile on a social network or dating site; and an order on an e-commerce site or in a shipping and receiving system. I find it easiest to visualize these as JSON, which not coincidentally is exactly how document databases often store the data.

For example, here’s what a user profile record might look like on a social network:

{
  'username': 'adamwiggins',
  'full_name': 'Adam Wiggins',
  'joined_on': '21 May 2009',
  'last_login': '02 Jul 2009',
  'website': 'http://adam.blog.heroku.com/',
  'interests': [ 'ruby', 'entrepreneurship', 'dynamism' ]
}

Another example, an e-commerce order:

{
  'shipping_name': 'Joe Smith',
  'shipping_address': '123 Main St',
  'shipping_city': 'Anytown',
  'shipping_state': 'CA',
  'shipping_zip': '12345',
  'ordered_at': 'Thu, 02 Jul 2009 19:40:04 GMT',
  'items': [
    { 'product_code': 'AX5718', 'qty': '1', 'price_per_unit': '5.00' },
    { 'product_code': 'BB9388', 'qty': '3', 'price_per_unit': '2.00' }
  ],
  'tax': '0.00',
  'shipping': '10.00',
  'total': '21.00'
}

In a SQL database, you would break data like this into multiple tables connected by relationships and retrieved with joins. But in practice, the app almost always wants to retrieve the entire record all at once, and write the entire record all at once. (Many of the leaky abstraction problems in ActiveRecord have to do with the fact that it’s trying to manipulate a set of relational tables as if they were a single, atomic document.)

For document records of this nature, what you want is a document database. CouchDB and MongoDB are two front-runners in this realm, and Tokyo Tyrant has a document mode (the table engine) as well. These manipulate documents as single units, but also let you build indexes and run queries on subelements - for example, finding out how many people have ordered a given product.
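To make that concrete, here's a minimal sketch in Python with the pymongo driver (the post doesn't tie this to any particular language or driver, and the database, collection, and index names below are made up for illustration):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)  # assumes a local MongoDB server
orders = client['shop']['orders']         # hypothetical database and collection

# Write the whole order as a single document, just like the JSON above.
orders.insert_one({
    'shipping_name': 'Joe Smith',
    'items': [
        {'product_code': 'AX5718', 'qty': 1, 'price_per_unit': 5.00},
        {'product_code': 'BB9388', 'qty': 3, 'price_per_unit': 2.00},
    ],
    'total': 21.00,
})

# Index a subelement, then count how many orders include a given product.
orders.create_index('items.product_code')
print(orders.count_documents({'items.product_code': 'AX5718'}))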

Binary Assets

The filesystem (including distributed filesystems like GFS) and SQL blobs are both wrong for binary assets, tempting though they may appear.

For file uploads to a web application, or serving media like video or mp3s, an asset store like S3 or CloudFiles is the right tool. For internal storage of binary data blobs, try Tokyo Tyrant.
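As a sketch of what the asset-store approach looks like (here in Python with boto3; the bucket and key names are placeholders):

import boto3

s3 = boto3.client('s3')

# Push the uploaded file to S3 rather than into the database.
s3.upload_file('/tmp/avatar.png', 'my-app-assets', 'avatars/adamwiggins.png')

# Hand clients a URL instead of streaming blobs through the app.
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'my-app-assets', 'Key': 'avatars/adamwiggins.png'},
    ExpiresIn=3600,  # link expires in one hour
)
print(url)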

Transient Data

Transient data includes sessions, metrics, and logs. Storing this kind of data in a SQL database is a road to madness.

Sessions are something that expire - either automatically, based on an expiration date, or explicitly, when the user completes their transaction or logs out of the website.

Metrics don’t expire in such a direct fashion, but they are generally collected at small intervals, then collapsed into averages for the longer term. In some cases it may make sense to collect your minute-by-minute metrics in a non-SQL datastore and then summarize by day or week into a SQL database. However, metrics generally have no relationships to track, so the main strength of relational databases is of no help here.

Logs combine some of the properties of both sessions and metrics. One example of logging to a database that I’ve seen over and over is sites that track each pageview as a SQL record. (Years ago I consulted for a banner ad company that was storing every single clickthrough as a record; they had 450 million records, and did all sorts of table-partitioning gymnastics to make the data queryable.) Generally, logs should be expired based on the arrival of more logs - in effect, a ring buffer.

Logs and metrics are also frequently written and infrequently read, which is the exact opposite of what SQL databases are good at.

Two excellent choices for transient data are Memcached (implicit expiration based on LRU) and Redis (explicit expire times, and handy append operations).
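Here's a rough sketch of both in Redis, using the Python redis client (the key names and timeouts are illustrative):

import json
import redis

r = redis.Redis()  # assumes a local Redis server

# A session with an explicit expire time: Redis deletes it after an hour,
# or we delete it ourselves when the user logs out.
r.setex('session:abc123', 3600, json.dumps({'user': 'adamwiggins'}))
r.delete('session:abc123')  # explicit logout

# A log as a ring buffer: push new entries on the left, trim old ones off the right.
r.lpush('log:pageviews', '/about viewed at 19:40:04')
r.ltrim('log:pageviews', 0, 9999)  # keep only the most recent 10,000 entries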

Memcached's INCR and DECR, and Redis' INCR, INCRBY, DECR, and DECRBY, are a great way to get what you really want from metrics: a count. SQL databases are very poor for counting things, because you either store each hit as a record and do a count() or sum(column) later on, or do an UPDATE value=value+1 WHERE id=whatever, which on some implementations is the same as an INSERT plus a DELETE - a very heavyweight set of operations for something as simple as a count. Check out INCR next time you want to count things.
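For example (again Python with the redis client; the key name is made up):

import redis

r = redis.Redis()

# One atomic increment per hit: no INSERT per pageview, no count(*) later.
r.incr('hits:/about')
r.incrby('hits:/about', 4)        # INCRBY for batched increments
print(int(r.get('hits:/about')))  # -> 5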

Locking

MySQL’s GET_LOCK and PostgreSQL’s advisory locks can be used for system-wide locks. I’ve actually seen relatively few problems caused by these, but their use still reflects the tendency for databases to grow into Swiss Army knives, rather than staying focused on what they do best.
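For reference, this is roughly what leaning on the database for a system-wide lock looks like (a Python sketch with psycopg2; the lock id and connection string are arbitrary):

import psycopg2

conn = psycopg2.connect('dbname=myapp')  # assumed connection string
cur = conn.cursor()

# Try to take advisory lock number 42; returns true if we got it.
cur.execute('SELECT pg_try_advisory_lock(42)')
if cur.fetchone()[0]:
    try:
        pass  # do the work that needs system-wide exclusivity
    finally:
        cur.execute('SELECT pg_advisory_unlock(42)')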

I would contend that with the right architecture, system-wide locks aren’t needed at all. Perhaps I’ll cover that in a future post.

Interprocess Communication

It’s convenient to use the SQL database as a message bus, since all your processes already have access to it. Delayed::Job does this, to enable queues without needing to run a daemon, which I think makes sense as a way to get started easily. But as you get more serious with message-based concurrency, you’ll want to upgrade to a true message bus.

RabbitMQ is the one we use internally at Heroku, so I’m rather partial to it. Some others include SQS, Beanstalkd, and Kestrel.
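A bare-bones sketch of queueing through RabbitMQ, in Python with the pika client (the queue name and message shape are made up):

import json
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = conn.channel()
channel.queue_declare(queue='jobs', durable=True)

# Producer: any process can publish without touching the database.
channel.basic_publish(
    exchange='',
    routing_key='jobs',
    body=json.dumps({'task': 'send_welcome_email', 'user': 'adamwiggins'}),
)

# Consumer: a separate worker process drains the same queue.
def handle(ch, method, properties, body):
    print('got job:', json.loads(body))
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue='jobs', on_message_callback=handle)
channel.start_consuming()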

Conclusion

SQL databases don't scale, but that’s ok. If we restrict our use of SQL databases to only what they are good at - small, infrequently written, long-lived records with complex data relationships - and start storing all our other data in more appropriate places, we won’t need to push our SQL databases to such high levels of performance or scale.
