In my QCon talk “Horizontal Scalability via Transient, Shardable, Share-Nothing Resources,” I argued that memcached is the father of modern shardable resources. Today’s NoSQL key-value stores all owe some part of their inspiration to memcached. Even feature-rich datastores such as CouchDB or Cassandra borrow a cornerstone idea from memcached: throw away some features historically associated with databases in order to make big gains in scalability and resiliency.
Memcached was created to be a cache, as its name implies. But developers eventually discovered that it was useful for storing many types of transient data, such as sessions, page-view counters, or API rate limiting counters.
App developers storing data in memcached instead of their SQL database? Does that mean that memcached can be classified as a type of database system?
To answer that question, we have to work our way back to a definition for the family of software typically referred to as “databases.” I’m going to use the term datastore, because it seems more natural when applied to modern NoSQL options. (For simplicity’s sake, let’s assume that datastore, database, database system, and DBMS are all roughly synonymous.)
Here’s my definition:
A datastore is software that stores atomic chunks of data known as records, and allows those records to be retrieved later.
Datastores are a superset that includes relational databases, graph databases, key-value stores, and document databases. DBM, Tokyo Cabinet, Redis, S3, MySQL, PostgreSQL, CouchDB, MongoDB, Neo4j, and Hadoop are all part of this big happy family. Now on to the question of whether memcached belongs here as well.
Many would argue that memcached should be disqualified from being considered a datastore on account of its transience.
My definition above says that you can retrieve the data you’ve stored later. But what’s the duration of “later”? We expect datastores to be persistent; if they aren’t, what’s the point? But persistence does not have to be forever. It only needs to last as long as the application logic requires.
MongoDB offers capped collections and Redis offers expiring keys; in both of these cases, the fact that the data does not persist forever is a feature. Memcached is a datastore that has extreme transience as a feature. How many times have application developers written nightly cron jobs to clean up old session data from their SQL datastore? Using memcached, you can skip this extra garbage-collection step. Memcached is a good fit for data that you want to last for a little while, but not forever.
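To make the "expiry as a feature" idea concrete, here is a minimal Python sketch of a key-value store where every record carries a time-to-live. It is a toy, not memcached's implementation: the ExpiringStore class, its method names, and the injectable clock parameter are all hypothetical, chosen only for illustration. The point is that stale session data disappears on its own, with no cron job involved.

```python
import time


class ExpiringStore:
    """Toy in-memory key-value store with a per-key time-to-live (TTL).

    Hypothetical illustration only; not how memcached is implemented.
    """

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so expiry can be exercised without sleeping.
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        """Store a record that becomes unreadable after ttl_seconds."""
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key, default=None):
        """Return the record if present and not expired, else default."""
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Lazy cleanup: drop the stale record the first time it is read.
            del self._data[key]
            return default
        return value
```

A session written with store.set("session:abc", {"user": 42}, ttl_seconds=1800) simply stops being returned by get() half an hour later. The application has to treat a miss as normal (for example, by asking the user to log in again), which is exactly the trade-off described above.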
Memcached set an early example for many patterns now prevalent in NoSQL. It got us thinking about how we can make trade-offs between datastore features and ease of scaling. Memcached occupies the far extreme of this spectrum: it trades away almost every feature we associate with database systems, keeping just the bare minimum, and in return it gets blinding speed and near-infinite horizontal scalability. That trade proved to be a worthwhile one, as memcached is now a critical piece of infrastructure for many of the world's largest web apps.
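The scaling half of that trade can be sketched in a few lines of Python. In a share-nothing setup, the client decides which node owns a key by hashing it, so the nodes never need to talk to each other. The sketch below is a simplified illustration under stated assumptions: the ShardedClient class is hypothetical, and plain dicts stand in for separate cache servers.

```python
import hashlib


class ShardedClient:
    """Toy client that spreads keys across independent, share-nothing nodes.

    Hypothetical illustration: each "node" is a plain dict standing in
    for a separate cache server.
    """

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def _node_for(self, key):
        # Hash the key to a stable integer, then map it onto a node index.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % len(self.nodes)
        return self.nodes[index]

    def set(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)
```

Adding capacity means adding nodes to the list. One caveat: with simple modulo hashing like this, changing the number of nodes remaps most keys, which is why consistent hashing is commonly used instead. For transient data that is often acceptable, since a remapped key just looks like a cache miss.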
The memcached case is a great example of how NoSQL is broadening how we think about data storage and retrieval. This has opened us up to a variety of specialized datastores: memcached, S3, and Hadoop, to pick some very successful examples. Each of these occupies a unique (and often very large) niche in the data storage space. We’ve learned that not all data is the same; the proliferation of options for how we store and retrieve our data is a natural consequence.