Database Versioning

databases activerecord sequel datamapper

Mon Mar 02 09:35:56 -0800 2009

Migrations bother me.

On one hand, migrations are the best solution we have for the problem of versioning databases. The scope of that problem includes merging schema changes from different developers, applying schema changes to production data, and creating a DRY representation of the schema.

But even though migrations is the best solution we have, it still isn’t a very good one.

Some Problems With Migrations

It’s pretty rare that I’ve seen a mature production database (say, 100+ migrations) that doesn’t have some minor divergence between the contents of schema.rb and the schema in the database. The first time or two it happened I assumed that maybe I or some of the other developers on the project were careless or had used migrations incorrectly. But after seeing it happen a half dozen times on projects built by experienced, careful developers, I’ve come to think that it’s the migrations system which is to blame.

Another problem is that migrations necessarily handle both data and schema changes together. Very often this makes assumptions about the current state of the code (like accessing a model) or data (like assuming records exist in a certain table). Over time you find that you can’t run rake db:migrate on a fresh checkout of the project because it gets hung up on various migrations attempting to manipulate data. The solution is usually to manually remove the code from the migration(s) which touch the data. Ugly.

A small point, but as long as I’m on a roll: the way that ActiveRecord infers the class name and database version from the filename strikes me as iffy. The version especially - important data about the database which can only be found in a filename doesn’t seem like a great practice.

And finally there are the various headaches related to using migrations in development: creating a migration, running it, then after you write some code realizing you need to reverse it, writing code to reverse your original migration, realizing your original migration had some typos so your reverse migration needs to make the same mistakes in reverse…etc.

Compare all of this to the excellent tools we have for managing source code versioning - Git, Subversion, or even CVS. Our database versioning tools are very primitive by comparison to any of these.

Where Does the Schema Live?

In migrations, the canonical schema definition lives in the accumulation of all migrations. schema.rb and the schema stored by the database are both generated outputs. But over the long term, having the DRY authority for schema definition spread out among hundreds of migrations is not highly maintainable.

What about schemaless document-oriented databases, like CouchDB or BigTable? Although I think these are a great idea for various reasons, it does not solve the migration problem. If you, for example, split a name field out into first_name and last_name, you’ll need a data migration even in a schemaless database. An alternative might be to manage versioning at the record level, such that records with different versions can exist simultaneously in the database and be handled appropriately at runtime. Either way, you need a data versioning system.

DataMapper uses inline property definitions in the model code, which I really like - it moves the canonical schema representation front-and-center, instead of hidden off to the side; it shows you the latest version at all times; and it ties the schema directly to the code, which works very well in conjunction with source control.

But DataMapper still needs migrations, and those work exactly like ActiveRecord (and thus have all the same problems). In some ways this is a worse situation, because now you have scripts which manage database versions separate from the property definitions in the models, which means that the DRY authority for schema is now split among two places: the model source, and the migrations. Much as I prefer DataMapper and Sequel as evolutionary improvements on the ActiveRecord-style ORM, when it comes to long-term databases with all the inevitable data and schema changes that come over its lifetime, ActiveRecord is still king.

Solution Brainstorming

The foremost thing I’d like to improve about migrations is the way it is kept DRY: through the accumulation of migrations, rather than schema.rb or some other representation of the current, latest version. This is a bit backwards. Imagine if the way you edited code was by writing patches manually, then applying them. When you realize you wanted to write something different, you write another patch to undo your original patch, then a new patch with your desired changes. It would be tedious and error-prone to say the least, and yet this is exactly what we do with our data migrations.

Here’s some brainstorming on what the next-gen solution might feel like.

Schema

You have a single file in a structured data format, let’s call it db/schema.yml. (schema.rb being executable code has some substantial issues, which I’ve discovered working on yamldb and Taps.)

schema.yml lists out the same info currently found in schema.rb (tables, columns, and indexes), but it is the canonical representation, not a generated file. It is highly structured, meaning there is only one way that you can represent a given schema. So you can assume that two databases with the same schema will always produce schema.yml that is character-for-character the same, which is important for reasons that will become apparent momentarily. The file might look something like this:

---
tables:
  posts:
    title: text
    body: text
    author_id: integer
    created_at: datetime
    updated_at: datetime
  authors:
    name: text
    homepage: text
indexes:
  author_names:
    table: authors
    column: name

Now here’s the important bit: the database version is computed as a SHA1 hash of the schema.yml, the same way that git computes its hash id for commits. The database version is not inferred from the filename or extracted from within the file.

Next, there is a rake command - let’s say rake db:setup. This is the default task that a developer runs when doing a fresh checkout of the project. This loads schema.yml into a fresh database, and ignores migrations. Migrations are only used when transforming one version of the database into another version. The current database version is stored in a metatable named schema_version, just like ActiveRecord.

The application would be strict about the database version syncing with the code. If you try to start the application environment and the hash of schema.yml does not match that stored in your database, it will throw an exception. Perhaps even stricter checking would be possible on SQL databases: on app boot, dump out a temporary schema.yml and diff it against the schema.yml on disk. If they differ, throw an exception. This would clean up all the headaches around I-forgot-to-run-my-migrations which I’ve seen trip up even the most seasoned developers, and expose schema divergence in production closer to when it first happens.

Merging schema changes would be handled by revision control on schema.yml, rather than by writing migrations. Migrations would only be used for bringing forward production data from previous deploys of the code. This means that a project with no production database (i.e., still early development) would have no migrations at all. The presence of migrations implies a production database.

Migrations

Migrations in this solution are very separate from your schema representation. But of course they are necessary for a production database. So your app may have a migrations directory; here you can place scripts that define the data migrations (and schema migrations, for SQL databases), which explicitly name the database versions they migrate between. An example:

migrate :from => '123abc', :to => '456def' do |db|
  db[:authors].add_column :first_name, :text
  db[:authors].add_column :last_name, :text

  db[:authors].each do |row|
    row[:first_name], row[:last_name] = row[:name].split
    row.save
  end

  db[:authors].remove_column :name
end

Some things to note here:

  • We specify both from and to version. No inference of version ordering from the version names. (Not unlike the difference between Subversion’s sequential version numbering and Git’s hash versioning, or CouchDB’s hash keys rather than SQL database sequences.)
  • Access to the application environment is forbidden; there is no dependence on the state of the code. Instead we use raw arrays and hashes, Sequel-style.

You could create a reverse migration (flip the from and the to), but this would be an unusual circumstance. When working locally you never use migrations, but instead modify schema.yml. Once you’ve finalized your schema for the commit, only then do you write your migration. The framework could provide some assistance here, such loading up a previous schema.yml and running your migration against it, and making sure that the resulting dumped schema matches the new schema.yml.

Conclusion

Data versioning is a thorny problem; the solution described above is just speculation, a starting point for discussion. I want to get others out there thinking about this problem.

What would your next-gen database versioning system look like?