Graph Databases

Mon Mar 15 21:30:15 -0700 2010

Graph databases are a type of datastore which treat the relationship between things as equally important to the things themselves. Examples of datasets that are natural fits for graph databases:

Friend links on a social network
“People who bought this also bought…” Amazon-style recommendation engines
The world wide web

In graph database parlance, a thing (a person, a book, a website) is referred to as a “node,” while a relationship between two things (a friendship, a related book, an href) is referred to as an “edge.”

In most types of databases, the records stored in the database are nodes, and edges (relationships) are derived from a field on a node. In a SQL database, for example, you might have a table called “people” that includes a field “friend_id.” friend_id is a reference to another record in the people table.

The weakness with reference fields becomes apparent as soon as you want to do many-to-many relationships, or store data about the relationship. A person can have many friends; and you might want to track the date the friendship link was created, or whether the two people are married.

The solution to this in a SQL database is a join table. In the people/friends example, your join table might be called “friendships”. But this method has some weaknesses. One is that it can greatly increase the number of tables in your database, and may make it hard to tell apart standard tables (nodes) from join tables (edges) - which makes it more difficult for new developers to comprehend the database architecture. Another problem is that ORMs, which work quite well for mapping node (model) tables, generally have a much harder time mapping edges. (Witness all the thrashing about that happened during the development of has_and_belongs_to_many :through in ActiveRecord.)

But the biggest weakness is that queries against relationship data - be it in join table or a reference link - are extremely unwieldy. In a SQL database it typically leads to recursive joins, which tend to lead to long, incomprehensible SQL statements and unpredictable performance.

A graph database is designed to represent this type of information, so it models the data more naturally. It’s also designed to query it: you can walk the data in a convenient and performant manner.

I’ve yet to try using a graph database, but the concept is intriguing. It’s yet another reminder that not every data modeling problem can be solved with the same hammer.

a tornado of razorblades

Graph Databases