Web apps are increasingly focused on background jobs. In fact, the term “background job” almost seems inaccurate - the heavy lifting done by worker processes is often the meat of the app’s purpose. The web portion of the app, by comparison, does only the relatively lightweight work of putting job requests into queues, and later presenting the results of jobs as HTML or JSON.
I’ve previously written about queueing via Delayed Job. DJ uses your database as its backend, which is a great way to start, but doesn’t scale well.
I’ve also described Minion backed by RabbitMQ for a more robust queueing solution. While I love Minion’s simple jobs DSL, RabbitMQ can feel like overkill for apps that aren’t huge distributed systems. AMQP is a complex protocol with lots of capabilities outside the scope of job queueing. These capabilities become dead weight for most apps, which only need a way to enqueue and work jobs. I find this especially poignant when I’m building an app that uses Sinatra, Redis, and Memcache. RabbitMQ’s ponderous footprint doesn’t fit in with these nimble backend daemons.
Discovering Beanstalk
Ilya Grigorik pointed me toward Beanstalk, a job queueing backend inspired by Memcache. It’s simple, lightweight, and specialized entirely for job queueing. They use it at PostRank to process millions of jobs a day, so it does perform at scale.
I’ve found Beanstalk to be a joy to use. The difference between RabbitMQ and Beanstalk reminds me of the difference between Apache and Nginx, or between Squid and Varnish. It gives 80% of the functionality with 20% of the weight and complexity. The authors have definitely achieved their goal of making a job queueing backend which has the same clean simplicity as memcached.
Installation
On Mac OS X, install Beanstalkd like this:
$ sudo port install beanstalkd
(Or build from source.)
Running it couldn’t be simpler:
$ beanstalkd
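By default the daemon listens on TCP port 11300 and speaks a plain-text protocol, much like memcached. As a rough sketch of what a client sends, a put request is one command line carrying the priority, delay, time-to-run, and body size, followed by the body itself. The helper name build_put and its default values are assumptions made for illustration; a real app would use a client library rather than format requests by hand.

```ruby
# Hypothetical helper: formats a beanstalkd "put" request.
# Wire format: put <pri> <delay> <ttr> <bytes>\r\n<data>\r\n
def build_put(body, pri: 65536, delay: 0, ttr: 120)
  # The size field counts bytes, not characters.
  "put #{pri} #{delay} #{ttr} #{body.bytesize}\r\n#{body}\r\n"
end

request = build_put('hello')
puts request.inspect
```

Writing that string to a socket connected to the daemon should come back with an INSERTED reply carrying the new job's id.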
Stalker, a Minion-like Job Queueing DSL
The Ruby beanstalk client is extremely simple - put a string onto a queue, pull it off later. This is great, but it’s just a smidge too unstructured for my taste. So I wrote Stalker, a DSL almost identical to Minion, but for Beanstalk.
Enqueue jobs like so:
Stalker.enqueue('email.send', :email => 'joe@example.com')
In a jobs.rb file, define how to work each job:
require 'stalker'
require 'pony'
include Stalker

job 'email.send' do |args|
  # args arrive with string keys, hence args['email']
  Pony.mail(:to => args['email'], :subject => "Hello there!")
end
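To make the two halves concrete, here is a minimal in-memory sketch of how a DSL like this could be wired together. It is not Stalker's actual implementation: TinyJobs, its QUEUE array, and work_one are made-up names, and a plain Ruby array stands in for beanstalkd. It assumes the arguments travel as JSON, which would explain why a handler reads its arguments with string keys even though they were enqueued with symbols.

```ruby
require 'json'

# Toy stand-in for a job DSL; an Array plays the role of the queue server.
module TinyJobs
  HANDLERS = {}
  QUEUE = []

  # Register a block to run for a named job.
  def self.job(name, &block)
    HANDLERS[name] = block
  end

  # Serialize the job name and arguments, as a client might before a put.
  def self.enqueue(name, args = {})
    QUEUE << JSON.generate([name, args])
  end

  # Pull one job off the queue, run its handler, and return the result.
  def self.work_one
    name, args = JSON.parse(QUEUE.shift)
    handler = HANDLERS.fetch(name) { raise ArgumentError, "no handler for #{name}" }
    handler.call(args)
  end
end

TinyJobs.job('email.send') { |args| "mailing #{args['email']}" }
TinyJobs.enqueue('email.send', :email => 'joe@example.com')
puts TinyJobs.work_one
```

The symbol key :email goes in, and the string key 'email' comes out the other side of the JSON round trip.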
Now you can run one or more worker processes to work your jobs. Stalker includes a handy binary:
$ stalk jobs.rb
[Sat Apr 17 14:13:40 -0700 2010] Working 3 jobs :: [ email.send twitter.post image.resize ]
By default, it will work all jobs you’ve defined. But you can also filter it down to a list by specifying job names on the command line:
$ stalk jobs.rb email.send,twitter.post
[Sat Apr 17 14:13:40 -0700 2010] Working 2 jobs :: [ email.send twitter.post ]
This will allow you to run one pool of workers for fast or high-priority jobs, and another pool for general work.
Features for Job Queueing
Though lightweight, Beanstalk’s tight focus on job queueing allows it to deliver many features that are extremely useful for that purpose. For example:
- Priorities - Give a number when queueing a job (0 is the most urgent; the protocol accepts any integer below 2^32) and it will jump ahead of all jobs already enqueued with a higher number.
- Persistence - Although beanstalkd stores its jobs in memory for speed and simplicity (à la memcached or redis-server), it can also save its state to a file so that you can cycle the beanstalkd process without losing any jobs.
- Federation - Fault-tolerance and horizontal scalability are provided the same way as with Memcache - through federation by the client. Take a look at how the Ruby client handles multiple beanstalkd servers; it’s really quite clever.
- Buried jobs - When a job causes an error, you can bury it. This keeps it around for later introspection and debugging (or even re-running it), while keeping it separated from active jobs.
- Timeouts - The default behavior for a job not acknowledged by its client (a client acknowledges a job by deleting it when finished) is to re-queue. This prevents failed jobs (particularly from a client that loses its connection partway through the job) from getting lost, which is the same purpose ack serves in AMQP. Delayed Job uses its locked_at and locked_by fields for this purpose, but it’s very easy for a worker which doesn’t exit cleanly to leave jobs in a jammed/stuck state. Beanstalk’s reserve, work, delete cycle, with a timeout to dereserve the job, means it’s impossible for a bad client to prevent a job from completing.
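The priority, bury, and timeout behaviors in the list above can be modeled in a few dozen lines. This is a toy in-memory sketch, not how beanstalkd is implemented: ToyQueue, its method names, and the explicit now: argument (used in place of a real clock so the example is deterministic) are all invented for illustration.

```ruby
# Toy model of the reserve / work / delete cycle with priorities and burying.
class ToyQueue
  Job = Struct.new(:id, :pri, :body, :state, :deadline)

  def initialize
    @jobs = []
    @next_id = 0
  end

  # Lower pri numbers are more urgent.
  def put(body, pri: 65536)
    @jobs << Job.new(@next_id += 1, pri, body, :ready, nil)
    @jobs.last.id
  end

  # Hand out the most urgent ready job and start its time-to-run clock.
  def reserve(now:, ttr: 120)
    release_expired(now)
    job = @jobs.select { |j| j.state == :ready }.min_by { |j| [j.pri, j.id] }
    return nil unless job
    job.state = :reserved
    job.deadline = now + ttr
    job
  end

  # The worker's acknowledgement: the job is done and can be forgotten.
  def delete(id)
    @jobs.reject! { |j| j.id == id }
  end

  # Set a failed job aside for later inspection.
  def bury(id)
    @jobs.find { |j| j.id == id }.state = :buried
  end

  def count(state)
    @jobs.count { |j| j.state == state }
  end

  private

  # Reserved jobs whose deadline has passed go back to ready.
  def release_expired(now)
    @jobs.each do |j|
      j.state = :ready if j.state == :reserved && j.deadline <= now
    end
  end
end

q = ToyQueue.new
q.put('resize image', pri: 500)
q.put('send email', pri: 10)

first = q.reserve(now: 0, ttr: 60)   # 'send email' wins on priority
# Suppose the worker crashes here without deleting the job.
again = q.reserve(now: 61, ttr: 60)  # the timeout has put it back in play
puts again.body
q.delete(again.id)
```

Because the job only disappears on an explicit delete, a worker that dies mid-job costs nothing more than the timeout before another worker picks the job up.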
Beanstalk’s features are described in more detail on the FAQ.
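Client-side federation, as mentioned in the list above, means the servers never talk to each other; the client simply spreads jobs across whichever servers it can reach. The sketch below shows one way that idea could work. It is not the Ruby client's actual strategy: FakeServer and ToyPool are hypothetical names, and in-memory objects stand in for real connections.

```ruby
# Stand-in for one beanstalkd connection; a real one would wrap a TCP socket.
class FakeServer
  attr_reader :name, :jobs
  attr_accessor :up

  def initialize(name)
    @name = name
    @jobs = []
    @up = true
  end

  def put(body)
    raise IOError, "#{name} is down" unless up
    jobs << body
  end
end

# Hypothetical client-side pool that spreads puts across servers.
class ToyPool
  def initialize(servers, rng: Random.new)
    @servers = servers
    @rng = rng
  end

  # Try servers in random order until one accepts the job.
  def put(body)
    @servers.shuffle(random: @rng).each do |server|
      begin
        server.put(body)
        return server.name
      rescue IOError
        next # dead server: move on to the next candidate
      end
    end
    raise IOError, 'no servers available'
  end
end

servers = [FakeServer.new('a'), FakeServer.new('b'), FakeServer.new('c')]
pool = ToyPool.new(servers)
servers[0].up = false # simulate losing one node
10.times { |i| pool.put("job #{i}") }
puts servers.map { |s| "#{s.name}=#{s.jobs.size}" }.join(' ')
```

Losing a node shrinks capacity but does not stop enqueueing, which is the fault-tolerance half of the bargain; adding a node to the list is the scalability half.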
Performance
Beanstalk feels very snappy overall. I ran some off-the-cuff benchmarks against a handful of Ruby-friendly queueing systems on my laptop, and here were my results:
system | enqueue | work |
---|---|---|
delayed job | 200 jobs/sec | 120 jobs/sec |
resque | 3800 jobs/sec | 300 jobs/sec |
rabbitmq | 2500 jobs/sec | 1300 jobs/sec |
beanstalk | 9000 jobs/sec | 5200 jobs/sec |
Don’t take these numbers too seriously, as I didn’t make any attempt to be rigorous or simulate real-world conditions. But they do give some quantitative support to my sense that Beanstalk is smokin’ fast.
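For anyone who wants to run a similarly informal comparison, a throughput harness can be as small as the sketch below. This is not the script behind the numbers above: jobs_per_sec is a hypothetical helper, and a Ruby array stands in for the queue backend, so swap in real enqueue and work calls to measure an actual system.

```ruby
require 'benchmark'

# Hypothetical harness: times a block called `count` times and
# returns the throughput in jobs per second.
def jobs_per_sec(count)
  elapsed = Benchmark.realtime { count.times { |i| yield i } }
  count / elapsed
end

queue = [] # stand-in for a real queue backend
enqueue_rate = jobs_per_sec(10_000) { |i| queue << "job #{i}" }
work_rate    = jobs_per_sec(10_000) { queue.shift }

puts format('enqueue: %.0f jobs/sec, work: %.0f jobs/sec', enqueue_rate, work_rate)
```

Measuring enqueue and work separately matters, since the two rates can differ widely for the same system.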
Wrapup
A port of QFeedreader to Stalker requires only a few lines of code changed, but we get to cut out a ton of dependency gems required for the AMQP backend. Judged by weight of dependencies removed, switching to Beanstalk/Stalker looks favorable.
One thing still lacking in the Beanstalk community is good introspection tools - something that, so far, only Resque has made much progress on. Some command-line tools exist, which indicates that the Beanstalk protocol has all the introspection capabilities necessary. So building a user-friendly introspection interface (command line or web) seems entirely possible.
Another thing missing from Beanstalk is authentication. The authors probably assume that you’re running in a traditional environment with IP/firewall-based access control, but this doesn’t jibe with cloud environments. Memcached recently added SASL to solve this. I asked about this on the mailing list and it seems the Beanstalk author(s) are open to this possibility.
Lastly, I note that right now the only queueing system available as a service is Amazon SQS. Beanstalk would make a beautiful multitenant cloud service - very similar to the way MongoHQ is running MongoDB as a service. I sense there is a great opportunity here for someone to found a Beanstalk-as-a-service startup.