a tornado of razorblades

Applying the Unix Process Model to Web Apps

2011-05-09T09:22:53-07:00

The unix process model is a simple and powerful abstraction for running server-side programs. Applied to web apps, the process model gives us a unique way to think about dividing our workloads and scaling up over time.

Process model basics

Let’s begin with a simple illustration of the basics of the process model, using a well-known unix daemon: memcached.

Download and compile it:

$ wget http://memcached.googlecode.com/files/memcached-1.4.5.tar.gz
$ tar xzf memcached-1.4.5.tar.gz 
$ cd memcached-1.4.5
$ ./configure
$ make

Run the program:

$ ./memcached -vv
...
<17 server listening (auto-negotiate)
<18 send buffer was 9216, now 3728270

This running program is called a process.

Running manually in a terminal is fine for local development, but in a production deployment we want memcached to be a managed process. A managed process should run automatically when the operating system starts up, and should be restarted if the process crashes or dies for any reason.

We can use a process manager to put processes under management. There are many process managers, but operating systems usually have defaults. On OS X, launchd is the built-in process manager; on Ubuntu, Upstart is the built-in process manager.

Let’s set up memcached to run as a managed process on Ubuntu. Write an Upstart config:

/etc/init/memcached.conf

description "Memcached"
exec /usr/bin/memcached >> /var/log/memcached.log
start on runlevel [345]
respawn

We can now tell Upstart to start our process for the first time:

$ start memcached
memcached start/running, process 1212

The memcached process is now running in the background, managed by the process manager, with its output stream going to /var/log/memcached.log.
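
To make the respawn behavior concrete, here is a minimal sketch in Ruby of the core loop a process manager runs: start a command, wait for it to exit, and start it again. This is an illustration under stated assumptions, not how Upstart or launchd is implemented; the supervise method and its max_restarts limit are hypothetical names invented for this example, and a real manager also handles boot-time startup, signals, and logging.

```ruby
# A toy supervisor: start a command and start it again each time it exits.
# Illustrative only; this is not how Upstart or launchd is implemented.
def supervise(command, max_restarts: 3)
  statuses = []
  (max_restarts + 1).times do |attempt|
    pid = Process.spawn(command)      # start the child process
    _, status = Process.wait2(pid)    # block until it exits or crashes
    statuses << status.exitstatus
    warn "attempt #{attempt + 1}: pid #{pid} exited with status #{status.exitstatus}"
  end
  statuses                            # exit status of every run, in order
end

# A command that always fails stands in for a crashing daemon.
supervise("sh -c 'exit 3'", max_restarts: 2)
```

The restart limit exists only so the sketch terminates; a production process manager would typically keep restarting, usually with some form of rate limiting.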

Now that we’ve established a baseline for the process model, we can put its principles to work in a more novel way: running a web app.

Mapping the unix process model to web apps

A server daemon like memcached has a single entry point, meaning there’s only one command you run to invoke it. Web apps, on the other hand, typically have two or more entry points. Each of these entry points can be called a process type.

A basic Rails app will typically have two process types: a Rack-compatible web process (such as Webrick, Mongrel, or Thin), and a worker process using a queueing library (such as Delayed Job or Resque). For example:

Process type	Command
web	bundle exec rails server
worker	bundle exec rake jobs:work

A basic Django app looks strikingly similar: a web process can be run with the manage.py admin tool, and background jobs can be run via Celery.

Process type	Command
web	python manage.py runserver
worker	celeryd --loglevel=INFO

Process types differ for each app. For example, some Rails apps use Resque instead of Delayed Job, or have multiple types of workers. Every app needs to declare its own process types.

Declaration of process types is conceptually similar to declaration of dependencies. In the Ruby world, Gem Bundler and the Gemfile give us a declarative, canonical way to specify the gem dependencies for an app. We need the equivalent of Gemfile and Bundler, but for process types.

Procfile, a format to declare your process types

Procfile is an extremely simple file format which allows you to declare the process types your app uses. Its format is one process type per line, with each line formatted as:

<process type>: <command>

A Rails app might have a Procfile like this:

web:    bundle exec rails server -p $PORT
worker: bundle exec rake jobs:work

One purpose for this is structured documentation - a developer can view the Procfile to see the app’s process architecture, just as they can view the Gemfile to see its dependencies. But the greater utility of Procfile lies in our ability to parse the file and run the app’s processes automatically.
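
Because the format is so simple, parsing it takes only a few lines. The following is a rough sketch, not Foreman's actual parser: the parse_procfile method and the PROCFILE_LINE pattern are hypothetical, and the set of characters allowed in a process type name is an assumption made for this example.

```ruby
# Parse Procfile text into a hash of { process type => command }.
# The pattern is an assumption for this sketch: a name made of letters,
# digits, underscores, or hyphens, then a colon, then the command.
PROCFILE_LINE = /\A([A-Za-z0-9_-]+):\s*(.+)\z/

def parse_procfile(text)
  text.each_line.with_object({}) do |line, types|
    match = PROCFILE_LINE.match(line.strip)
    types[match[1]] = match[2] if match   # skip blank or malformed lines
  end
end

procfile = <<~'PROCFILE'
  web:    bundle exec rails server -p $PORT
  worker: bundle exec rake jobs:work
PROCFILE

parse_procfile(procfile).each do |type, command|
  puts "#{type} => #{command}"
end
```

Only the first colon on a line separates the process type from the command, so a command such as rake jobs:work survives intact.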

Foreman, a process manager for local development

Foreman is a handy command-line tool written by David Dollar. It reads your Procfile and runs one process for each process type declared by your app.

Install it:

$ gem install foreman

If you’ve written a Procfile (such as the one shown in the previous section) and put it in the root of your app, you can now run it like this:

$ foreman start

Foreman runs one process for each process type that we’ve declared. Once running, the output streams for each running process are conveniently interleaved in the foreground on our terminal. Each line is prefixed with a timestamp and the name of the running process, and color-coded by which process emitted which line.

Foreman is a process manager in the same sense as launchd or Upstart, but tailored to the needs of app development. It runs only a single app at a time, with all processes in the foreground, and terminates if any process crashes or if you press Ctrl-C.
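
The interleaving described above can be sketched in a few lines of Ruby. This is an approximation for illustration, not Foreman's implementation: format_line and run_all are hypothetical names, the prefix layout is an assumption, and unlike Foreman this sketch neither color-codes output nor shuts everything down when one process crashes.

```ruby
# Format one output line the way a development process manager might:
# timestamp, process name, then the line itself. The exact layout is an
# assumption of this sketch, not Foreman's documented format.
def format_line(name, line, now = Time.now)
  "#{now.strftime('%H:%M:%S')} #{name.ljust(8)} | #{line.chomp}"
end

# Start every command, read each one's output as a stream, and yield the
# prefixed lines to the caller as they arrive.
def run_all(process_types)
  lock = Mutex.new
  threads = process_types.map do |name, command|
    Thread.new do
      # Merge the child's stderr into its stdout so we read a single stream.
      IO.popen(command, err: [:child, :out]) do |io|
        io.each_line do |line|
          lock.synchronize { yield format_line(name, line) }
        end
      end
    end
  end
  threads.each(&:join)   # wait for all processes to finish
end

# Two short-lived commands stand in for a web and a worker process.
run_all("web" => "echo listening", "worker" => "echo ready") do |line|
  puts line
end
```

One thread per process keeps the streams independent, and the mutex ensures that lines from different processes never get mixed together mid-line.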

Using Procfile for deployment

Bundler has a --deployment command-line option, allowing you to use your app’s Gemfile to set up gems on your production server. Procfile and Foreman can be used in a similar fashion, using a feature of Foreman to export to a process manager format of your choice.

For example, let’s deploy a Procfile-backed Rails app to an Ubuntu server, selecting Upstart as the export format. As root, run the following from wherever your Procfile is located:

$ foreman export upstart /etc/init
[foreman export] writing: /etc/init/myapp.conf
[foreman export] writing: /etc/init/myapp-web.conf
[foreman export] writing: /etc/init/myapp-web-1.conf
[foreman export] writing: /etc/init/myapp-worker.conf
[foreman export] writing: /etc/init/myapp-worker-1.conf
$ start myapp
myapp start/running, process 28572

Your app is now running as two managed processes. You can use all of Upstart’s control capabilities, such as restarting the app when deploying a new release of your code:

$ restart myapp
myapp start/running, process 28591

Process types vs processes

To scale up, we’ll want a full grasp of the relationship between process types and processes.

A process type is the prototype from which one or more processes are instantiated. This is similar to the way a class is the prototype from which one or more objects are instantiated in object-oriented programming.

Here’s a visual aid showing the relationship between processes (on the vertical axis) and process types (on the horizontal axis):

Processes, on the vertical axis, are scale. You grow in this direction when you need more concurrency for the type of work handled by that process type. When you export, Foreman lets you specify the concurrency for each process type with the -c option. To get a process formation matching the diagram, you’d use this command:

$ foreman export upstart /etc/init -c web=2 -c worker=4 -c clock=1
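
To see the class-and-instance relationship in code, here is a small Ruby sketch that expands process types into individual processes. The expand_formation method, the "type.N" naming, and the default of one process for unlisted types are all assumptions made for this illustration, not Foreman behavior confirmed by this post.

```ruby
# Expand process types into individual processes according to a formation.
# Types missing from the formation default to one process in this sketch.
def expand_formation(process_types, formation)
  process_types.flat_map do |type, command|
    count = formation.fetch(type, 1)
    (1..count).map { |n| { name: "#{type}.#{n}", command: command } }
  end
end

process_types = {
  "web"    => "bundle exec ruby web.rb -p $PORT",
  "worker" => "bundle exec rake jobs:work",
  "clock"  => "bundle exec clockwork clock.rb"
}
formation = { "web" => 2, "worker" => 4, "clock" => 1 }

expand_formation(process_types, formation).each do |process|
  puts "#{process[:name]}: #{process[:command]}"
end
```

Three process types become seven processes, each running the same command as its siblings of the same type.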

Process types, on the horizontal axis, are workload diversity. Each process type specializes in a certain type of work.

For example, some apps have two types of workers, one for urgent jobs and another for long-running jobs. By subdividing into more specialized workers, you can get better responsiveness on your urgent jobs and more granular control over how to spend your compute resources.

Scheduling work at a certain time of day (e.g., the equivalent of cron) can be achieved with a specialized process type: a library like resque-scheduler or Clockwork can be run as a singleton process for a very flexible cron replacement. Consuming the Twitter streaming API is another type of specialized work best served by a singleton process.

Pulling all of these potential use cases together, here’s an example of a Procfile for an app with five process types: a Sinatra web app, two types of Resque workers, a singleton clock with Clockwork, and a singleton ruby script consuming the Twitter streaming API:

web:          bundle exec ruby web.rb -p $PORT
fastworker:   QUEUE=urgent bundle exec rake resque:work
slowworker:   QUEUE=*      bundle exec rake resque:work
clock:        bundle exec clockwork clock.rb
tweetscan:    bundle exec ruby tweetscan.rb

When we run this Procfile with Foreman, we’ll get five processes - one for each process type. In production, we can use Foreman’s concurrency argument to fan out to dozens or even hundreds of running processes, potentially spread out across multiple machines.

Conclusion

The unix process model is a powerful way to approach running your web app. Procfile gives us a way to declare process types, and Foreman gives us an easy way to run the app’s processes in both development and deployment environments.

How To Scale a Development Team

2011-04-28T09:56:38-07:00

As hackers, we’re familiar with the need to scale web servers, databases, and other software systems. An equally important challenge in a growing business is scaling your development team.

Most technology companies hit a wall with dev team scalability somewhere around ten developers. Having navigated this process fairly successfully over the last few years at Heroku, I’ll use this post to present what I see as the stages of life in a development team, and the problems and potential solutions at each stage.

Stage 1: Homebrewing

In the beginning, your company is 2 - 4 guys/gals working in someone’s living room, a cafe, or a coworking space. Communication and coordination are easy: with just a few people sitting right next to each other, everyone knows what everyone else is working on. Founders and early employees tend to be very self-directed so the need for management is nearly non-existent. Everyone is a generalist and works on a little bit of everything. You have a single group chat channel and a single all@yourcompany.com mailing list. There’s no real need to track any tasks or even bugs. A full copy of the state of the entire company and your product is easily contained within everyone’s brain.

At this stage, you’re trying to create and vet your minimum viable product, which is a fancy way of saying that you’re trying to figure out what you’re even doing here. Any kind of structure or process at this point will be extremely detrimental. Everyone has to be a generalist and able to work on any kind of problem - specialists will be (at best) somewhat bored and (at worst) highly distracting because they want to steer product development into whatever realm they specialize in.

Stage 2: The first hires

Once you’ve gotten a little funding and been able to hire a few more developers, for a total of 5 - 9, you may find that the ad-hoc method of coordination (expecting to overhear everything of importance by sitting near teammates) starts to break down. You have both too much communication (keeping tabs on six other people’s work is time-consuming) and too little communication (you end up colliding on trying to fix the same bug, answer the same support email, or respond to the same Nagios page).

At this point, you want to add just a sprinkle of structure: maybe an iteration planning on Monday, daily standups, and tracking big to-do items and bugs on a whiteboard or in a simple tool like Lighthouse. Perhaps you switch to a support system like Zendesk where incoming support requests can be assigned, and you add a simple on-call rotation for pages via PagerDuty. Your single internal chat and email channels continue to work fine.

Resist the urge to introduce too much structure and process at this point. Some startups, on reaching this stage, declare “we’ve got to grow up and act like a real company now” and immediately try to switch to heavy-handed tactics: full-fledged Scrum, heavyweight tools like Jira, or hiring a project manager or engineering manager. Don’t do that stuff. You’ve got a team that works well together in an ad-hoc way; you probably have some natural leaders on the team who direct a lot of the work while still being hands-on themselves; and while your product is launched and in the hands of users, in many ways you’re still trying to figure out what your company is really all about. Introducing bureaucracy into this environment is almost guaranteed to block you from doing what you’re really supposed to be doing, which is pivoting in search of your scalable business model.

Focus at this stage is key. Everyone is still a generalist, but the whole development team should be aligned behind a single goal (aka milestone) at a time. If you try to attack multiple battlefronts at once, you’ll do everything badly. Great companies are more likely to die of indigestion from too much opportunity than of starvation from too little. Pick your battles carefully and stay focused.

Crisis on the brink of Stage 3

Grow to 10 - 15 developers, and you’re on the verge of a major team structure change. I’ve been told that many promising startups have been killed by failing to weather the transition between these stages.

With this many developers, iteration planning, standups, or any other kind of development-team meeting has become so big that the attendees spend most of their time bored. Any individual developer will find it difficult to find a sense of purpose or shared direction in the midst of trudging through laundry lists of details on other people’s work.

In programming, when a class or source file gets too big, the solution is to break it down into smaller pieces. The same principle holds for scaling a development organization. You need to break into targeted teams.

Stage 3: Breaking into teams

Dividing your single team of generalists is harder than it sounds. Draw the fences in the wrong place, and you’ll create coordination problems that make things even worse. Find the right places to divide and you’ll see a massive increase in focus, happiness, and productivity.

The key to a good team is a well-defined sphere of authority, with clear interfaces to other teams. The team should own the vision and direction for the part of your product that it works on. It should be able to operate with maximum autonomy on everything it owns without having to ask for permission or information from other teams, except for the infrequent case of a feature or bug that crosses team boundaries.

A close mapping between your software architecture and your team architecture will be a big help here. By this time you have probably already converted your monolithic application into a distributed system of multiple components communicating over REST, AMQP, or another RPC mechanism. (And if not, you should strongly consider doing so, coincident with your dev team split.) There should be an obvious mapping between software components - each of which has its own source repository and deployment location/procedure - and your nascent teams.

Deciding what person goes on what team will be somewhat arbitrary at first. My approach was to sit down with each developer and dig in to try to understand what parts of the system they were most passionate about working on. From there I divided up the teams as best I could. Some people found perfect homes on their first team assignment, others were dissatisfied and needed to transfer to another team fairly quickly. Over time, the team territories became very well-defined, so it became much easier to slot new hires in the right place. Let developers follow their own passions and they will gravitate toward the team where they will do the best work.

Separately, you should have found your product/market fit by this point. If you’ve grown to this size and are still figuring out your company’s meaning for existence, you’ve got big problems. If that’s the case, stop growing, and scale back down until you nail product/market fit.

Specialization

Another reason to break into teams is specialization. Types of engineering specialists include ops engineers/sysadmins, infrastructure developers, front-end web developers, back-end web developers, business engineers / data analysts, and developers who focus on a particular language. Language specialists are becoming more common, because many internet-scale companies write high-concurrency components in functional programming languages like Erlang, Scala, or Clojure, generally handled by a different set of developers than the authors of the Ruby, Python, or PHP web components.

Early on, specialists are rarely desirable. There are too many different layers to work on in delivering a software product relative to the number of people available to contribute, so everyone pitches in on everything. A single developer may end up doing work as far-ranging as ops projects like kernel updates on the OS and front-end projects like writing jQuery effects for the UI.

Once you reach the point where you’ve got a dozen developers, your product has reached a level of usage and maturity where the problems are getting much harder. Scaling the database is something that is not only a full-time job, but requires a deep level of specialized knowledge that can’t be acquired if that person is also simultaneously learning to be a jQuery expert and an iOS expert and an Erlang expert.

You need people who can and are willing to focus on just a few closely related areas so that they can build very deep knowledge in those areas. Some of these will be your existing generalists deciding to specialize, and some will be new hires. You can now hire for the kind of specialist that would not have been appropriate when your company was smaller. Generalists are always useful to have around, and some of them may move into management - filling business owner roles for a team, rather than hands-on development.

Heroku's first teams

Heroku’s initial team breakdown looked like this:

API - Owns our user-facing web app and the matching Heroku client gem.
Data - Builds and runs our PostgreSQL-as-a-service database product.
Ops - Shepherds and protects availability of the production system.
Routing - Manages everything necessary to get HTTP requests routed to user web processes.
Runtime - Handles packaging code for deploy and starting/stopping/managing user processes.

Each of these teams owns between one and five components. For example, the API team owns the Rails app which runs at api.heroku.com and the Heroku client gem. The Data team owns the provisioning and monitoring tool for our database service, as well as all of the individual running databases. (Peter van Hardenberg was the intrapreneur who founded and now leads our Data team. He tells a bit of that story in the later part of this video.)

Team size and roles

For us, the ideal team layout has been two developers and one business owner. One developer is not enough over the long term (they need a second pair of eyes on the code, and besides, one is a lonely number). Three developers works fine as well. Get to four or five and things start to become a bit crowded; there may not be enough surface area for them to all work without stepping on each other’s toes constantly. Almost all of Heroku’s teams have two developers.

“Business owner” is a somewhat clumsy term, but it’s the best we’ve come up with to describe the person doing some combination of product management, project management, and general management for the team. The business owner fills the important role of knowing the business value of the team’s work to the company and how it fits in with the larger product. They can broker cross-team communication, help prioritize projects and tasks by business value, and may provide status reports on the team’s progress or presentations to the senior executives and/or the entire company to justify the team’s ongoing existence.

I’m a fan of hacker-entrepreneurs in the business owner role: a strong technical background means they have an in-depth understanding of the work being done, and are able to command huge respect from those whose work they are directing. This sort of person is not necessarily available for all teams, but find them when you can. In many cases it involves quite a bit of convincing to get a hacker to give up coding as their primary function.

Avoid having developers belong to more than one team. They are makers and need to be able to focus their full attention on their team’s current projects without distraction or attempts at multitasking. Business owners, however, can sometimes belong to multiple teams. It’s not always a full-time job, and there are benefits to cross-team communication by having one person be a business owner for two or more related teams.

Cohesion

In the earlier stages, you should avoid attacking on multiple battlefronts, and instead keep all developers focused around a single goal for the company. With creation of fiefdoms for each team, this has changed. Now you can and should attack on multiple battlefronts. Each team should be executing independently against its own goals, and not worrying too much about what other teams are doing.

It’s awesome to be able to pursue three, four, five big goals simultaneously. A few months after breaking into teams at Heroku, we had a day where three different teams were all releasing major new features. It’s an incredible feeling.

But now you have a new problem: lack of cohesion. Your decentralized teams are setting their own roadmaps and deciding on features independently. But to avoid fragmentation in your product, someone needs to decide an overall direction and set of product values. More succinctly: you need a strategy.

But this post is long enough as it is. I’ll save discussion of cohesion and strategy for another time.

Ephemeralization

2011-04-07T12:45:11-07:00

Paul Graham’s essay on tablets referenced a fascinating term I hadn’t heard before: “ephemeralization.” Wikipedia describes it as “the ability of technological advancement to do ‘more and more with less and less until eventually you can do everything with nothing’.”

An example: video playback technology

Fifty years ago, the only option for watching video was an entire movie theater, with a huge projector fixed in place, and film reels the size of barrels for a single movie. The 1980s gave us VCRs and VHS tapes: a playback device that you could carry with two hands and that offered more features than a movie theater (like pause, fast forward, and rewind); the tapes were small enough that you could keep a reasonably sized movie library on your bookshelf. In the 1990s we got DVD players and DVDs, shrinking the playback device yet smaller, shrinking the movie media (tapes -> DVDs) yet smaller, and offering yet more features (like higher resolution).

In the 2000s, PlayStations, Xboxes, and computers appeared with built-in DVD players, shrinking the playback device to nothing (it became part of a device you already owned). And in 2010, with Netflix streaming, you have instant access to tens of thousands of movies without needing any physical media at all. The playback device and media have both shrunk to have no corporeal representation whatsoever (hence “ephemeralized”), yet you have access to more movies and more features for playback of those movies than ever in the past.

Ephemeralization at Heroku

Heroku is a company built on the premise that running software as a service can be ephemeralized. Where Netflix streaming eliminates dedicated playback devices and media, Heroku eliminates servers, routers, and most or all systems administration.

Ephemeralization is a core value of our engineering and product design approaches. I believe this has been a big part of our success: internally, it helps us succeed at building a scalable, maintainable infrastructure; and externally, it helps us succeed at offering a lean product which has not turned into a swiss army knife despite ever-expanding capabilities.

Machete, not a swiss army knife

One of our core values on product design is that we want to create a machete, not a swiss army knife. A machete is a simple tool that has wide application to many tasks. A swiss army knife is a complex tool that has specialized gadgets for each task you might want to perform. (See James' startup school talk, about 12m in, for further elaboration.)

Some examples of user-facing ephemeralization Heroku has executed:

Switching from our custom gems manifest to the community, off-the-shelf solution of Gem Bundler. This allowed us to maintain much less code and documentation, while offering a more sophisticated gem dependency system. Rails 3 comes with a Gemfile out of the box, so from the user’s perspective, the effort of declaring your version of Rails as a gem dependency has disappeared.
Our new logging system merges all logs into a single stream, so logs:cron is no longer a separate codepath. Users can still filter to just their cron logs, but this is done via a general-purpose filtering interface.
Those of you who have been using Heroku since the beginning will recall we once had a single sign-on system (known as heroku_user) that allowed you to use your Heroku user login to log into your app. This was a cool feature, but in the end the maintenance cost was high and the gain low. More general-purpose, standards-based solutions such as Google Apps login became commonplace (with tools like Warden to help Rubyists use them), so we removed heroku_user altogether and let users roll their own.

Each of these changes gave our platform a more machete-like user experience.

Ephemeralizing infrastructure

Internally, we’re always looking for opportunities to reduce or eliminate infrastructure.

One example of this was when we switched from a specialized server type for our main database (the one that contains our user, app, and billing records) to a database running on the same system we use to provision databases for Heroku user apps. Self-hosting gets us more leverage out of our existing database management and monitoring tools.

We also look to replace internally-built tools with off-the-shelf solutions whenever we can. Two examples of this are switching from a custom-built pager system to PagerDuty, and switching from a custom-built logging system to syslog and Splunk.

Each of these changes made managing and scaling our infrastructure substantially easier. Fewer moving parts means less to keep track of, less to worry about, and less to go wrong.

Applying ephemeralization at your company

If you decide that you’d like to apply this principle at your company, how do you do it?

Everyone in your company should be constantly pushing to do more with less. This means being willing to look at every component, every user-facing feature, and every line of code with a critical eye. Some questions you should be constantly asking:

What can we replace with third-party solutions? (like Heroku did with gems manifest -> Bundler, custom pager -> PagerDuty, and custom logger -> syslog/Splunk)
What user-facing features can be merged together to create a more machete-like UX? (like Heroku did with logs:cron -> logs)
Where can we generalize an existing system in order to have it take over the duties of a more specialized system? (like Heroku did with our specialized database server -> self-hosted database)
What can we eliminate completely when its cost vs benefit analysis comes up short? (like Heroku did with our heroku_user single sign-on system)

Proposals to ephemeralize a component or feature will sometimes be met with emotionally-charged responses from your team. It’s totally reasonable to feel attached to a component everyone has worked hard on and has been important historically. But realize that the component has value because it got you to where you are today, not necessarily because of its ongoing existence in the future.

Everything Heroku has ever ephemeralized out of our infrastructure was part of our journey to the product and infrastructure we have today. I don’t regret for a moment all the time I spent coding on our single sign-on solution, our old cron log fetcher, our gems manifest, or any of a host of other things that are either gone or are fading out of our product today.

Returning to the video example: DVDs were a fantastic bit of innovation and brought the world forward into the modern age of movie-watching. But our love of DVDs shouldn’t be a blocker to us adopting new technologies with more capabilities and a smaller footprint, like on-demand streaming video.

Conclusion

Every month, Heroku strives to do more with less. More: users, traffic, capabilities, and versatility for the users. Less: lines of code, components, moving parts, APIs, server types, tools. Ephemeralization is how we keep our product and our infrastructure lean and nimble over the long term.

Logs Are Streams, Not Files

2011-04-01T07:29:49-07:00

Server daemons (such as PostgreSQL or Nginx) and applications (such as a Rails or Django app) sometimes offer a configuration parameter for a path to the program’s logfile. This can lead us to think of logs as files.

But a better conceptual model is to treat logs as time-ordered streams: there is no beginning or end, but rather an ongoing, collated collection of events which we may wish to view in realtime as they happen (e.g. via tail -f or heroku logs --tail) or which we may wish to search in some time window (e.g. via grep or Splunk).

Using the power of unix for logs

Unix provides some excellent tools for handling streams. There are two default output streams, stdout and stderr, available automatically to all programs. Streams can be turned into files with a redirect operator, but they can also be channeled in more powerful ways, such as splitting the streams to multiple locations or pipelining the stream to another program for further processing.

A program that uses stdout for its logging can easily log to any file you wish:

$ mydaemon >> /var/log/mydaemon.log

(Typically you would not invoke this command directly, but would run it from an init program such as Upstart or systemd.)

Programs that send their logs directly to a logfile lose all the power and flexibility of unix streams. What’s worse is that they end up reinventing some of these capabilities, badly. How many programs end up re-writing log rotation, for example?
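
On the program side, logging to stdout takes very little code. Here is a minimal sketch using Ruby's standard Logger; the build_logger helper and the line format are assumptions chosen for this example, and any format would do as long as each event is written to stdout as one line.

```ruby
require "logger"

# Flush each line immediately so pipes and redirects see events promptly.
$stdout.sync = true

# Build a logger that writes one line per event to the given stream.
# No file path, no rotation: where the stream goes is decided by whoever
# runs the program.
def build_logger(io = $stdout)
  logger = Logger.new(io)
  logger.formatter = proc do |severity, time, _progname, message|
    "#{time.utc.strftime('%Y-%m-%dT%H:%M:%SZ')} #{severity} #{message}\n"
  end
  logger
end

log = build_logger
log.info("listening on port 5000")
log.warn("cache miss ratio is high")
```

Saved as a script (say, a hypothetical mydaemon.rb), this program can be redirected or piped like any other unix command, and the destination can change without touching the code.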

Distributed logging with syslog

Logging on any reasonably large distributed system will generally end up using the syslog protocol to send logs from many components to a single location. Programs that treat logs as files are now on the wrong path: if they wish to log to syslog, each program needs to implement syslog internally - and provide yet more logging configuration options to set the various syslog fields.

A program using stdout for logging can use syslog without needing to implement any syslog awareness into the program, by piping to the standard logger command available on all modern unixes:

$ mydaemon | logger

Perhaps we want to split the stream and log to a local file as well as syslog:

$ mydaemon | tee /var/log/mydaemon.log | logger

A program which uses stdout is equipped to log in a variety of ways without adding any weight to its codebase or configuration format.

Other distributed logging protocols

Syslog is an entrenched standard for distributed logging, but there are other, more modern options as well. Splunk, fast becoming an indispensable tool for anyone running a large software service, can accept syslog; but it also has its own custom protocol which offers additional features like authentication and encryption. Scribe is another example of a modern logging protocol.

Programs that log to stdout can be adapted to work with a new protocol without needing to modify the program. Simply pipe the program’s output to a receiving daemon just as you would with the logger program for syslog. Treating your logs as streams is a form of future-proofing for your application.
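
The receiving end of such a pipe can be equally small. The sketch below shows the general shape under stated assumptions: forward_logs is a hypothetical helper, and the sender here just collects lines in memory, where a real one would speak Splunk's, Scribe's, or some other protocol over the network.

```ruby
# Read log lines from any stream and hand each one to a protocol-specific
# sender. Returns the number of lines forwarded.
def forward_logs(input, sender)
  count = 0
  input.each_line do |line|
    sender.call(line.chomp)
    count += 1
  end
  count
end

# A stand-in sender that collects lines in memory; a real one would
# transmit each line using the target logging protocol.
received = []
collect = ->(line) { received << line }

forward_logs("GET / 200\nGET /missing 404\n", collect)
puts received.inspect
```

In a real pipeline the input would be $stdin rather than a string. Swapping protocols means swapping the sender, while the program producing the logs stays unchanged.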

Logging in the Ruby world

Most Rack frameworks (Sinatra, Ramaze, etc) and Rack webservers (Mongrel, Thin, etc) do the right thing: they log to stdout. If you run them in the foreground, as is typical of development mode, you see the output right in your terminal. This is exactly what you want. If you run in production mode, you can redirect the output to a file, to syslog, to both, or to any other logging system that can accept an input stream.

Unfortunately, Rails stands out as a major exception to this simple principle. It creates its own log directory and writes various files into it; some plugins even take it upon themselves to write their own, separate logfiles. This hurts the local development experience: what you see in your terminal isn’t complete, so you have to open a separate window with tail -f log/*.log to get the information you want. But it hurts the deployment experience even more, because you end up having to tinker around with a bunch of Rails logger configuration options to get your logs from all your web machines to merge into a single stream.

Logging on Heroku

The need to treat application logs as a stream is especially poignant with Heroku's new logging system. On the backend, we route logs with a syslog router written in Erlang called Logplex.

Logplex handles input streams (which we call “sinks”) from many different sources: all the dynos running on the app, system components like our HTTP router, and (currently in alpha) logs from add-on providers. Sinks are merged together into a channel (each app has its own), which is a unified stream of all logs relevant to the app. This allows developers to see a holistic view of everything happening with their app, or to filter down to logs from a particular type of sink (for example: just logs from the HTTP router, or just logs from worker processes).

Further, log streams can also be sent outbound, which we call “drains.” Users can configure syslog drains, and we’re currently working up a technical design for how add-on providers can automatically add drains. This latter item will enable a new class of log search and archival add-on, most notably the emerging syslog-as-a-service products like Loggly and Papertrail.

This logging system works quite well, and it gets even better with the new features on the way - but it only works where all programs output their logs as streams. Programs that write logfiles, such as Rails in its default configuration, don’t make sense in this world.

As a workaround, Heroku injects the rails_log_stdout plugin into Rails apps at deploy time. We’d prefer not to have to do this (injecting code is a dicey way to solve problems), but it’s the best way to get Rails logs into the app’s logstream without requiring extra configuration from the app developer.

Conclusion

Logs are a stream, and it behooves everyone to treat them as such. Your programs should log to stdout and/or stderr and omit any attempt to handle log paths, log rotation, or sending logs over the syslog protocol. Directing where the program’s log stream goes can be left up to the runtime container: a local terminal or IDE (in development environments), an Upstart / Systemd launch script (in traditional hosting environments), or a system like Logplex/Heroku (in a platform environment).

Memcached, a Database?

2010-07-19T11:53:00-07:00

In my QCon talk Horizontal Scalability via Transient, Shardable, Share-Nothing Resources, I argued that memcached is the father of modern shardable resources. Today’s NoSQL key-value stores all owe some part of their inspiration to memcached. Even feature-rich datastores such as CouchDB or Cassandra also borrow a cornerstone idea from memcached: throw away some features historically associated with databases in order to make big gains in scalability and resiliency.

Memcached was created to be a cache, as its name implies. But developers eventually discovered that it was useful for storing many types of transient data, such as sessions, page-view counters, or API rate limiting counters.

App developers storing data in memcached instead of their SQL database? Does that mean that memcached can be classified as a type of database system?

First Principles

To answer that question, we have to work our way back to a definition for the family of software typically referred to as “databases.” I’m going to use the term datastore, because it seems more natural when applied to modern NoSQL options. (For simplicity’s sake, let’s assume that datastore, database, database system, and DBMS are all roughly synonymous.)

Here’s my definition:

A datastore is software that stores atomic chunks of data known as records, and allows those records to be retrieved later.

Datastores are a superset that includes relational databases, graph databases, key-value stores, and document databases. DBM, Tokyo Cabinet, Redis, S3, MySQL, PostgreSQL, CouchDB, MongoDB, Neo4j, and Hadoop are all part of this big happy family. Now onto the question of whether memcached belongs here as well.

On Persistence

Many would argue that memcached should be disqualified from being considered a datastore on account of its transience.

My definition above says that you can retrieve the data you’ve stored later. But what’s the duration of “later”? We expect datastores to be persistent - if they aren’t, what’s the point? But persistence does not have to be forever. It only needs to last as long as the application logic requires.

MongoDB offers capped collections and Redis offers expiring keys; in both of these cases, the fact that the data does not persist forever is a feature. Memcached is a datastore which has extreme transience as a feature. How many times have application developers written nightly cron jobs to clean up old session data from their SQL datastore? Using memcached, you can skip this extra garbage-collection step. Memcached is a good fit for data that you want to last for a little while, but not forever.
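To make "persistent only as long as the application needs" concrete, here is a toy in-memory store with expiring records. It is an illustration of the idea only: TransientStore and its methods are hypothetical names, and this is not how memcached itself is implemented.

```ruby
# A toy datastore: it stores records and lets you retrieve them later,
# where "later" is bounded by a time-to-live chosen by the application.
class TransientStore
  Record = Struct.new(:value, :expires_at)

  # clock is any callable returning the current time in seconds.
  # Injecting it makes expiry easy to demonstrate without sleeping.
  def initialize(clock = -> { Time.now.to_f })
    @clock = clock
    @records = {}
  end

  # Store a value that should be retrievable for the next ttl seconds.
  def set(key, value, ttl)
    @records[key] = Record.new(value, @clock.call + ttl)
    value
  end

  # Return the value, or nil if it was never stored or has expired.
  def get(key)
    record = @records[key]
    return nil if record.nil?
    if @clock.call >= record.expires_at
      @records.delete(key) # expired: cleaned up lazily on read
      return nil
    end
    record.value
  end
end

now = 1000.0
store = TransientStore.new(-> { now })
store.set('session:42', { 'user_id' => 7 }, 1800) # keep for 30 minutes
store.get('session:42') # the session hash, while still fresh
now += 1801
store.get('session:42') # nil: the record has expired, no cleanup job needed
```

The expiry lives in the datastore rather than in a nightly cleanup job, which is the trade described above.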

Conclusion

Memcached set an early example for many patterns now prevalent in NoSQL. It got us thinking about how we can make trade-offs between datastore features and ease of scaling. Memcached occupies the far extreme of this spectrum: it trades away almost every feature we associate with database systems, keeping just the bare minimum, and in return it gets blinding speed and near-infinite horizontal scalability. That trade proved to be a worthwhile one, as memcached is now a critical piece of infrastructure for many of the world's largest web apps.

The memcached case is a great example of how NoSQL is broadening how we think about data storage and retrieval. This has opened us up to a variety of specialized datastores: memcached, S3, and Hadoop, to pick some very successful examples. Each of these occupies a unique (and often very large) niche in the data storage space. We’ve learned that not all data is the same; the proliferation of options for how we store and retrieve our data is a natural consequence.

Replace Cron with Clockwork

2010-06-30T20:18:02-07:00

If your app needs to poll a remote API once an hour, or send out an email report every evening, what tool do you reach for? Probably cron. Triggering events at a given wall clock time is what cron is for, but it works better at the system layer (e.g. rotating logs on a server) than at the app layer (e.g. sending out a daily report to your app’s users). I’ve described all the ways cron could be improved for app clock events in a previous post.

My wishlist for an app-focused cron replacement, described in that post, can be fulfilled by a little hackery with a few available Ruby libraries (rufus-scheduler and resque-scheduler). But both of these libraries have weaknesses; so I decided to write my own, following their example of the lockless, single-process scheduler pattern.

The result is Clockwork.

Using Clockwork

First, the syntax for scheduling events:

every 1.hour, 'apis.poll'
every 1.day,  'reports.email', :at => '00:00'

A time period and a job name are the only required parameters. Options may include an hour and minute to run for daily jobs.

The job name is passed to your queueing system to enqueue a job, to be worked in one of your background job workers. (An important part of the lockless scheduler process pattern is that it never does any work itself, only queues up jobs for the workers to handle.) In order to make Clockwork queueing system-agnostic, the second bit of code you need is a small handler block that declares how to enqueue a job.

For example, if you’re using my favorite combo, Beanstalk+Stalker, your handler block will look like this:

require 'stalker'
handler { |job| Stalker.enqueue(job) }

Put these two segments together into a file named clock.rb:

require 'stalker'
handler { |job| Stalker.enqueue(job) }

every 1.hour, 'apis.poll'
every 1.day,  'reports.email', :at => '00:00'

Running the Clock Process

To run, install the clockwork gem (gem install clockwork, or specify it in your Gemfile), and then run with the clockwork binary:

$ clockwork clock.rb
[2010-06-28 11:27:42 -0700] Starting clock for 2 events: [ apis.poll reports.email ]

Or with Bundler: bundle exec clockwork clock.rb

More details about the use and operation of Clockwork can be found in the readme.

A Sample Application

To illustrate what Clockwork would look like in a full application, I’ve written a sample app which fetches the Dow Jones index from Google Finance once every three minutes. The clock process enqueues the fetch job. The worker works the job, pulling down the index from the remote API, and storing the result in the database. The web app pulls from the database, showing the user all historic data points.

I wrote the same app with two web framework / database / queue combos, so pick the one that suits your style:

In both cases, the app has three processes: the web process (serving web requests to the user), the clock process (enqueuing jobs periodically), and the worker process (working the job to fetch data from the remote API and store it in the database).

I can’t overemphasize the importance of the clock process being separate from your worker process. The reason for this is that the clock is not horizontally scalable (and doesn’t need to be); but your worker processes are fully parallelizable. In a real app, you’d run two, four, ten, or a hundred workers. You will only ever have one clock. The clock process can and must stay lightweight, doing no more than queueing jobs when the appropriate wall clock time is reached.
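To show why the clock can stay so lightweight, here is a toy version of the single-process scheduler pattern. This is a sketch under my own assumptions, not Clockwork's actual implementation; ToyClock, every, and tick are hypothetical names.

```ruby
# Toy illustration of a lockless, single-process scheduler.
# It never does any work itself; it only hands job names to a queue.
class ToyClock
  Event = Struct.new(:period, :job, :last_run)

  # The block passed in declares how to enqueue a job name.
  def initialize(&enqueue)
    @enqueue = enqueue
    @events = []
  end

  # Register a job to be enqueued every `period` seconds.
  def every(period, job)
    @events << Event.new(period, job, nil)
  end

  # Called once per loop iteration with the current time in seconds:
  # enqueue whatever is due and record when it last ran.
  def tick(now)
    @events.each do |event|
      next unless event.last_run.nil? || now - event.last_run >= event.period
      @enqueue.call(event.job)
      event.last_run = now
    end
  end
end

queued = [] # stand-in for a real queueing backend
clock = ToyClock.new { |job| queued << job }
clock.every(3600, 'apis.poll')
clock.every(86_400, 'reports.email')

clock.tick(0)    # first tick: both jobs are enqueued
clock.tick(1800) # half an hour later: nothing is due
clock.tick(3600) # one hour in: only 'apis.poll' is enqueued again
```

In a real clock process, tick would be called from a loop with the current time and a short sleep between iterations. Because there is only one such loop, and it only enqueues, there is nothing to lock.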

Conclusion

Replacing a tried-and-true tool like cron is not something to be undertaken lightly. However, after years of dissatisfaction with cron as a tool for app-level scheduling, I truly believe it’s time to try something different. I’ve been using Clockwork in a number of my own personal and work apps, and I’ve been very pleased with the results so far. Give it a try and tell me what you think.

Gluecon Slides

2010-05-29T23:18:25-07:00

Cloud Services

View more presentations from Adam Wiggins.

Startup Lessons Learned

2010-04-30T12:19:55-07:00

Like many folks in the startup crowd, I’m a reader of Eric Ries' blog (some links), and I’ve read Steve Blank’s Four Steps to the Epiphany. What I didn’t know is that these guys have joined forces to build a movement they are calling “lean startups.” After attending the Startup Lessons Learned conference last week, I now believe this methodology is on its way to making a major impact on the world of entrepreneurship.

Lean startup methodology has a lot in common with agile. But where agile applies to software, lean startups applies to customers and markets. Customer discovery, validation of markets, iteration on product, and intensive customer feedback are all part of the lean startup.

The energy at the conference reminded me of what Ruby conferences were like a few years ago. Charismatic, passionate, opinionated leaders draw together a crowd of strangers; and then those strangers look around to realize they are surrounded by people that share their passions. It’s the birth of community.

I took some notes during some of the talks. What follows are some of the quotes I jotted down, and some commentary.

Randy Komisar on Pivots

Randy Komisar wrote Getting to Plan B: Breaking Through to a Better Business Model. His thesis is: your first idea never works, but that’s ok. What’s really important is getting to the next idea, and the next and the next, zeroing in on something that will work - and all of this as quickly and as cheaply as possible. Transitioning between plans is called a pivot, a word that was in heavy use by most of the speakers at the conference.

Some quotes from Komisar:

“Plan A never works”
“‘Lean’ means get to the right answer with as little time and money as possible”
“I invest in people irrationally committed to a purpose” - Founders believe in a vision; maximizing their personal wealth is a side-effect, not a primary purpose. Being an entrepreneur is not a good way to make money, even though some people strike it rich.
“Leap of faith question” - The premise your startup is built on. What question can you ask, where the answer will make or break your business? For example, “People will pay more for outstanding design” might have been Apple’s leap of faith in the 2000s. “People will switch to using personal productivity software on the web” could have been 37Signals’ leap of faith.
“Once you decide to change, you will always wish you changed earlier”
“Everything is derivative - that’s not a bad thing. Steal liberally”
“We’ve got to zig and zag through the realities of the opportunities in front of us and the information they are giving us” - Founders aren’t founders because they know what to do. They’re founders because they can figure out what to do, quickly, in the face of rapidly changing information. This is why, for example, fixed business plans are of no use in a startup.

During the discussion with Randy, Eric Ries used the term “success theater” to describe what happens in boardrooms when plan A starts to go south. Instead of admitting “what we’re doing isn’t working, we need to try something else,” founders dress up the trajectory of the business in false clothes. This doesn’t help anyone in the long term.

Pivots are what startups do. The sooner that investors, founders, early employees, and early customers come to grips with this, the less heartache needs to surround each pivot, and the quicker you can get to the right answer.

Steve Blank on Entrepreneurship

Much as I like Four Steps to the Epiphany, I’ve never gotten much value from Steve Blank's blog - so I wasn’t expecting much from his talk. To my surprise, I was absolutely riveted. While Eric Ries is the father of the lean startup movement, Steve Blank is a very active and hands-on grandfather. His presentation was both enlightening and inspiring.

There was so much good stuff in this talk it’s hard to capture it all. A few quotes:

“A startup is a search for a scalable, repeatable business model”
“No business plan survives first contact with customers”
“Startups search and pivot. Large companies execute.”
“Founders make order from chaos”
“Lean startup is the first business methodology that is being crowdsourced and developed iteratively - we’re collectively getting smarter at a scary rate”
“My personal goal is to change the state of entrepreneurial education in the United States”
“In the 1950s, Venture Capital was called Adventure Capital”

Blank lays out the lifecycle of a scalable startup in three phases: search, build, grow.

Search - The one and only mission of the company in its early life is to search for a scalable business model. Nothing else matters. Small team, little to no management, very little of the formal trappings of a company. Staying lean, nimble, and chaotic is how you search rapidly. Formality and structure only slow you down.
Build - Once the business model is found (in technology, this usually comes in the form of a software product that people love and have demonstrated willingness to pay for) the company starts to build out. Here the team is expanding, infrastructure is being put in place, branding and market position clarified. The organization goes from feeling like a ragtag band of buddies working on something made out of passion and elbow grease to something that feels like a “real” company.
Growth - Everything is figured out, the company’s direction is decided: it’s now a matter of turning up the volume and continuing the business model on increasingly large scales. This is generally where the founders and many of the early employees of the company will make an exit. There are examples of founders who have stayed on through the final phase: Bill Gates, Steve Jobs, Larry Ellison. But these guys are the exception, not the rule (and that’s part of why they are famous). Founders need to be aware of, and prepared for, the likelihood that success means they have made themselves irrelevant in the organization they have built.

A Tale of Two Businessmen

Blank closed with a fascinating story about two figures involved in the early life of General Motors. The first was Alfred Sloan. Sloan was the CEO of General Motors in the early part of the 20th century. He’s widely recognized as the man that took GM to being the largest company in the world. Many business schools are named after him, and his managerial style was considered to be a pioneering approach that defined the new business of the 20th century.

The other player in this story is virtually unknown: Billy Durant. Durant founded GM and took it up to $3.6 billion in revenue (that number is adjusted for today’s dollars, if I’m recalling correctly). He was then fired by the board of directors, and he left to found Chevrolet. He quickly grew that company until it was bigger than GM, and then he bought GM. This guy was the Steve Jobs of his day - why don’t we remember him?

The answer is that the last century of business education has focused almost entirely on the last stage of a company’s life. Business degrees are MBAs, which Blank cautions are useless or perhaps even harmful in the early life of a startup. (MBAs working at a startup will try to apply their knowledge, creating structure and formality at a time when that’s the worst possible thing you can do.) Blank feels that entrepreneurial education should be separate from business education - B-schools can give out MBAs, and E-schools should give out MEAs.

He argues we’ve seen the first glimpse of this in the past several years, pioneered by Y Combinator. Blank points out that there are now over 100 (!) YC clones in operation, proof of the huge thirst for startup-focused education. He has a goal of bringing this entrepreneurial education into a more academic setting as well.

While he hasn’t done this yet (though he sounds quite serious about it), he offers up a small bit of entertainment to tide us over: the Durant School of Entrepreneurship, available in T-shirt form.

Beanstalk, a Simple and Fast Queueing Backend

2010-04-24T14:08:37-07:00

Web apps are increasingly focused on background jobs. In fact, the term “background job” almost seems inaccurate - the heavy lifting done by worker processes is often the meat of the app’s purpose. The web portion of the app, by comparison, does only the relatively lightweight work of putting job requests into queues, and later presenting the results of jobs as HTML or JSON.

I’ve previously written about queueing via Delayed Job. DJ uses your database as its backend, which is a great way to start, but doesn’t scale well.

I’ve also described Minion backed by RabbitMQ for a more robust queueing solution. While I love Minion’s simple jobs DSL, RabbitMQ can feel like overkill for apps that aren’t huge distributed systems. AMQP is a complex protocol with lots of capabilities outside the scope of job queueing. These capabilities become dead weight for most apps, which only need a way to enqueue and work jobs. I find this especially poignant when I’m building an app that uses Sinatra, Redis, and Memcache. RabbitMQ’s ponderous footprint doesn’t fit in with these nimble backend daemons.

Discovering Beanstalk

Ilya Grigorik pointed me toward Beanstalk, a job queueing backend inspired by Memcache. It’s simple, lightweight, and completely specialized on job queueing. They use it at PostRank to process millions of jobs a day, so it does perform at scale.

I’ve found Beanstalk to be a joy to use. The difference between RabbitMQ and Beanstalk reminds me of the difference between Apache and Nginx, or between Squid and Varnish. It gives 80% of the functionality with 20% of the weight and complexity. The authors have definitely achieved their goal of making a job queueing backend which has the same clean simplicity as memcached.

Installation

On Mac OS X, install Beanstalkd like this:

$ sudo port install beanstalkd

(Or build from source.)

Running it couldn’t be simpler:

$ beanstalkd

Stalker, a Minion-like Job Queueing DSL

The Ruby beanstalk client is extremely simple - put a string onto a queue, pull it off later. This is great, but it’s just a smidge too unstructured for my taste. So I wrote Stalker, a DSL almost identical to Minion, but for Beanstalk.

Enqueue jobs like so:

Stalker.enqueue('email.send', :email => 'joe@example.com')

In a jobs.rb file, define how to work each job:

include Stalker

job 'email.send' do |args|
  Pony.mail(:to => args['email'], :subject => "Hello there!")
end

Now you can run one or more worker processes to work your jobs. Stalker includes a handy binary:

$ stalk jobs.rb
[Sat Apr 17 14:13:40 -0700 2010] Working 3 jobs  :: [ email.send twitter.post image.resize ]

By default, it will work all jobs you’ve defined. But you can also filter it down to a list by specifying job names on the command line:

$ stalk jobs.rb email.send,twitter.post
[Sat Apr 17 14:13:40 -0700 2010] Working 2 jobs  :: [ email.send twitter.post ]

This will allow you to run one pool of workers for fast or high-priority jobs, and another pool for general work.

Features for Job Queueing

Though lightweight, Beanstalk’s laser-sharp focus on its singular purpose of job queueing allows it to deliver many features extremely useful for that purpose. For example:

Priorities - Give a number from 0 to 1000 when queueing a job and it will jump ahead of all jobs already enqueued with a higher number.
Persistence - Although beanstalkd stores its jobs in memory for speed and simplicity (ala memcached or redis-server), it can also save its state to a file so that you can cycle the beanstalkd process without losing any jobs.
Federation - Fault-tolerance and horizontal scalability are provided the same way as in Memcache: through federation by the client. Take a look at how the Ruby client handles multiple beanstalkd servers; it’s really quite clever.
Buried jobs - When a job causes an error, you can bury it. This keeps it around for later introspection and debugging (or even re-running it), while keeping it separated from active jobs.
Timeouts - The default behavior for jobs not acknowledged by a client (by deleting it when finished) is to re-queue. This prevents failed jobs (particularly from a client that loses its connection partway through the job) from getting lost, the same purpose as ack in AMQP. Delayed Job uses its locked_at and locked_by fields for this purpose, but it’s very easy for a worker which doesn’t exit cleanly to leave jobs in a jammed/stuck state. Beanstalk’s reserve, work, delete cycle, with a timeout to dereserve the job, means it’s impossible for a bad client to prevent a job from completing.
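The priority and timeout behavior above can be modeled in a few lines. This is a toy in-memory sketch of the idea only: ToyQueue and its methods are hypothetical names, and it does not use the beanstalkd protocol or the Ruby client API.

```ruby
# Toy model of priorities plus the reserve / work / delete cycle.
class ToyQueue
  Job = Struct.new(:id, :priority, :body, :reserved_until)

  # ttr is how many seconds a worker has to finish a reserved job.
  def initialize(ttr)
    @ttr = ttr
    @jobs = []
    @next_id = 0
  end

  # Lower priority numbers are worked first.
  def put(body, priority)
    @next_id += 1
    @jobs << Job.new(@next_id, priority, body, nil)
    @next_id
  end

  # Hand out the most urgent job that is not currently reserved.
  # A reservation that has timed out counts as ready again.
  def reserve(now)
    ready = @jobs.select { |j| j.reserved_until.nil? || j.reserved_until <= now }
    job = ready.min_by { |j| [j.priority, j.id] }
    job.reserved_until = now + @ttr if job
    job
  end

  # Acknowledge a finished job by deleting it.
  def delete(id)
    @jobs.reject! { |j| j.id == id }
  end
end

queue = ToyQueue.new(60)
queue.put('image.resize', 500)
queue.put('email.send', 10)

job = queue.reserve(0) # 'email.send' comes out first: priority 10 beats 500
queue.delete(job.id)   # the worker finished, so it acknowledges the job

queue.reserve(0)       # 'image.resize' is reserved, then suppose the worker dies
queue.reserve(30)      # nil: still inside the 60 second timeout
queue.reserve(61)      # 'image.resize' again: the reservation lapsed on its own
```

The point of the last three lines is that nobody had to clean up after the dead worker; the timeout alone puts the job back in play.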

Beanstalk’s features are described in more detail on the FAQ.

Performance

Beanstalk feels very snappy overall. I ran some off-the-cuff benchmarks against a handful of Ruby-friendly queueing systems on my laptop, and here were my results:

	enqueue	work
delayed job	200 jobs/sec	120 jobs/sec
resque	3800 jobs/sec	300 jobs/sec
rabbitmq	2500 jobs/sec	1300 jobs/sec
beanstalk	9000 jobs/sec	5200 jobs/sec

Don’t take these numbers too seriously, as I didn’t make any attempt to be rigorous or simulate real-world conditions. But they do give some quantitative support to my sense that Beanstalk is smokin’ fast.

Wrapup

A port of QFeedreader to Stalker requires only a few lines of code changed, but we get to cut out a ton of dependency gems required for the AMQP backend. Judged by weight of dependencies removed, switching to Beanstalk/Stalker looks favorable.

One thing still lacking in the Beanstalk community is good introspection tools - something that, so far, only Resque has made much progress on. Some command-line tools exist, which indicates that the Beanstalk protocol has all the introspection capabilities necessary. So building a user-friendly introspection interface (command line or web) seems entirely possible.

Another thing missing from Beanstalk is authentication. The authors probably assume that you’re running in a traditional environment with IP/firewall-based access control, but this doesn’t jibe with cloud environments. Memcached recently added SASL to solve this. I asked about this on the mailing list and it seems the Beanstalk author(s) are open to this possibility.

Lastly, I note that right now the only queueing system available as a service is Amazon SQS. Beanstalk would make a beautiful multitenant cloud service - very similar to the way MongoHQ is running MongoDB as a service. I sense there is a great opportunity here for someone to found a Beanstalk-as-a-service startup.

Rethinking Cron

2010-04-13T15:42:56-07:00

Cron is a trusty tool in the unix toolbox for scheduling work to run at periodic intervals. In addition to system tasks, it’s common for app developers to use an app-specific crontab to run application tasks. For example, if your app is a feed reader, you might use a cronjob to fetch new feeds every three hours, and another cronjob to clean out old unread articles every night.

Cron Weaknesses

While application crontabs have served us well enough, this technique has a number of weaknesses.

One problem is that cron is per-machine, so once you scale to multiple app servers you’ll need locks stored in a shared location (database or memcache) to avoid scheduling the same job twice. Those locks in turn require maintenance: cleaning up stale locks from cronjobs that exited abnormally or got stuck in an infinite loop. What was a one-line cronjob can quickly balloon into a whole mess of pidfiles, locks, and cleanup code.

Cron problems are difficult to debug. The arcane syntax of crontab is terse to the point of near inscrutability, making it easy to accidentally schedule jobs at the wrong time. And the subtle differences between a cronjob’s shell environment and your command prompt’s shell environment can be maddening. Lack of feedback makes these or any other problem with your cronjobs difficult to diagnose.

Lastly, cronjobs have a tendency to turn into a kind of poor-man’s background job solution. Check the crontab for any reasonably complex application and there’s a good chance you’ll see a one minute or five minute cronjob which looks in the database for work to be done. This can almost always be better done with a job queueing + workers system. Cron is for scheduling things, not doing them.

While cron will remain the ideal solution for system tasks like log rotation for some time to come, the above problems with application use of cron suggest that it might be time for a new scheduling solution for apps.

Cron Replacement Wishlist

My wishlist for a new app scheduling solution is:

Powerful and human-friendly syntax
Easy to test
Visibility
No difference between scheduler environment and one-off / test environment
Encourage use of a queueing system rather than doing the work directly in the scheduler
Scales without use of locks

Recently, the Flightcaster guys introduced me to resque-scheduler. With resque-scheduler, you make a yaml file of jobs to be scheduled. When each time specified is reached, the job will be queued via the Resque job queueing system.

What’s most interesting to me is that resque-scheduler runs in a standalone, long-running daemon process. Launch it like this:

$ rake resque:scheduler

The standalone process is a fascinating solution to the locks problem. Because there’s only one process, you don’t need any locks - an approach that sounds strikingly similar to the reasons for using async. A data format (yaml) rather than code prevents you from doing any work in the scheduler, since you can only specify the name of a job to queue. This enforces that the work will be done in the background workers, where it belongs. Since the scheduler process does no heavy lifting, there are no scalability issues.

For diagnostic/debug visibility, set up logging and exception handling (e.g. Exceptional, Hoptoad) exactly like you would for your web or worker processes. resque-scheduler also provides some extensions to the Resque web UI (screenshots at the bottom of this page) for additional visibility and control.

Generalizing the Single-Process Scheduler

Resque-scheduler still uses a cron-style syntax for specifying when jobs will run; and Resque is not my favorite queueing system anyway (I prefer dedicated MQ backends like RabbitMQ, Kestrel, and Beanstalk). But the single-process scheduler idea implemented by resque-scheduler can easily be applied to other queueing systems. For example, you could use rufus-scheduler in combination with Minion+RabbitMQ to write a scheduler process for your app. In a file called scheduler.rb:

require 'rufus/scheduler'
require 'minion'

scheduler = Rufus::Scheduler.start_new
scheduler.every('5m') { Minion.enqueue('twitter.refresh') }
scheduler.every('3h') { Minion.enqueue('feeds.refresh') }
scheduler.join

You’ve probably already defined or documented somewhere a list of processes needed to run your app. This may be one or more web processes (mongrel_cluster start, thin start, or unicorn start) and one or more worker processes (rake jobs:work, rake resque:work, or ruby minion.rb). Add to this list your new scheduler process:

ruby scheduler.rb >> log/scheduler.log 2>&1

Conclusion

While the single-process scheduler approach is still in its infancy, I believe it bears strong potential for the future of application cron.

URLs are the Uniform Way to Locate Resources

2010-03-30T16:06:38-07:00

When you hear the term URL, what do you think of? Probably a web address - e.g., a publicly accessible HTML page such as http://google.com/ or http://news.ycombinator.com/. But URLs have a much wider application.

URL stands for Uniform Resource Locator. Decoding this, a URL is a uniform (standard) way to locate (find) any resource (service) over a network (the internet or a LAN).

Any time you wish to locate a resource on the internet, use a URL.

Example: Git

If you use Git, then you’ve probably already encountered a non-HTTP URL: the Git protocol. For example, here’s the URL to the public Git repo for the Paperclip file attachment library:

git://github.com/thoughtbot/paperclip.git

A Git repo is not an HTML page, but it is a resource on a network, so using a URL makes perfect sense.

You could potentially encode this repo’s location in another way. For example, you could break it out into pieces and provide it in a JSON file:

{
  "protocol": "git",
  "host": "github.com",
  "username": "thoughtbot",
  "project": "paperclip"
}

Why don’t we use this format for locating Git resources? There are a few potential answers, such as the convenience of being able to easily cut-and-paste the location into a command line tool or a URL bar. But the best answer is that our ad-hoc JSON format is not uniform. The JSON above would work for locating Git resources on Github, but nowhere else. URLs are standard and uniform.

Example: Databases

Another great example is the location of a database. One approach is to have a long list of configuration values, probably copied into a file like config/database.yml by hand, one at a time. This format is probably specific to your ORM, i.e. not standard or uniform in any way. It’s the equivalent of the JSON address we used to specify a Git repo in the previous section.

Just like Git, the more elegant approach is to put everything needed to locate the database into a URL. This will typically look something like:

mysql://myuser:mypass@db8.myhost.com:3306/mydatabase

Ruby ORMs like Sequel and DataMapper use this very method. This makes configuring your database very simple:

Sequel.connect(the_database_url)

Beautiful.
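To see that the URL really does carry every value that would otherwise be typed into a config file, here is a minimal sketch using Ruby's standard URI library. The helper name database_settings and its hash keys are my own invention, not part of Sequel or DataMapper.

```ruby
require 'uri'

# Hypothetical helper: split a database URL into the individual settings
# that would otherwise be copied into config/database.yml by hand.
def database_settings(url)
  uri = URI.parse(url)
  {
    :adapter  => uri.scheme,
    :username => uri.user,
    :password => uri.password,
    :host     => uri.host,
    :port     => uri.port,
    :database => uri.path.sub(%r{\A/}, '')  # drop the leading slash
  }
end

settings = database_settings('mysql://myuser:mypass@db8.myhost.com:3306/mydatabase')
settings.each { |key, value| puts "#{key}: #{value}" }
```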

Yet More Examples: RabbitMQ, Email, Memcache

What else can we use URLs for? Anything that needs to be located on a network, be it the internet or a local network. For example, how about your RabbitMQ message queue?

amqp://user:pass@hostname/vhost

Or your SMTP mail server?

smtp://user:pass@hostname/domain

Or your Memcache server?

memcache://hostname/prefix

On this last item, you might point out that a Memcache cluster often has multiple hosts. Typically, these are specified in an array of IP addresses passed to the client object constructor. While this works, it’s not uniform. A better solution here is to use an internal hostname (such as memcache.internal.yourhost.com) which returns multiple A records, one per server in your cluster. The returned IPs may well be private addresses in the 10.x.x.x or 192.168.x.x ranges, not publicly addressable. In addition to allowing your memcache config to conform to the URL specification, this also gives the benefit of managing your server IPs in a single place: DNS. The alternative is hardcoding IPs into every component of your system that uses your memcache servers.
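A minimal sketch of how a client might expand such a URL into a server list, assuming Ruby's standard Resolv library for the DNS lookup. The helper name memcache_servers is hypothetical, the addresses in the stand-in resolver are made up, and the resolver is passed in as an argument only so the logic can be tried without a real internal DNS zone.

```ruby
require 'uri'
require 'resolv'

# Hypothetical helper: expand a memcache URL into "ip:port" strings, one per
# address the hostname resolves to. The resolver is injectable for offline use.
def memcache_servers(url, resolver = Resolv.method(:getaddresses))
  uri = URI.parse(url)
  port = uri.port || 11211  # memcached's default port when the URL has none
  resolver.call(uri.host).map { |ip| "#{ip}:#{port}" }
end

# Stand-in for DNS, returning made-up private addresses.
fake_dns = lambda { |_host| ['10.0.0.11', '10.0.0.12'] }
puts memcache_servers('memcache://memcache.internal.yourhost.com/prefix', fake_dns)
```

Called without the second argument, the helper asks the system resolver instead, so adding a server to the cluster becomes a DNS change rather than a code change.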

What About Extra Config Options?

If the protocol for a given resource requires additional config options, you can pass them as query parameters:

sqlite://development.sqlite3?encoding=utf8

I would urge you to think carefully before using query params. 99% of cases should be representable within the base URL.
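For the remaining cases, reading the extra options back out is straightforward. A small sketch with Ruby's standard URI library, using the sqlite URL above (the variable names are mine):

```ruby
require 'uri'

uri = URI.parse('sqlite://development.sqlite3?encoding=utf8')

# decode_www_form returns [key, value] pairs; to_h turns them into a Hash.
options = URI.decode_www_form(uri.query || '').to_h

puts uri.scheme
puts uri.host
puts options.inspect
```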

Summary

URLs are uniform. Use them to locate your resources.

Value-Creating Activities

2010-03-22T01:13:42-07:00

Inspired by the lean manufacturing revolution (and excellent books like Lean Thinking), I started with a first fundamental question: in a startup, what activities are value-creating and which are waste? Usually, new projects are measured and held accountable to milestones and deadlines. When a project is on track, on time, and on budget, our intuition is that it is being well managed. This intuition is dead wrong.

From Is Entrepreneurship a Management Science? by Eric Ries

Consuming the Twitter Streaming API

2010-03-19T11:01:54-07:00

If you’ve been using polling to track Twitter search terms (totally random example), you may have wondered if there is a more efficient and reliable method. The Twitter streaming API is a potential solution.

Try out the sample stream with curl:

$ curl http://stream.twitter.com/1/statuses/sample.json -uYOUR_TWITTER_USERNAME:YOUR_PASSWORD

Track a term in realtime, like “ruby”:

$ curl http://stream.twitter.com/1/statuses/filter.json?track=ruby -uYOUR_TWITTER_USERNAME:YOUR_PASSWORD

How do you integrate this into a Ruby app? Standard HTTP clients such as RestClient and HTTParty aren’t appropriate, since they’re designed for atomic HTTP requests, not streaming. With this API, you want to keep the socket open indefinitely, decoding JSON one line at a time.

Async I/O is the right tool for this job. Here’s an example script using Ilya Grigorik’s evented HTTP client. Install the em-http-request gem, then:

require 'eventmachine'
require 'em-http'
require 'json'

usage = "#{$0} <user> <password>"
abort usage unless user = ARGV.shift
abort usage unless password = ARGV.shift

url = 'http://stream.twitter.com/1/statuses/sample.json'

def handle_tweet(tweet)
  return unless tweet['text']
  puts "#{tweet['user']['screen_name']}: #{tweet['text']}"
end

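EventMachine.run do
  # Basic auth credentials are passed as a [user, password] pair.
  http = EventMachine::HttpRequest.new(url).get :head => { 'Authorization' => [ user, password ] }

  # Chunks can end mid-tweet, so accumulate them until a full line arrives.
  buffer = ""

  http.stream do |chunk|
    buffer += chunk
    # Pull off each complete, newline-terminated line and parse it as JSON.
    while line = buffer.slice!(/.+\r?\n/)
      handle_tweet JSON.parse(line)
    end
  end
end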

Run this at the command line with your Twitter username and password as arguments, and it will start printing out results. In a real app, you’d replace the body of handle_tweet with code to do something like inserting the result into your database.

Note that, even in a production app, you should never run more than one of these processes. It’s a background worker of sorts; you can think of the open socket as a queue that’s delivering jobs. But since this queue can’t split the work among multiple workers, you’re limited to just one.
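The line-at-a-time decoding is the part of the script most worth understanding, so here it is pulled out into a small class that can be tried without a network connection. This is a sketch rather than part of the original script: the class name LineBuffer is hypothetical, and skipping blank lines is my own choice, on the assumption that a stream may contain empty lines between records.

```ruby
require 'json'

# Hypothetical helper: accumulates raw chunks and hands back only the
# complete lines, keeping any trailing partial line for the next chunk.
class LineBuffer
  def initialize
    @buffer = String.new
  end

  def feed(chunk)
    @buffer << chunk
    lines = []
    while (line = @buffer.slice!(/.+\r?\n/))
      line = line.strip
      lines << line unless line.empty?  # ignore blank lines
    end
    lines
  end
end

lines = LineBuffer.new
# The second tweet is split across two chunks, as can happen on a socket.
first  = lines.feed(%({"text":"hi"}\r\n{"te))
second = lines.feed(%(xt":"yo"}\r\n))

(first + second).each { |json| puts JSON.parse(json)['text'] }
```

The first call returns one complete line and holds on to the fragment; the second call completes it.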

Alumni

2010-03-18T13:54:05-07:00

A company with a culture of quitting does not have ex-employees; they have alumni. This is far more than a semantic distinction. An alumni relationship is positive; something that people can take pride in; and one that keeps the door open for further opportunities on both ends.

From Up or Out: Solving the IT Turnover Crisis by Alex Papadimoulis

Salivation, Espresso Machines, and Tears

2010-03-17T20:39:11-07:00

Normally I’m not much for farewell posts (they’re metaposts, which I don’t like in general), but Joel Spolsky's pseudo-retirement shows a self-aware sense of humor that I respect:

What I am stopping is the traditional opinionated essay that has characterized Joel on Software for a decade. I’m not going to write Ten Ways to Get VCs to Salivate, I’m not going to write Why You Have To Buy a $10,000 Italian Espresso Machine for your Programmers, and I’m not going to write Python is For Aspergers Geeks or Ruby is for Tear-streaked Emo Teenagers. After a decade of this, the whole genre of Hacker News fodder is just too boring to me personally. It’s still a great format… the rest of you, knock yourselves out… I just can’t keep doing that particular thing.

Graph Databases

2010-03-15T21:30:15-07:00

Graph databases are a type of datastore which treat the relationship between things as equally important to the things themselves. Examples of datasets that are natural fits for graph databases:

Friend links on a social network
“People who bought this also bought…” Amazon-style recommendation engines
The world wide web

In graph database parlance, a thing (a person, a book, a website) is referred to as a “node,” while a relationship between two things (a friendship, a related book, an href) is referred to as an “edge.”

In most types of databases, the records stored in the database are nodes, and edges (relationships) are derived from a field on a node. In a SQL database, for example, you might have a table called “people” that includes a field “friend_id.” friend_id is a reference to another record in the people table.

The weakness with reference fields becomes apparent as soon as you want to do many-to-many relationships, or store data about the relationship. A person can have many friends; and you might want to track the date the friendship link was created, or whether the two people are married.

The solution to this in a SQL database is a join table. In the people/friends example, your join table might be called “friendships”. But this method has some weaknesses. One is that it can greatly increase the number of tables in your database, and may make it hard to tell apart standard tables (nodes) from join tables (edges) - which makes it more difficult for new developers to comprehend the database architecture. Another problem is that ORMs, which work quite well for mapping node (model) tables, generally have a much harder time mapping edges. (Witness all the thrashing about that happened during the development of has_many :through in ActiveRecord.)

But the biggest weakness is that queries against relationship data - be it in a join table or a reference link - are extremely unwieldy. In a SQL database it typically leads to recursive joins, which tend to lead to long, incomprehensible SQL statements and unpredictable performance.

A graph database is designed to represent this type of information, so it models the data more naturally. It’s also designed to query it: you can walk the data in a convenient and performant manner.
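To make the node and edge idea concrete, here is a toy in-memory sketch in plain Ruby. It is not a real graph database and does not reflect any particular product's API; the class, method names, and sample data are all hypothetical. It does show edges carrying their own data and a walk that would take recursive joins in SQL.

```ruby
require 'set'

# Toy graph: nodes are any Ruby values, edges are first-class records
# that carry their own properties (such as the date a friendship began).
class Graph
  Edge = Struct.new(:from, :to, :properties)

  def initialize
    @edges = Hash.new { |hash, node| hash[node] = [] }
  end

  # Friendships are mutual, so store the edge in both directions.
  def connect(a, b, properties = {})
    @edges[a] << Edge.new(a, b, properties)
    @edges[b] << Edge.new(b, a, properties)
  end

  def neighbors(node)
    @edges[node].map(&:to)
  end

  # Walk outward from a node, one hop at a time, up to max_hops.
  def within(start, max_hops)
    seen = Set[start]
    frontier = [start]
    max_hops.times do
      frontier = frontier.flat_map { |n| neighbors(n) }.uniq.reject { |n| seen.include?(n) }
      seen.merge(frontier)
    end
    seen.delete(start).to_a
  end
end

graph = Graph.new
graph.connect('alice', 'bob',   :since => '2009-06-01', :married => false)
graph.connect('bob',   'carol', :since => '2010-01-15', :married => true)
graph.connect('carol', 'dave',  :since => '2010-02-20', :married => false)

puts graph.within('alice', 2).inspect  # friends and friends-of-friends
```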

I’ve yet to try using a graph database, but the concept is intriguing. It’s yet another reminder that not every data modeling problem can be solved with the same hammer.

Grown, Not Built

2010-03-14T11:40:19-07:00

We just don’t write or release software the way we used to. Software isn’t so much built as it is grown. Software isn’t shipped … it’s simply made available by, often literally, the flip of a switch. This is not your father’s software. 21st century development is a seamless path from innovation to release where every phase of development, including release, is happening all the time. Users are on the inside of the firewall in that respect and feedback is constant. If a product isn’t compelling we find out much earlier and it dies in the data center. I fancy these dead products serve to enrich the data center, a digital circle of life where new products are built on the bones of the ones that didn’t make it.

From Testing in the Data Center (Manufacturing No More) by James A. Whittaker

An HTML5 Offline App Example

2010-02-25T09:42:33-08:00

If you’ve used GMail, Google Calendar, or other Google web apps on the iPhone, you’ve probably noticed that they store the app code in a local cache. Only the messages (or day’s events, or other dynamic data) are fetched when you load the app. This is because they use HTML5’s capabilities for offline caching.

The HTML5 draft has a simple clock example, which shows how you can specify which files should be cached locally for offline use using something called a cache manifest.

I turned their example code into an app deployable to Heroku. Here's the live demo. It should work in recent versions of Firefox (which prompts you to allow offline storage) and Safari (which doesn’t). Chrome doesn’t seem to support it yet.

Basically this boils down to some static HTML, CSS, and javascript; the cache manifest is the one additional piece of the puzzle, which tells which files to cache. Its format is extremely simple:

CACHE MANIFEST
clock.html
clock.css
clock.js

The one potential gotcha is that the cache manifest has to be served with content type text/cache-manifest. You can verify the content type is correct with curl:

$ curl -I http://cachemanifest.heroku.com/clock.manifest
HTTP/1.1 200 OK
Server: nginx/0.6.39
Date: Thu, 25 Feb 2010 02:53:24 GMT
Content-Type: text/cache-manifest
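One way to get that header from a Ruby app is a Rack-style endpoint. The following is a minimal sketch, not the code behind the demo above; the constant names are hypothetical, and it relies only on the Rack convention that an app responds to call(env) and returns a status, headers, and body.

```ruby
# The manifest contents from above, served from memory for simplicity.
MANIFEST = "CACHE MANIFEST\nclock.html\nclock.css\nclock.js\n"

# A Rack-style app: takes the request env, returns [status, headers, body].
ManifestApp = lambda do |env|
  if env['PATH_INFO'] == '/clock.manifest'
    [200, { 'content-type' => 'text/cache-manifest' }, [MANIFEST]]
  else
    [404, { 'content-type' => 'text/plain' }, ["Not found\n"]]
  end
end

# Exercise the app directly, without starting a server.
status, headers, _body = ManifestApp.call({ 'PATH_INFO' => '/clock.manifest' })
puts status
puts headers['content-type']
```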

Above the Water

2010-02-22T14:04:39-08:00

A PaaS [Platform as a Service] environment is a bit like a swan on a pond – graceful and elegant above the water, and paddling its little legs off below the water. The aforementioned abstraction provides the elegant user experience “above the water,” while high levels of automation provide the “paddling” beneath the surface.

From Don't Pass on PaaS by Sam Charrington

Uncertainty

2010-02-11T22:52:57-08:00

Kevin Kelly writes on how the internet has changed how he thinks:

Uncertainty is a kind of liquidity. I think my thinking has become more liquid. It is less fixed, as text in a book might be, and more fluid, as say text in Wikipedia might be. My opinions shift more. My interests rise and fall more quickly. I am less interested in Truth, with a capital T, and more interested in truths, plural. I feel the subjective has an important role in assembling the objective from many data points. The incremental plodding progress of imperfect science seems the only way to know anything.