CGF (Cascading Global Failure)

ops

Sun Jun 28 12:51:55 -0700 2009

My partner Orion Henry coined the term Cascading Global Failure, or CGF, to describe a type of catastrophic situation in production software deployments. (The term is inspired by the movie No Country For Old Men, in a line spoken by Tommy Lee Jones as he surveys a scene of mayhem and carnage early in the movie.)

CGF is not the right label for ordinary system failures. An ordinary failure has a single cause and one or more symptoms that are limited in scope, though not necessarily in damage. Ordinary failures may have serious symptoms - like your site being unavailable - but the diagnosis is straightforward. For example: your entire site throws a 500 because you failed to run your database migrations. This is a simple, though fairly catastrophic, failure, and it would not qualify as a CGF.

A CGF is a failure in which many small problems overlap, creating a series of effects that cascade through the entire system and produce symptoms which may seem entirely unrelated to the root causes. For example: a bug in one daemon causes it to spew messages, which in turn causes the message queue to get backed up, which exposes a previously unknown timing issue in another daemon, causing it to send REST calls out of order; the app that receives those calls doesn’t verify its input closely enough, and starts writing corrupt records into the database.
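To make the last link in that chain concrete, here is a minimal sketch of a receiving app that does verify its input, so out-of-order or malformed calls are rejected and logged rather than written to the database. Everything in it is an assumption made for illustration: the `Call` and `Receiver` names, the per-sender sequence number, the required fields, and the in-memory dict standing in for the database. It is not how any particular system described here was built.

```python
from dataclasses import dataclass


@dataclass
class Call:
    seq: int          # hypothetical sequence number assigned by the sender
    record_id: str
    payload: dict


class Receiver:
    """Rejects stale or malformed calls instead of storing them."""

    REQUIRED_FIELDS = {"name", "amount"}  # hypothetical schema

    def __init__(self):
        self.last_seq = 0
        self.db = {}        # stand-in for the real database
        self.rejected = []  # (seq, reason) pairs, kept for later diagnosis

    def handle(self, call: Call) -> bool:
        # A call whose sequence number is not newer than the last accepted
        # one arrived out of order; drop it rather than overwrite newer data.
        if call.seq <= self.last_seq:
            self.rejected.append((call.seq, "out of order"))
            return False
        # Refuse payloads that are missing required fields.
        missing = self.REQUIRED_FIELDS - set(call.payload)
        if missing:
            self.rejected.append((call.seq, f"missing fields: {sorted(missing)}"))
            return False
        self.last_seq = call.seq
        self.db[call.record_id] = call.payload
        return True


receiver = Receiver()
receiver.handle(Call(1, "r1", {"name": "a", "amount": 1}))
receiver.handle(Call(3, "r1", {"name": "c", "amount": 3}))
receiver.handle(Call(2, "r1", {"name": "b", "amount": 2}))  # stale, rejected
```

A check like this does not fix the upstream bugs, but it keeps the cascade from reaching the database, and the `rejected` list leaves a trail that points back toward the daemon sending calls out of order.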

CGFs can be maddening to debug, because there is no single cause. The issues which combine to create the CGF are usually totally unrelated - often a bug or multiple bugs that have been lurking in the code for months are exposed by a recent code change or change in the system load. In some cases, a CGF can become a strange loop: the tools that you would use to diagnose the issues are themselves slow, showing wrong information, or totally non-functional as another symptom of the CGF.

Thanks to the Coen brothers for providing us with this descriptive term.