Why incidents can’t be monocausal

When an incident happens, the temptation is strong to identify a single cause. It’s as if the system were a chain, and we’re looking for the weak link responsible for the chain breaking. But in organizations that are going concerns, that isn’t how the system works. It can’t be, because there are simply too many things that can and do go wrong. Think of all the touch points in your system and how many opportunities there are for problems (bugs, a typo in a config file, bad data, …). If any one of these were enough to take down the system, the whole thing would be like a house of cards, falling down all of the time.

What happens in successful organizations is that the system evolves layers of defense, so that it can survive the kinds of individual problems that are always cropping up. Sure, the system still goes down, and more often than we would like. But the uptime is good enough that the company continues to survive.

Here’s an analogy that I’m borrowing from John Allspaw. Think about a significant new feature or service that your organization delivered successfully: one that took multiple quarters and required the collaboration of multiple teams. I’d wager that there were many factors that contributed to the success of this effort. Imagine if someone asked you: “What was the root cause of the success of this feature?”

So it is with incidents. Because an organization can’t prevent the occurrence of individual problems, the system evolves defenses to protect itself, created by the everyday work of the people in the company. Sure, the code we write might not even compile on the first try, but somehow the code that made it out to production is running well enough that the company is still in business. People are doing checks on the system all of the time, and most of this work is invisible.

For an incident to happen, multiple factors must combine to penetrate the layers of defense that have evolved. I say that with confidence, because if a single event could take your system down, the system never would have made it this far to begin with. That’s why, when you dig into an incident, you’ll always find multiple contributors.

Elegance: UI vs implementation

If you ask the question, “What is a Docker container?”, it turns out that the Linux operating system doesn’t actually have a notion of a container at all. Instead, a Docker container refers to a cobbled-together set of Linux technologies, such as cgroups, network namespaces, and union filesystems. From the point of view of the end user, however, a container is very much a real thing. In particular, Docker exposes images to the user as a first-class entity, along with a command-line tool for pulling images down from a repository and running them.

The Docker container implementation may be built (in the Unix tradition!) with duct tape and baling wire, but the user interface is elegant. It’s easy for a new user to get started with Docker once they’ve installed it. Bryan Cantrill points out that the advantage of Docker over container technologies developed in the BSD world is Docker’s notion of images as effectively static binaries that allow developers to think operationally and move faster.
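
To make the “no container in the kernel” point concrete, here is a minimal sketch in Go of one of those cobbled-together pieces: it launches a shell inside fresh UTS, PID, and mount namespaces. This is only an illustration of the kind of kernel primitive Docker composes; it is Linux-only, needs root, and leaves out cgroups, the union filesystem, images, and everything else a real runtime handles.

```go
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Run a shell, but ask clone(2) to give it new UTS, PID, and mount
	// namespaces, a few of the Linux features that, stitched together with
	// cgroups and a union filesystem, become what users call "a container".
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```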

Contrast that with git. The implementation is quite elegant: git represents the data under version control as a hash tree, with pointers to nodes in the tree. Git commands are tree manipulations: adding leaf nodes, moving branches from one part of the tree to another, smooshing nodes together, and so on. (I used Subversion for years and had no idea what was going on under the hood.)
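
As a rough sketch of that content-addressed design (illustrative only, using a file that contains “hello” and a newline), here is how git names a blob: the object’s identity is just the SHA-1 of a short header plus the contents, which is what lets every node in the tree refer to its children by hash.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

func main() {
	content := []byte("hello\n")
	// Git stores a blob as "blob <size>\x00<content>" and names the object
	// by the SHA-1 of that byte string.
	store := append([]byte(fmt.Sprintf("blob %d\x00", len(content))), content...)
	fmt.Printf("%x\n", sha1.Sum(store))
	// This prints the same ID that `git hash-object` reports for a file
	// with the same contents.
}
```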

On the other hand, the command-line interface that git exposes is a nightmare. It’s so hard to use that trying to build a more usable command-line interface is a full-blown academic research project.

The elegance of a user interface and the elegance of an implementation are orthogonal. One doesn’t necessarily lead to the other.

Postmodern engineering

When I was younger, I wanted to be a physicist. I ended up majoring in computer engineering, because I also wanted gainful employment, but my heart was always in physics, and computer engineering seemed like a good compromise between my love of physics and early interest in computers.

I didn’t think too deeply about the philosophy of science back then, but my beliefs were in line with the school of positivism. I believed that there was a single underlying reality, that the nature of this reality was potentially knowable, and that science was an effective tool for understanding that reality. I was vaguely aware of the postmodernist movement, but mostly through reading about the Sokal hoax, in which the physicist Alan Sokal had demonstrated that postmodernism was nonsense.

Around the same time, I also read To Engineer is Human: The Role of Failure in Successful Design by the civil engineering researcher Henry Petroski. The book is a series of case studies of how civil engineering advanced through understanding structural failures. Success, on the other hand, teaches the engineer nothing.

Many years later, I find myself operationally a postmodernist (although constructivist might be a more accurate term). When I study how incidents happen, I no longer believe that there is a single, underlying reality of what really happened that we can access. Instead, I believe that the best we can do is construct narratives based on the perspectives of the different people who were involved in the incident. These narratives will inevitably be partial, and some of them may conflict. And there are things that we will never really know or understand. In addition, contra Petroski, I believe that we can learn from studying successes as well as failures.

I suspect that most engineers are steeped in the positivist tradition of thinking as well. This change in perspective is a big one: I’m not even sure how my own thinking evolved over time, and so I don’t know how to encourage this shift in others. But I do believe that if we want to learn as much as we can from incidents, we need to work on changing how our fellow engineers think about what is knowable. And that’s a tall order.

The seductiveness of single-metric decisions

Making decisions is hard.

One technique to help with making a decision is to compute a single metric for each of the options under consideration, and then compare those values. A common choice of metric is dollars, or ROI (return on investment), which is a unitless ratio of dollars. Are you trying to decide between two internal software development projects? Estimate the ROI for each one and pick the larger. OKRs (objectives and key results) and error budgets are two other examples of using individual metrics to drive decisions like “where should we focus our effort now?” or “can we push this new feature to production?”

A single-metric-based approach has the virtue of simplifying the final stage in the decision-making process: we simply compare two numbers (either two metrics or a metric against a threshold) in order to make our decision. Yes, it requires mapping the different factors under consideration onto the metric, but it’s tractable, right?

The problem is that the process of mapping the relevant factors onto a single metric always involves subjective judgments that ultimately discard information. For ROI calculations, for example, think of the work involved in identifying the various kinds of costs and benefits and translating them into dollars. Information that should be taken into account when making the final decision vanishes along the way, as those factors get collapsed into an anemic scalar value.
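
Here is a toy sketch of that information loss, with made-up numbers: two projects with very different scale and risk profiles can map onto the same ROI, and the differences that actually matter are no longer visible in the comparison.

```go
package main

import "fmt"

// roi is the unitless ratio (benefit - cost) / cost, with both in dollars.
func roi(benefit, cost float64) float64 {
	return (benefit - cost) / cost
}

func main() {
	// Hypothetical projects: a small, low-risk tooling improvement versus a
	// large, speculative re-architecture. Risk, scale, and opportunity cost
	// never make it into the scalar.
	fmt.Println(roi(150_000, 100_000))     // 0.5
	fmt.Println(roi(1_500_000, 1_000_000)) // 0.5
}
```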

The problem here isn’t the use of metrics. Rather, it’s the temptation to squeeze all of the relevant information into a form that is representable in a single metric. A single metric frees the decision maker from having to make a subjective judgment that involves very different-looking factors. That’s a hard thing to do, and it can make people uncomfortable.

W. Edwards Deming was famous for railing against numerical targets. Note that he wasn’t opposed to metrics (he advocated for the value of professional statisticians and control charts). Rather, he was opposed to decisions that were made based on single metrics. Here are some quotes from his book Out of the Crisis on this topic:

Focus on outcome (management by numbers, MBO, work standards, meet specifications, zero defects, appraisal of performance) must be abolished, leadership put in place.

Eliminate management by objective. Eliminate management by numbers, numerical goals. Substitute leadership.

[M]anagement by numerical goal is an attempt to manage without knowledge of what to do, and in fact is usually management by fear.

Deming uses the term “leadership” as the alternative to the decision-by-single-metric approach. I interpret that term as the ability of a manager to synthesize information from multiple sources in order to make a decision holistically. It’s a lot harder than mapping all of the factors into a single metric. But nobody ever said being an effective leader is easy.

Root cause: line in Shakespearean play

News recently broke about the crash of Ethiopian Airlines Flight 302. This post is about a different plane crash: Eastern Airlines Flight 375, in 1960. Flight 375 crashed on takeoff from Logan Airport in Boston when it flew into a flock of birds. More specifically, in the words of Michael Kalafatas, it “slammed into a flock of ten thousand starlings.”

The starling isn’t native to North America. An American drug manufacturer named Eugene Schieffelin made multiple attempts to bring different species of birds over to the U.S. Many of his efforts failed, but he was successful at bringing starlings over from Europe, releasing sixty European starlings in 1890 and another forty in 1891. Nate Dimeo recounts the story of the release of the sixty starlings in New York’s Central Park in episode 138 of the Memory Palace podcast.

Schieffelin was interested in starlings because he wanted to bring over all of the birds mentioned in Shakespeare’s plays. The starling is mentioned only once in Shakespeare’s works: in Henry IV, Part I, in a line uttered by Sir Henry Percy:

Nay, I will; that’s flat: 
He said he would not ransom Mortimer; 
Forbad my tongue to speak of Mortimer;
But I will find him when he lies asleep, 
And in his ear I’ll holla ‘Mortimer!’ 
Nay, 
I’ll have a starling shall be taught to speak 
Nothing but ‘Mortimer,’ and give it him
To keep his anger still in motion.

The story is a good example of the problems of using causal language to talk about incidents. I doubt an accident investigation report would list “line in 16th century play” as a cause. And, yet, if Shakespeare had not included that line in the play, or had substituted a different bird for a starling, the accident would not have happened.

Of course, this type of counterfactual reasoning isn’t useful at all, but that’s exactly the point. Whenever we start with an incident, we can always go further back in time and play “for want of a nail”: the place where we stop is determined by factors such as the time constraints of the investigation and the available information. Neither of those factors is a property of the incident itself.

William Shakespeare didn’t cause Flight 375 to crash, because “causes” don’t exist in the world. Instead, we construct causes when we look backwards from incidents. We do this because of our need to make sense of the world. But the world is a messy, tangled web of interactions. Those causes aren’t real. It’s only by moving beyond the notion of causes that we can learn more about how those incidents came to be.

The danger of “insufficient virtue”

Nate Dimeo hosts a great storytelling podcast called The Memory Palace, where each episode is a short historical vignette. Episode 316: Ten Fingers, Ten Toes is about how people have tried to answer the question: “why are the bodies of some babies drastically different from the bodies of all others?”

The stories in this podcast usually aren’t personal, but this episode is an exception. Dimeo recounts how his great-aunt, Anna, was born without fingers on her left hand. Anna’s mother (Dimeo’s great-grandmother) blamed herself: when pregnant, she had been startled by a salesman knocking on the back door, and had bitten her knuckles. She had attributed the birth defect to her knuckle-biting.

We humans seem to be wired to attribute negative outcomes to behaving insufficiently virtuously. This is particularly apparent in the writing style of many management books. Here are some quotes from a book I’m currently reading.

For years, for example, American manufacturers thought they had to choose between low cost and high quality… They didn’t realize that they could have both goals, if they were willing to wait for one while they focused on the other.

Whenever a company fails, people always point to specific events to explain the “causes” of the failure: product problems, inept managers, loss of key people, unexpectedly aggressive competition, or business downturns. Yet, the deeper systemic causes for unsustained growth go unrecognized.

Why wasn’t that balancing process noticed? First, WonderTech’s financially oriented top management did not pay much attention to their delivery service. They mainly tracked sales, profits, return on investment, and market share. So long as these were healthy, delivery times were the least of their concerns.

Such litanies of “negative visions” are sadly commonplace, even among very successful people. They are the byproduct of a lifetime of fitting in, of coping, of problem solving. As a teenager in one of our programs once said, “We shouldn’t call them ‘grown ups,’ we should call them ‘given ups.’”

Peter Senge, The Fifth Discipline

In this book (The Fifth Discipline), Senge associates the principles he is advocating for (e.g., systems thinking, personal mastery, shared vision) with virtue, and the absence of these principles with vice. The book is filled with morality tales of the poor fates of companies due to insufficiently virtuous executives, to the point where I feel like I’m reading Goofus and Gallant comics.

This type of moralized thinking, where poor outcomes are caused by insufficiently virtuous behavior, is a cancer on our ability to understand incidents. It’s seductive to blame an incident on someone being greedy (an executive) or sloppy (an operator) or incompetent (a software engineer). Just think back to your reactions to incidents like the Equifax Data Breach or the California wildfires.

The temptation to attribute responsibility when bad things happen is overwhelming. You can always find greed, sloppiness, and incompetence if that’s what you’re looking for. We need to fight that urge. When trying to understand how an incident happened, we need to assume that all of the people involved were acting reasonably given the information they had at the time. That assumption makes the difference between explaining incidents away and learning from them.

(Oh, and you’ll probably want to check out The Field Guide to Understanding ‘Human Error’ by Sidney Dekker).

Notes on David Woods’s Resilience Engineering short course

David Woods has a great series of free online lectures on resilience engineering. After watching those lectures, a lot of the material clicked for me in a way that it never really did from reading his papers.

Woods writes about systems at a very general level: the principles he describes could apply to cells, organs, organisms, individuals, teams, departments, companies, ecosystems, socio-technical systems, pretty much anything you could describe using the word “system”. This generality means that he often uses abstract concepts, which apply to all such systems. For example, Woods talks about units of adaptive behavior, competence envelopes, and florescence. Abstractions that apply in a wide variety of contexts are very powerful, but reading about them is often tough going (cf. category theory).

In the short course lectures, Woods really brings these concepts to life. He’s an animated speaker (especially when you watch him at 2X speed). It’s about twenty hours of lectures, and he packs a lot of concepts into those twenty hours.

I made an effort to take notes as I watched the lectures. I’ve posted my notes to GitHub. But, really, you should watch the videos yourself. It’s the best way to get an overview of what resilience engineering is all about.