Root cause: line in Shakespearean play

News recently broke about the crash of Ethiopian Airlines Flight 302. This post is about a different plane crash, Eastern Airlines Flight 375, in 1960. Flight 375 crashed on takeoff from Logan Airport in Boston when it flew into a flock of birds. More specifically, in the words of Michael Kalafatas, it “slammed into a flock of ten thousand starlings”.

The starling isn’t native to North America. An American drug manufacturer named Eugene Schieffelin made multiple attempts to bring over different species of bird to the U.S. Many of his efforts failed, but he was successful at bringing starlings over from Europe, releasing sixty European starlings in 1890 and another forty in 1891. Nate Dimeo recounts the story of the release of the sixty starlings in New York’s Central Park in episode 138 of the Memory Palace podcast.

Schieffelin’s interest in starlings stemmed from his goal of bringing over all of the birds mentioned in Shakespeare’s plays. The starling is mentioned only once in Shakespeare’s works: in Henry IV, Part I, in a line uttered by Sir Henry Percy:

Nay, I will; that’s flat: 
He said he would not ransom Mortimer; 
Forbad my tongue to speak of Mortimer;
But I will find him when he lies asleep, 
And in his ear I’ll holla ‘Mortimer!’ 
Nay, 
I’ll have a starling shall be taught to speak 
Nothing but ‘Mortimer,’ and give it him
To keep his anger still in motion.

The story is a good example of the problems of using causal language to talk about incidents. I doubt an accident investigation report would list “line in 16th century play” as a cause. And, yet, if Shakespeare had not included that line in the play, or had substituted a different bird for a starling, the accident would not have happened.

Of course, this type of counterfactual reasoning isn’t useful at all, but that’s exactly the point. Whenever we start with an incident, we can always go further back in time and play “for want of a nail”: the place where we stop is determined by factors such as the time constraints of the investigation and the available information. Neither of those factors is a property of the incident itself.

William Shakespeare didn’t cause Flight 375 to crash, because “causes” don’t exist in the world. Instead, we construct causes when we look backwards from incidents. We do this because of our need to make sense of the world. But the world is a messy, tangled web of interactions. Those causes aren’t real. It’s only by moving beyond the notion of causes that we can learn more about how those incidents came to be.

The danger of “insufficient virtue”

Nate Dimeo hosts a great storytelling podcast called The Memory Palace, where each episode is a short historical vignette. Episode 316: Ten Fingers, Ten Toes is about how people have tried to answer the question: “why are the bodies of some babies drastically different from the bodies of all others?”

The stories in this podcast usually aren’t personal, but this episode is an exception. Dimeo recounts how his great-aunt, Anna, was born without fingers on her left hand. Anna’s mother (Dimeo’s great-grandmother) blamed herself: when pregnant, she had been startled by a salesman knocking on the back door, and had bitten her knuckles. She had attributed the birth defect to her knuckle-biting.

We humans seem to be wired to attribute negative outcomes to insufficiently virtuous behavior. This is particularly apparent in the writing style of many management books. Here are some quotes from a book I’m currently reading.

For years, for example, American manufacturers thought they had to choose between low cost and high quality… They didn’t realize that they could have both goals, if they were willing to wait for one while they focused on the other.

Whenever a company fails, people always point to specific events to explain the “causes” of the failure: product problems, inept managers, loss of key people, unexpectedly aggressive competition, or business downturns. Yet, the deeper systemic causes for unsustained growth go unrecognized.

Why wasn’t that balancing process noticed? First, WonderTech’s financially oriented top management did not pay much attention to their delivery service. They mainly tracked sales, profits, return on investment, and market share. So long as these were healthy, delivery times were the least of their concerns.

Such litanies of “negative visions” are sadly commonplace, even among very successful people. They are the byproduct of a lifetime of fitting in, of coping, of problem solving. As a teenager in one of our programs once said, “We shouldn’t call them ‘grown ups,’ we should call them ‘given ups.’”

Peter Senge, The Fifth Discipline

In this book (The Fifth Discipline), Senge associates the principles he is advocating for (e.g., systems thinking, personal mastery, shared vision) with virtue, and the absence of these principles with vice. The book is filled with morality tales of the poor fates of companies due to insufficiently virtuous executives, to the point where I feel like I’m reading Goofus and Gallant comics.

This type of moralized thinking, where poor outcomes are caused by insufficiently virtuous behavior, is a cancer on our ability to understand incidents. It’s seductive to blame an incident on someone being greedy (an executive) or sloppy (an operator) or incompetent (a software engineer). Just think back to your reactions to incidents like the Equifax Data Breach or the California wildfires.

The temptation to attribute responsibility when bad things happen is overwhelming. You can always find greed, sloppiness, and incompetence if that’s what you’re looking for. We need to fight that urge. When trying to understand how an incident happened, we need to assume that all of the people involved were acting reasonably given the information they had at the time. It means the difference between explaining incidents away and learning from them.

(Oh, and you’ll probably want to check out the Field Guide to Understanding ‘Human Error’ by Sidney Dekker).

Notes on David Woods’s Resilience Engineering short course

David Woods has a great series of free online lectures on resilience engineering. After watching those lectures, a lot of the material clicked for me in a way that it never really did from reading his papers.

Woods writes about systems at a very general level: the principles he describes could apply to cells, organs, organisms, individuals, teams, departments, companies, ecosystems, socio-technical systems, pretty much anything you could describe using the word “system”. This generality means that he often uses abstract concepts, which apply to all such systems. For example, Woods talks about units of adaptive behavior, competence envelopes, and florescence. Abstractions that apply in a wide variety of contexts are very powerful, but reading about them is often tough going (cf. category theory).

In the short course lectures, Woods really brings these concepts to life. He’s an animated speaker (especially when you watch him at 2X speed). It’s about twenty hours of lectures, and he packs a lot of concepts into those twenty hours.

I made an effort to take notes as I watched the lectures. I’ve posted my notes to GitHub. But, really, you should watch the videos yourself. It’s the best way to get an overview about what resilience engineering is all about.

Our brittle serverless future

I’m really enjoying David Woods’s Resilience Engineering short course videos. In Lecture 9, Woods mentions an important ingredient in a resilient system: the ability to monitor how hard you are working to stay in control of the system.

I was thinking about this observation in the context of serverless computing. With serverless, software engineers offload the responsibility of resource management to a third-party provider, which handles it transparently for them. No more thinking in terms of servers, instance types, CPU utilization, and memory usage!

The challenge is this: from the perspective of a customer of a serverless provider, you don’t have visibility into how hard the provider is working to stay in control. If the underlying infrastructure is nearing some limit (e.g., amount of incoming traffic it can handle), or if it’s operating in degraded mode because of an internal failure, these challenges are invisible to you as a customer.

Woods calls this phenomenon the veil of fluency. From the customer’s perspective, everything is fine. Your SLOs are all still being met! However, from the provider’s perspective, the system may be very close to the boundary, the point where it falls over.

Woods also talks about the importance of reciprocity in resilient organizations: how different units of adaptive behavior synchronize effectively when a crunch happens and one of them comes under pressure. In a serverless environment, you lose reciprocity because there’s a hard boundary between the serverless provider and a customer. If your system is deployed in a serverless environment, and a major incident happens where the serverless system is a contributing factor, nobody from your serverless provider is going to be in the Slack channel or on the conference bridge.

I think Simon Wardley is correct in his prediction that serverless is the future of software deployment. The tools are still immature today, but they’ll get there. And systems built on serverless will likely be more robust, because the providers will have more expertise in resource management and fault tolerance than their customers do.

But every system eventually reaches its limit. One day a large-scale serverless-based software system is going to go past the limit of what it can handle. And when it breaks, I think it’s going to break quickly, without warning, from the customer’s perspective. And you won’t be able to coordinate with the engineers at your serverless provider to bring the system back into a good state, because all you’ll have are a set of APIs.

Inductive invariants

Note: to understand this post, it helps to have some familiarity with TLA+. 

Leslie Lamport is a big advocate of inductive invariants. In Teaching Concurrency, he introduces the following simple concurrent algorithm:

Consider N processes numbered from 0 through N−1 in which each process i executes:

x[i] := 1;
y[i] := x[(i−1) mod N]

and stops, where each x[i] initially equals 0.

Lamport continues:

 This algorithm satisfies the following property: after every process has stopped, y[i] equals 1 for at least one process i.

The algorithm satisfies this property because it maintains an inductive invariant [emphasis added]. Do you know what that invariant is? If not, then you do not completely understand why the algorithm satisfies this property. How can a computer engineer design a correct concurrent system without understanding it?
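Before looking for the inductive invariant, you can at least convince yourself that the property holds by brute force. Here is a quick sketch in Python (not TLA+, and not a proof, just an exhaustive check of every interleaving for small N), under the assumption that each of the two assignments is atomic:

```python
from itertools import permutations

def run(schedule, n):
    """Execute one interleaving of Lamport's algorithm. Each process i has
    two atomic steps, taken in order:
        step 0:  x[i] := 1
        step 1:  y[i] := x[(i-1) mod n]
    `schedule` lists process ids; each occurrence of i runs i's next step."""
    x, y, pc = [0] * n, [0] * n, [0] * n
    for i in schedule:
        if pc[i] == 0:
            x[i] = 1
        else:
            y[i] = x[(i - 1) % n]
        pc[i] += 1
    return y

def property_holds(n):
    """Check that every interleaving leaves y[i] = 1 for at least one i."""
    steps = [i for i in range(n) for _ in range(2)]  # two steps per process
    return all(1 in run(s, n) for s in set(permutations(steps)))

print(property_holds(3))
```

The check passes for small N: whichever process performs the very last read, every other process has already finished both of its steps, so the value it reads is 1. That observation is also a hint toward the inductive invariant Lamport has in mind.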

Before I read up on TLA+, I was familiar with the idea of an invariant, a property that is always true throughout the execution of a program. But I didn’t know what an inductive invariant is.

Mathematical induction

An inductive invariant of a specification is an invariant you can prove using mathematical induction. You probably had to do proofs by induction in college or high school math.

To refresh your memory, let’s say you were asked to prove the following formula: 1 + 2 + ⋯ + n = n(n+1)/2.

The way this is proved by induction is:

  1. Base case: Prove it for n=1
  2. Inductive case: Assume it is true for n, and prove that it holds for n+1
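To make the two steps concrete, take the classic formula for the sum of the first n positive integers (written out here in LaTeX):

```latex
% Claim: 1 + 2 + \cdots + n = \frac{n(n+1)}{2} for all n \geq 1.
%
% Base case (n = 1):
\[ 1 = \frac{1 \cdot (1 + 1)}{2}. \]
% Inductive case: assume the formula holds for n; then
\[ \sum_{k=1}^{n+1} k = \frac{n(n+1)}{2} + (n+1) = \frac{(n+1)(n+2)}{2}, \]
% which is the claimed formula with n replaced by n + 1.
```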

In TLA+, you typically specify the allowed behaviors of a specification by writing two predicates:

  1. The initial conditions, conventionally written as Init
  2. The allowed state transitions (which are called steps in TLA+), conventionally written as Next

Let’s say we have a state predicate, which I’ll call Inv, that we believe is an inductive invariant. To prove that Inv is an inductive invariant, you need to:

  1. Base case: Prove that Inv is true when Init is true
  2. Inductive case: Assume that Inv holds for an arbitrary state s, and prove that Inv holds for any state t where the step s→t is one of the state transitions permitted by Next.

In TLA+ syntax, the top-level structured proof of an inductive invariant looks like this:

THEOREM Spec=>[]Inv
<1>1. Init => Inv
<1>2. Inv /\ [Next]_vars => Inv'
<1>3. QED
BY <1>1,<1>2

In practice, you want to prove some invariant that is not an inductive invariant. The proof is typically structured like this:

Correct == ...        \* The invariant you really want to prove
Inv == ... /\ Correct \* the inductive invariant

THEOREM Spec=>[]Correct
<1>1. Init => Inv
<1>2. Inv /\ [Next]_vars => Inv'
<1>3. Inv => Correct
<1>4. QED
BY <1>1, <1>2, <1>3

Not all invariants are inductive

If you’re using TLA+ to check the behavior of a system, you’ll have an invariant that you want to check, often referred to as a correctness property or a safety property. However, not all such invariants are inductive invariants. They usually aren’t.

As an example, here’s a simple algorithm that starts a counter at either 1 or 2, and then decrements it until it reaches 0. The algorithm is written in PlusCal notation.

--algorithm Decrement
variable i \in {1, 2};
begin
Outer:
  while i > 0 do
Inner:
    i := i - 1;
  end while;
end algorithm

It should be obvious that i≥0 is an invariant of this specification. Let’s call it Inv. But Inv isn’t an inductive invariant. To understand why, consider the following program state:

pc="Inner"
i=0

Here, Inv is true, but Inv' is false, because on the next step of the behavior, i will be decremented to -1.
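To see this concretely, here is a small brute-force check, sketched in Python rather than TLA+ (the (pc, i) state representation and the strengthened invariant below are my own, not part of the spec above):

```python
# States of the Decrement algorithm modeled as (pc, i) pairs.

def next_states(state):
    """The transitions permitted by Next, ignoring reachability."""
    pc, i = state
    if pc == "Outer":                 # the while test
        return [("Inner", i)] if i > 0 else [("Done", i)]
    if pc == "Inner":                 # the decrement
        return [("Outer", i - 1)]
    return []                         # "Done": no further steps

def is_inductive(inv, universe):
    """inv is inductive if every permitted step from a state satisfying
    inv lands in a state satisfying inv."""
    return all(inv(t)
               for s in universe if inv(s)
               for t in next_states(s))

universe = [(pc, i) for pc in ("Outer", "Inner", "Done")
                    for i in range(-2, 4)]

inv = lambda s: s[1] >= 0  # i >= 0: true of every reachable state...
print(is_inductive(inv, universe))     # False: ("Inner", 0) steps to i = -1

# One inductive strengthening: i >= 0, and additionally i >= 1 at "Inner".
inv_ind = lambda s: s[1] >= 0 and (s[0] != "Inner" or s[1] >= 1)
print(is_inductive(inv_ind, universe))  # True
```

The strengthened predicate rules out the troublesome state ("Inner", 0) up front, which is exactly what strengthening an invariant into an inductive one does.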

Consider the above diagram, which is associated with some TLA+ specification (not the one described above). A bubble represents a state in TLA+. A bubble is colored gray if it is an initial state in the specification (i.e., if the Init predicate is true of that state). An arrow represents a step, where the Next action is true for the pair of states associated with the arrow.

The inner region shows all of the states that appear in the allowed behaviors of a specification. The outer region represents an invariant: it contains all states where the invariant holds. The middle region represents an inductive invariant: it contains all states where an inductive invariant holds.

Note how there is an arrow from a state where the invariant holds to a state
where the invariant doesn’t hold. That’s because invariants are not inductive in general.

In contrast, for states where the inductive invariant holds, all arrows that
start in those states terminate in states where the inductive invariant holds.

Finding an inductive invariant

The hardest part of proof by inductive invariance is finding the inductive invariant for your specification. If the invariant you come up with isn’t inductive, you won’t be able to prove it by induction.

You can use TLC to help find an inductive invariant. See Using TLC to Check Inductive Invariance for more details.

I want to learn more!

The TLA+ Hyperbook has a section called The TLA+ Proof Track about how to write proofs in TLA+ that can be checked mechanically using TLAPS.

I’ve only dabbled in writing proofs. Here are two that I’ve written that were checked by TLAPS, if you’re looking for simple examples:

TLA+ is hard to learn

I’m a fan of the formal specification language TLA+. With TLA+, you can build models of programs or systems, which helps to reason about their behavior.

TLA+ is particularly useful for reasoning about the behavior of multithreaded programs and distributed systems. By requiring you to specify behavior explicitly, it forces you to think about interleavings of events that you might not otherwise have considered.

The user base of TLA+ is quite small. I think one of the reasons TLA+ isn’t very popular is that it’s difficult to learn. There are at least three concepts in TLA+ that give new users trouble:

  • The universe as a state machine
  • Modeling programs and systems with math
  • Mathematical syntax

The universe as a state machine

TLA+ uses a state machine model. It treats the universe as a collection of variables whose values vary over time.

A state machine in the sense that TLA+ uses it is similar to, but not exactly the same as, the finite state machines that software engineers are used to. In particular:

  • A state machine in the TLA+ sense can have an infinite number of states.
  • When software engineers think about state machines, they think about a specific object or component being implemented as a finite state machine. In TLA+, everything is modeled as a state machine.

The state machine view of systems will feel familiar if you have a background in physics, because physicists use the same approach for system modeling: they define a state variable that evolves over time. If you squint, a TLA+ specification looks identical to a system of first-order differential equations, and associated boundary conditions. But, for the average software engineer, the notion of an entire system as an evolving state variable is a new way of thinking.

The state machine approach brings with it a set of concepts that you need to understand. In particular, you need to understand behaviors, which requires that you understand states, steps, and actions. Steps can stutter, and actions may or may not be enabled. For example, here’s the definition of “enabled” (I’m writing this from memory):

An action a is enabled for a state s if there exists a state t such that a is true for the step s→t.

It took me a long time to internalize these concepts to the point where I could just write that out without consulting a source. For a newcomer, who wants to get up and running as quickly as possible, each new concept that requires effort to understand decreases the likelihood of adoption.
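The definition of “enabled” above can be rendered as a toy check in plain Python (my own illustration, not TLA+): represent states as dictionaries, an action as a predicate on a pair of states, and enabledness as the existence of some successor state that makes the action true.

```python
# A toy rendering of "enabled": an action is a predicate on steps (pairs
# of states); it is enabled in state s if some state t makes it true for
# the step s -> t.

# The universe of states for a one-variable system: i in 0..3.
states = [{"i": n} for n in range(4)]

def decrement(s, t):
    """Action: i decreases by exactly 1 and stays non-negative."""
    return s["i"] > 0 and t["i"] == s["i"] - 1

def enabled(action, s, universe):
    return any(action(s, t) for t in universe)

print(enabled(decrement, {"i": 2}, states))  # True: t = {"i": 1} exists
print(enabled(decrement, {"i": 0}, states))  # False: no such t
```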

Modeling programs and systems with math

One of the commonalities across engineering disciplines is that they all work with mathematical models. These models are abstractions: simplified versions of the artifacts that we intend to build. That’s one of the things that attracts me about TLA+: it’s modeling for software engineering.

A mechanical engineer is never going to confuse the finite element model they have constructed on a computer with the physical artifact that they are building. Unfortunately, we software engineers aren’t so lucky. Our models superficially resemble the artifacts we build (a TLA+ model and a computer program both look like source code!). But models aren’t programs: a model is a completely different beast, and that trips people up.

Here’s a metaphor: You can think of writing a program as akin to painting, in that both are additive work: You start with nothing and you do work by adding content (statements to your program, paint to a canvas).

The simplest program, equivalent to an empty canvas, is one that doesn’t do anything at all. On Unix systems, there’s a program called true which does nothing but terminate successfully. You can implement this in shell as an empty file. (Incidentally, AT&T has copyrighted this implementation).

By contrast, when you implement a model, you do the work by adding constraints on the behavior of the state variables. It’s more like sculpting, where you start with everything, and then you chip away at it until you end up with what you want.

The simplest model, the one with no constraints at all, allows all possible behaviors. Where the simplest computer program does nothing, the simplest model does (really allows) everything. The work of modeling is adding constraints to the possible behaviors such that the model only describes the behaviors we are interested in.

When we write ordinary programs, the only kind of mistake we can really make is a bug, writing a program that doesn’t do what it’s supposed to. When we write a model of a program, we can also make that kind of mistake. But, we can make another kind of mistake, where our model allows some behavior that would never actually happen in the real world, or isn’t even physically possible in the real world.

Engineers and physicists understand this kind of mistake, where a mathematical model permits a behavior that isn’t possible in the real world. For example, electrical engineers talk about causal filters, which are filters whose outputs only depend on the past and present, not the future. You might ask why you even need a word for this, since it’s not possible to build a non-causal physical device. But it’s possible, and even useful, to describe non-causal filters mathematically. And, indeed, it turns out that filters that perfectly block out a range of frequencies are not causal.

For a new TLA+ user who doesn’t understand the distinction between models and programs, this kind of mistake is inconceivable, since it can’t happen when writing a regular program. Creating non-causal specifications (the software folks use the term “machine-closed” instead of “causal”) is not a typical error for new users, but underspecifying the behavior some variable of interest is very common.

Mathematical syntax

Many elements of TLA+ are taken directly from mathematics and logic. For software engineers used to programming language syntax, these can be confusing at first. If you haven’t studied predicate logic before, the universal (∀) and existential (∃) quantifiers will be new.

I don’t think TLA+’s syntax, by itself, is a significant obstacle to adoption: software engineers pick up new languages with unfamiliar syntax all of the time. The real difficulty is in understanding TLA+’s notion of a state machine, and that modeling is describing a computer program as permitted behaviors of a state machine. The new syntax is just one more hurdle.

Why we will forever suffer from missing timeouts, TTLs, and queue size bounds

If you’ve operated a software service, you will have inevitably hit one of the following problems:

A network call with a missing timeout.  Some kind of remote procedure call or other networking call is blocked waiting … forever, because there’s no timeout configured on the call.

Missing time-to-live (TTL). Some data that was intended to be ephemeral did not explicitly have a TTL set on it, and it didn’t get removed by normal means, and so its unexpected presence bit you.

A queue with no explicit size limit. A queue somewhere doesn’t have an explicitly configured upper bound on its size, the producers are consistently outpacing the consumers, and the queue eventually grows to a size you never expected.
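Concretely, here is what setting each bound explicitly looks like, sketched in Python (the specific values and the cache helpers are illustrative only, not recommendations):

```python
import queue
import socket
import time

# 1. A network call with an explicit timeout: the socket's blocking
#    operations can now wait at most 5 seconds instead of forever.
sock = socket.socket()
sock.settimeout(5.0)

# 2. Ephemeral data with an explicit TTL, enforced on read.
cache = {}  # key -> (value, expiry time)

def cache_put(key, value, ttl_seconds):
    cache[key] = (value, time.monotonic() + ttl_seconds)

def cache_get(key):
    value, expires = cache.get(key, (None, 0))
    return value if time.monotonic() < expires else None

# 3. A queue with an explicit upper bound: put() blocks (or raises, with
#    block=False) once 1000 items are waiting, instead of growing forever.
q = queue.Queue(maxsize=1000)

cache_put("session", "abc123", ttl_seconds=60)
print(cache_get("session"))
```

None of these limits is the default: each one had to be asked for explicitly, which is the whole problem.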

Unfortunately, the only good solution to these problems is diligence: we have to remember to explicitly set timeouts, TTLs, and queue sizes. There are two reasons:

It’s impossible for a library author to define a reasonable default for these values. Appropriate timeouts, TTLs, and queue sizes vary enormously from one use case to another; there simply isn’t a “reasonable” value to pick short of one so large that it’s basically the same as being unbounded.

Forcing users to always specify values is a lousy user experience for new users. Library authors could make these values required rather than optional. However, this makes the library more annoying for new users: it’s an extra step that forces them to make a decision they don’t really want to think about. They probably don’t even know what a reasonable value is when they’re first setting out.
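For illustration, here is the difference between the two API styles in Python (the client classes are hypothetical, not from any real library):

```python
# Two API designs for a hypothetical network client (names made up).

class LenientClient:
    # Friendly to new users: timeout is optional, defaulting to "wait
    # forever". This is the missing-timeout problem waiting to happen.
    def __init__(self, host, timeout=None):
        self.host, self.timeout = host, timeout

class StrictClient:
    # timeout is keyword-only with no default: every caller must decide.
    def __init__(self, host, *, timeout):
        self.host, self.timeout = host, timeout

StrictClient("example.com", timeout=5.0)   # fine
try:
    StrictClient("example.com")            # refuses to construct
except TypeError as e:
    print("rejected:", e)
```

The strict style trades a little up-front friction for never silently defaulting to infinity.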

I think forcing users to specify these limits would lead to more robust software, but I can see many users complaining about being forced to set these limits rather than defaulting to infinity. At least, that’s my guess about why library authors don’t do it.