Why incidents can’t be monocausal

When an incident happens, the temptation is strong to identify a single cause. It’s as if the system is a chain, and we’re looking for the weak link that was responsible for the chain breaking. But, in organizations that are going concerns, that isn’t how the system works. It can’t be, because there are simply too many things that can and do go wrong. Think of all the touch points in your system, how many opportunities there are for problems (bugs, typo in config, bad data, …). If any one of these was enough to take down the system, then it would be like a house of cards, falling down all of the time.

What happens in successful organizations as that the system evolves layers of defense, so that it can survive the kinds of individual problems that are always cropping up. Sure, the system still goes down, and more often than we would like. But the uptime is good enough that the company continues to survive.

Here’s an analogy that I’m borrowing from John Allspaw. Think about a significant new feature or service that your organization delivered successfully: one that took multiple quarters and required the collaboration of multiple teams. I’d wager that there were many factors that contributed to the success of this effort. Imagine if someone asked you: “what was the root cause for the success of this feature?”

So it is with incidents. Because an organization can’t prevent the occurrence of individual problems, the system evolves defenses to protect itself, created by the everyday work of the people in the company. Sure, the code we write might not even compile on the first try, but somehow the code that made it out to production is running well enough that the company is still in business. People are doing checks on the system all of the time, and most of this work is invisible.

For an incident to happen, multiple factors must have contributed to penetrate those layers of defenses that have evolved. I say that with confidence, because if a single event could take your system down, then it never would have made it this far to begin with. That’s why, when you dig into an incident, you’ll always find those multiple contributors.

Elegance: UI vs implementation

If you ask the question, “What is a Docker container?”, it turns out that the Linux operating system doesn’t actually have a notion of a container at all. Instead, a Docker container refers to a cobbled together set of Linux technologies, such as cgroups, network namespaces, and union filesystems. However, from the point of view of the end-user, a container is very much a real thing. In particular, it exposes to the user images as an entity, and a command-line tool for pulling down images from a repository and running them.

The Docker container implementation may be built (in the Unix tradition!) with duct table and baling wire, but the user interface is elegant. It’s easy for a new user to get started with Docker once they’ve installed it. Bryan Cantrill points out that the advantage of Docker over container technologies developed in the BSD world is Docker’s notion of images as effectively static binaries that allow developers to think operationally and move faster.

Contrast that with git. The implementation is quite elegant: git represents the data under version control as a hash tree, with pointers into nodes in the tree. Git commands are tree manipulations: adding leaf nodes, moving branches from one part of the tree to another, smooshing nodes together, and so on. (I used Subversion for years and had no idea what was going on under the hood).

On the other hand, the command-line interface that git exposes is a nightmare. It’s so hard to use that trying to build a more usable command-line interface is a full-blown academic research project.

The elegance of a user interface and the elegance of an implementation are orthogonal. One doesn’t necessarily lead to the other.

Inductive invariants

Note: to understand this post, it helps to have some familiarity with TLA+.

Leslie Lamport is a big advocate of inductive invariants. In Teaching Concurrency, he introduces the following simple concurrent algorithm:

Consider N processes numbered from 0 through N−1 in which each process i executes:

x[i] := 1;y[i] := x[(i−1) mod N]

and stops, where each x[i] initially equals 0.

Lamport continues:

This algorithm satisfies the following property: after every process has stopped, y[i] equals 1 for at least one process i.

The algorithm satisfies this property because it maintains an inductive invariant [emphasis added]. Do you know what that invariant is? If not, then you do not completely understand why the algorithm satisfies this property. How can a computer engineer design a correct concurrent system without understanding it?

Before I read up on TLA+, I was familiar with the idea of an invariant, a property that is always true throughout the execution of a program. But I didn’t know what an inductive invariant is.

Mathematical induction

An inductive invariant of a specification is an invariant you can prove using mathematical induction. You probably had to do proofs by induction in college or high school math.

To refresh your memory, let’s say you were asked to prove the following formula:

The way this is proved this by induction is:

1. Base case: Prove it for n=1
2. Inductive case: Assume it is true for n, and prove that it holds for n+1

In TLA+, you typically specify the allowed behaviors of a specification by writing two predicates:

1. The initial conditions, conventionally written as Init
2. The allowed state transitions (which are called steps in TLA+), conventionally written as Next

Let’s say we have a state predicate which we believe is an inductive invariant, which I’ll call Inv. To prove that Inv is an inductive invariant, you need to:

1. Base case: Prove that Inv is true when Init is true
2. Inductive case: Assume that Inv holds for an arbitrary state s, prove that Inv holds for an arbitrary state t where the step s→t is one of the state transitions permitted by Next.

In TLA+ syntax, the top-level structured proof of an inductive invariant looks like this:

THEOREM Spec=>[]Inv<1>1. Init => Inv<1>2. Inv /\ [Next]_vars => Inv'<1>3. QED   BY <1>1,<1>2

In practice, you want to prove some invariant that is not an inductive invariant. The proof is typically structured like this:

Correct = ...        \* The invariant you really want to proveInv = ... /\ Correct \* the inductive invariantTHEOREM Spec=>[]Correct <1>1. Init => Inv <1>2. Inv /\ [Next]_vars => Inv' <1>3. Inv => Correct <1>4. QED     BY <1>1, <1>2, <1>3

Not all invariants are inductive

If you’re using TLA+ to check the behavior of a system, you’ll have an invariant that you want to check, often referred to as a correctness property or a safety property. However, not all such invariants are inductive invariants. They usually aren’t.

As an example, here’s a simple algorithm that starts a counter at either 1 or 2, and then decrements it until it reaches 0. The algorithm is written in PlusCal notation.

--algorithm Decrement variable i \in {1, 2}; begin Outer:     while i>0 do     Inner:         i := i-1;     end while end algorithm

It should be obvious that i≥0 is an invariant of this specification. Let’s call it Inv. But Inv isn’t an inductive invariant. To understand why, consider the following program state:

pc="Inner"i=0

Here, Inv  is true, but Inv’ is false, because on the next step of the behavior, i will be decremented to -1.

Consider the above diagram, which is associated associated with some TLA+ specification (not the one described above). A bubble represents a state in TLA+. A bubble is colored gray if it is an initial state in the specification (i.e. if the Init predicate is true of that state). An arrow represents a step, where the Next action is true for the pair of states associated with the arrow.

The inner region shows all of the states that are in all allowed behaviors of
a specification. The outer region represents an invariant: it contains all states where the invariant holds. The middle region represents an inductive invariant: it contains all states where an inductive invariant holds.

Note how there is an arrow from a state where the invariant holds to a state
where the invariant doesn’t hold. That’s because invariants are not inductive in general.

In contrast, for states where the inductive invariant holds, all arrows that
start in those states terminate in states where the inductive invariant holds.

Finding an inductive invariant

The hardest part of proof by inductive invariance is finding the inductive invariant for your specification. If the invariant you come up with isn’t inductive, you won’t be able to prove it by induction.

You can use TLC to help find an inductive invariant. See Using TLC to Check Inductive Invariance for more details.

I want to learn more!

The TLA+ Hyperbook has a section called The TLA+ Proof Track about how to write proofs in TLA+ that can be checked mechanically using TLAPS.

I’ve only dabbled in writing proofs. Here are two that I’ve written that were checked by TLAPS, if you’re looking for simple examples:

TLA+ is hard to learn

I’m a fan of the formal specification language TLA+. With TLA+, you can build models of programs or systems, which helps to reason about their behavior.

TLA+ is particularly useful for reasoning about the behavior of multithread programs and distributed systems. By requiring you to specify behavior explicitly, it forces you to think about interleaving of events that you might not otherwise have considered.

The user base of TLA+ is quite small. I think one of the reasons that TLA+ isn’t very popular is that it’s difficult to learn. I think there are at least three concepts you need for TLA+ that give new users trouble:

• The universe as a state machine
• Modeling programs and systems with math
• Mathematical syntax

The universe as state machine

TLA+ uses a state machine model. It treats the universe as a collection of variables whose values vary over time.

A state machine in sense that TLA+ uses it is similar, but not exactly the same, as the finite state machines that software engineers are used to. In particular:

• A state machine in the TLA+ sense can have an infinite number of states.
• When software engineers think about state machines, they think about a specific object or component being implemented as a finite state machine. In TLA+, everything is modeled as a state machine.

The state machine view of systems will feel familiar if you have a background in physics, because physicists use the same approach for system modeling: they define a state variable that evolves over time. If you squint, a TLA+ specification looks identical to a system of first-order differential equations, and associated boundary conditions. But, for the average software engineer, the notion of an entire system as an evolving state variable is a new way of thinking.

The state machine approach requires a set of concepts that you need to understand. In particular, you need to understand behaviors, which requires that you understand statessteps, and actions. Steps can stutter, and actions may or may not be enabled. For example, here’s the definition of “enabled” (I’m writing this from memory):

An action a is enabled for a state s if there exists a state t such that a is true for the step s→t.

It took me a long time to internalize these concepts to the point where I could just write that out without consulting a source. For a newcomer, who wants to get up and running as quickly as possible, each new concept that requires effort to understand decreases the likelihood of adoption.

Modeling programs and systems with math

One of the commonalities across engineering disciplines is that they all work with mathematical models. These models are abstractions, objects that are simplified versions of the artifacts that we intend to build. That’s one of the thing that attracts me about TLA+, it’s modeling for software engineering.

A mechanical engineer is never going to confuse the finite element model they have constructed on a computer with the physical artifact that they are building. Unfortunately, we software engineers aren’t so lucky. Our models superficially resemble the artifacts we build (a TLA+ model and a computer program both look like source code!). But models aren’t programs: a model is a completely different beast, and that trips people up.

Here’s a metaphor: You can think of writing a program as akin to painting, in that both are additive work: You start with nothing and you do work by adding content (statements to your program, paint to a canvas).

The simplest program, equivalent to an empty canvas, is one that doesn’t do anything at all. On Unix systems, there’s a program called true which does nothing but terminate successfully. You can implement this in shell as an empty file. (Incidentally, AT&T has copyrighted this implementation).

By contrast, when you implement a model, you do the work by adding constraints on the behavior of the state variables. It’s more like sculpting, where you start with everything, and then you chip away at it until you end up with what you want.

The simplest model, the one with no constraints at all, allows all possible behaviors. Where the simplest computer program does nothing, the simplest model does (really allows) everything. The work of modeling is adding constraints to the possible behaviors such that the model only describes the behaviors we are interested in.

When we write ordinary programs, the only kind of mistake we can really make is a bug, writing a program that doesn’t do what it’s supposed to. When we write a model of a program, we can also make that kind of mistake. But, we can make another kind of mistake, where our model allows some behavior that would never actually happen in the real world, or isn’t even physically possible in the real world.

Engineers and physicists understand this kind of mistake, where a mathematical model permits a behavior that isn’t possible in the real world. For example, electrical engineers talk about causal filters, which are filters whose outputs only depend on the past and present, not the future. You might ask why you even need a word for this, since it’s not possible to build a non-causal physical device. But it’s possible, and even useful, to describe non-causal filters mathematically. And, indeed, it turns out that filters that perfectly block out a range of frequencies, are not causal.

For a new TLA+ user who doesn’t understand the distinction between models and programs, this kind of mistake is inconceivable, since it can’t happen when writing a regular program. Creating non-causal specifications (the software folks use the term “machine-closed” instead of “causal”) is not a typical error for new users, but underspecifying the behavior some variable of interest is very common.

Mathematical syntax

Many elements of TLA+ are taken directly from mathematics and logic. For software engineers used to programming language syntax, these can be confusing at first. If you haven’t studied predicate logic before, the universal (∀) and extensional (∃) quantifiers will be new.

I don’t think TLA+’s syntax, by itself, is a significant obstacle to adoption: software engineers pick up new languages with unfamiliar syntax all of the time. The real difficulty is in understanding TLA+’s notion of a state machine, and that modeling is describing a computer program as permitted behaviors of a state machine. The new syntax is just one more hurdle.

Why we will forever suffer from missing timeouts, TTLs, and queue size bounds

If you’ve operated a software service, you will have inevitably hit one of the following problems:

A network call with a missing timeout.  Some kind of remote procedure call or other networking call is blocked waiting … forever, because there’s no timeout configured on the call.

Missing time-to-live (TTL). Some data that was intended to be ephemeral did not explicitly have a TTL set on it, and it didn’t get removed by normal means, and so its unexpected presence bit you.

A queue with no explicit size limit. A queue somewhere doesn’t have an explicitly configured upper bound on its size, and somehow the producers are consistently outnumbering the consumers, and then your queue eventually grows to some size that you never expected to happen.

Unfortunately, the only good solution to these problems is diligence. We have to remember to explicitly set timeouts, TTLs, and queue sizes. there are two reasons:

It’s impossible for a library author to define a reasonable default for these values. Appropriate timeouts, TTLs, and queue sizes will vary enormously from one use case to another, there simply isn’t a “reasonable” value to pick without picking one so large that it’s basically the same as being unbounded.

Forcing users to always specify values is a lousy user experience for new users. Library authors could make these values required rather than optional. However, this makes it more annoying for new users of these libraries, it’s an extra step that forces them to make a decision they don’t really want to think about. They probably don’t even know what a reasonable value is when they’re first setting out.

I think forcing users to specify these limits would lead to more robust software, but I can see many users complaining about being forced to set these limits rather than defaulting to infinity. At least, that’s my guess about why library authors don’t do it.

The Tortoise and the Hare in TLA+

Problem: Determine if a linked list contains a cycle, using O(1) space.

Robert Floyd invented an algorithm for solving this problem in the late 60s, which is called “The Tortoise and the Hare”[1].

(This is supposedly a popular question to ask in technical interviews. I’m not a fan of expecting interviewees to re-invent the algorithms of famous computer scientists on the spot).

As an exercise, I implemented the algorithm in TLA+, using PlusCal.

The algorithm itself is very simple: the hardest part was deciding how to model linked lists in TLA+. We want to model the set of all linked lists, to express that algorithm should work for any element of the set. However, the model checker can only work with finite sets. The typical approach is to do something like “the set of all linked lists which contain up to N nodes”, and then run the checker against different values of N.

What I ended up doing was generating N nodes, giving each node a successor

(a next element, which could be the terminal value NIL), and then selecting

the start of the list from the set of nodes:

CONSTANTS N

ASSUME N \in Nat

Nodes == 1..N

NIL == CHOOSE NIL : NIL \notin Nodes

start \in Nodes
succ \in [Nodes -> Nodes \union {NIL}]


(Note: I originally defined NIL as a constant, but this is incorrect, see the comment by Ron Pressler for more details).
This is not the most efficient way from a model checker point of view, because the model checker will generate nodes that are irrelevant because they aren’t in the list. However, it does generate all possible linked lists up to length N.

Superficially, this doesn’t look like the pointer-and-structure linked list you’d see in C, but it behaves the same way at a high level. It’s possible to model a memory address space and pointers in TLA+, but I chose not to do so.

In addition, the nodes of a linked list typically have an associated value, but Floyd’s algorithm doesn’t use this value, so I didn’t model it.

Here’s my implementation of the algorithm:

EXTENDS Naturals

CONSTANTS N

ASSUME N \in Nat

Nodes == 1..N

NIL == CHOOSE NIL : NIL \notin Nodes

(*
--fair algorithm TortoiseAndHare

variables
start \in Nodes,
succ \in [Nodes -> Nodes \union {NIL}],
cycle, tortoise, hare, done;
begin
h0: tortoise := start;
hare := start;
done := FALSE;
h1: while ~done do
h2: tortoise := succ[tortoise];
hare := LET hare1 == succ[hare] IN
IF hare1 \in DOMAIN succ THEN succ[hare1] ELSE NIL;
h3: if tortoise = NIL \/ hare = NIL then
cycle := FALSE;
done := TRUE;
elsif tortoise = hare then
cycle := TRUE;
done := TRUE;
end if;
end while;

end algorithm
*)

I wanted to use the model checker to verify the the implementation was correct:

PartialCorrectness == pc="Done" => (cycle <=> HasCycle(start))


(See the Correctness wikipedia page for why this is called “partial correctness”).

To check correctness, I needed to implement my HasCycle operator (without resorting to Floyd’s algorithm). I used the transitive closure of the successor function for this, which is called TC here. If the transitive closure contains the pair (node, NIL), then the list that starts with node does not contain a cycle:

HasCycle(node) == LET R == {<<s, t>> \in Nodes \X (Nodes \union {NIL}): succ[s] = t }
IN <<node, NIL>> \notin TC(R)


The above definition is “cheating” a bit, since we’ve defined a cycle by what it isn’t. It also wouldn’t work if we allowed infinite lists, since those don’t have cycles nor do they have NIL nodes.

Here’s a better definition: a linked list that starts with node contains a cycle if there exists some node n that is reachable from node and from itself.

HasCycle2(node) ==
LET R == {<<s, t>> \in Nodes \X (Nodes \union {NIL}): succ[s] = t }
IN \E n \in Nodes : /\ <<node, n>> \in TC(R)
/\ <<n, n>> \in TC(R)


To implement the transitive closure in TLA+, I used an existing implementation

from the TLA+ repository itself:

TC(R) ==
LET Support(X) == {r[1] : r \in X} \cup {r[2] : r \in X}
S == Support(R)
RECURSIVE TCR(_)
TCR(T) == IF T = {}
THEN R
ELSE LET r == CHOOSE s \in T : TRUE
RR == TCR(T \ {r})
IN  RR \cup {<<s, t>> \in S \X S :
<<s, r>> \in RR /\ <<r, t>> \in RR}
IN  TCR(S)


The full model is the lorin/tla-tortoise-hare repo on GitHub.

1. Thanks to Reginald Braithwaite for the reference in his excellent book Javascript Allongé.  ↩

Sketching on the back end

In his essay entitled, The Cognitive View: A Different Look at Software Design, Robert Glass made the following observation about how software engineers do design work:

What we see here is a mental process, a very rapid process, an iterative process, a process in fact of fast trial and error. The mind forms a solution to the problem, knowing that it will be inadequate because the mind is not yet able to fully grasp all the facets of the problem. That problem solution, the mind knows, must be in the form of a model, because it is going to be necessary to try sample input against the model, run a quick simulation (inside the mind) on the model using the sample input, and get sample outputs (still inside the mind) from that simulation.

The essence of design, then, is rapid modeling and simulation. And a key factor in design is the ability to propose solutions and allow them to fail!

Design work incorporates trial and error. We think up a potential solution, and then try to evaluate whether it’s a good fit for the problem.

But not all potential solutions are so easy to evaluate entirely inside one’s head. In fields where designers are working on physical designs or visual interfaces, they use sketches to help with early solution evaluation, sometimes called formative evaluation. Sketching user experiences is a great book on this topic.

For those of us working on the back-end, we don’t think of this early evaluation of solutions as involving “sketching”, but the metaphor fits. One of the things I find appealing about the Alloy modeling language is that it allows you to sketch out different data models and check their properties.

In a recent blog post, Michael Fogus describes working with clojure.spec as a form of sketching. He also makes the following observation about JSON.

While Lisp programmers have long know the utility of sketching their structs, I’m convinced that one of the primary reasons that JSON has taken over the world is that it provides JS-direct syntactic literals that can be easy typed up and manipulated with little fuss. JSON is a perfectly adequate tool for sketching for many programmers.

I believe UML failed because the designers did not understand the role that box-and-arrow diagrams played in software engineering design. They wanted to build a visual modeling language with well-defined semantics: we wanted to sketch.