Conveying confusion without confusing the reader

Confusion is a hallmark of a complex incident. In the moment, we know something is wrong, but we struggle to make sense of the different signals that we’re seeing. We don’t understand the underlying failure mode.

After the incident is over and the engineers have had a chance to dig into what happened, these confusing signals make sense in retrospect. We find out about the bug or inadvertent config change or unexpected data corruption that led to the symptoms we saw during the incident.

When writing up the narrative, the incident investigator must choose whether to inform the reader in advance about the details of the failure mode, or to withhold this info until the point in time in the narrative when the engineers involved understood what was happening.

I prefer the first approach: giving the reader information about the failure mode details in the narrative before the actors involved in the incident have that information. This enables the reader to make sense of the strange, anomalous signals in a way that the engineers in the moment were not able to.

I do this because, as a reader, I don’t enjoy the feeling of being confused: I’m not looking for a mystery when I read a writeup. If I’m reading about a series of confusing signals that engineers are looking at (e.g., traffic spikes, RPC errors), and I can’t make sense of them either, I tend to get bored. It’s just a mess of confusion.

On the other hand, if I know why these signals are happening, but the characters in the story don’t know, then that is more effective in creating tension in my mind. I want to read on to resolve the tension, to figure out how the engineers ended up diagnosing the problem.

When informing the reader about the failure mode in advance, the challenge is to avoid infecting the reader with hindsight bias. If the reader thinks, “the problem was obviously X. How could they not see it?”, then I’ve failed in the writeup. What I try to do is put the reader into the head of the people involved as much as possible: to try to convey the confusion they were experiencing in the moment, and the source of that confusion.

By enabling the reader to identify with the people involved, you can communicate to the reader how confusing the situation was to the people involved, without directly inflicting that same confusion upon them.

Climbing the mountain

When I was in high school, I attended a Jewish weekend retreat in the Laurentian Mountains of Quebec1. While most of the attendees were secular Jews like me, one of them was a Chabadnik, and several of us got into a discussion about Judaism and scholarship.

One of the secular Jews lamented that it was an insurmountable task to properly understand Judaism: there were just too many texts you had to study. If we were lucky, we knew a little Hebrew, but certainly not enough to study the Hebrew texts (let alone the texts in other languages!).

The Chabadnik offered the following metaphor. Imagine a mountain, with an impossibly high peak. Studying Judaism is like climbing the mountain. People who have previously studied material will be higher up on the mountain than those who haven’t studied as much. However, regardless of your current elevation, you can always climb higher than where you are, by studying material appropriate for your level.

So it is with learning more about resilience engineering. Fortunately for those who seek to learn more about resilience, it’s a much younger field than Judaism. You need to contend with only decades of scholarship, rather than centuries. Still, being confronted with decades of research papers can be intimidating. But don’t let that stop you from trying to learn just a little bit more than you currently know.

I once heard Richard Cook say that the most effective way to get better at analyzing incidents was to first study how incidents happen in a field other than your own. Most of us will never have the opportunity to devote years of study to a different field! On the other hand, he also said that having a ten-to-fifteen-minute huddle after an incident to discuss what happened can also be a very effective learning mechanism.

You don’t need to read mountains of papers to start getting better at learning from incidents. It can be as simple as asking different kinds of questions in retrospectives (e.g., “When you saw the alert go off, what did you do next?”). One of the things I really like about resilience engineering is how it values expertise borne out of experience. I think you’ll learn more by trying out different questions to ask in incident retros than you will from reading the papers. (Although reading the papers will eventually help you ask better questions).

Diane Vaughan, a sociology researcher, spent six years studying a single incident! That’s a standard that none of us can hope to meet. And that means we won’t obtain the depth of insight that Vaughan was able to in her investigation, but that’s ok.

Don’t be intimidated by the height of the mountain. Don’t worry about reaching the top (there isn’t one), or even reaching a certain height. The important thing is to ascend: to work to climb higher than you currently are.

1 I attended a Jewish elementary school, but a public high school. In high school, my parents encouraged me to attend these sorts of programs to maintain some semblance of Jewish identity.

Taking a risk versus running a risk

In the wake of an incident, we can often identify a risky action that was taken by an engineer that contributed to the incident. However, actions that look risky to us in retrospect didn’t necessarily look risky to the engineer who took the action in the moment. In the SINTEF A17034 report on Organizational Accidents and Resilient Organisations: Six Perspectives, the authors draw a distinction between taking a risk and running a risk.

When you take a risk, you are taking an action that you know to be risky. When an engineer says they are YOLO’ing a change, they’re taking a risk.

On the other hand, running a risk refers to taking a course of action that is not believed to be risky. These are the kinds of actions that we only categorize as risky in hindsight, when we have more information than the engineer who took the course of action in the moment.

Sometimes we deliberately take a risk because we believe there is greater risk if we don’t take action. But running a risk is never deliberate, because we didn’t know the risk was there in the first place.

Stories as a vehicle for learning from the experience of others

Senior software engineering positions command higher salaries than junior positions. The industry believes (correctly, I think) that engineers become more effective as they accumulate experience, and that perception is reflected in market salaries.

Learning from direct experience is powerful, but there’s a limit to the rate at which we can learn from our own experiences. Certainly, we learn more from some experiences than others; we joke about “ten years of experience” versus “one year of experience ten times over”, as well as using scars as a metaphor for these sometimes unpleasant but more impactful experiences. But there are only so many hours in a day, and we may not always be…errr… lucky enough to be exposed to many high-value learning opportunities.

There’s another resource we can draw on besides our own direct experience, and that’s the experiences of peers in our organization. Learning from the experiences of others isn’t as effective as learning directly from our own experience. But, if the organization you work in is large enough, then high-value learning opportunities are probably happening around you all of the time.

Given that these opportunities abound, the challenge is: how can we learn effectively from the experiences of others? One way that humans learn from others is through telling stories.

Storytelling enables a third party to experience events by proxy. When we tell a story well, we run a simulation of the events in the mind of the listener. This kind of experience is not as effective as the first-hand kind, but it still leaves an impression on the listener when done well. In addition, storytelling scales very well: we can write down stories, or record them, and then publish these across the organization.

A second challenge is: what stories should we tell? It turns out that incidents make great stories. You’ll often hear engineers tell tales of incidents to each other. We sometimes call these war stories, horror stories (the term I prefer), or ghost stories.

Once we recognize the opportunity of using incidents as a mechanism for second-hand-experiential-learning-through-storytelling, this shifts our thinking about the role and structure of an incident writeup. We want to tell a story that captures the experiences of the people involved in the incident, so that the readers can imagine what it was like, in the moment, when the alerts were going off and confusion reigned.

When we want to use incidents for second-hand experiential learning, it shifts the focus of an incident investigation away from action items as the primary outcome and towards the narrative, the story we want to tell.

When we hire for senior positions, we don’t ask candidates to submit a list of action items for tasks that could improve our system. We believe the value of their experience lies in them being able to solve novel problems in the future. Similarly, I don’t think we should view incident investigations as being primarily about generating action items. If, instead, we view them as an opportunity to learn collectively from the experiences of individuals, then more of us will get better at solving novel problems in the future.

The Gamma Knife model of incidents

Safety researchers love using metaphors as a framework to describe how accidents happen, which they call accident models.

One of the earliest models, dating back to 1931, is Herbert W. Heinrich’s domino model of accident causation:


About sixty years later, in 1990, James Reason proposed the Swiss cheese model of accident causation:

Image: Davidmack, own work, CC BY-SA 3.0.

About seven years later, in 1997, Jens Rasmussen proposed the dynamic safety model. This model doesn’t have as evocative a name as “domino” or “Swiss cheese”. I like to call it the “boundary” model, because everyone talks about it in terms of drifting towards a safety boundary:

This diagram originally appears in Rasmussen’s paper Risk management in a dynamic society: a modelling problem. I re-created the diagram from that paper.

I haven’t encountered a good metaphor that captures the role of multiple contributing factors in incidents. I’m going to propose one and call it the Gamma Knife model of incidents.

The Gamma Knife is a system that surgeons use to treat brain tumors by focusing multiple beams of gamma radiation on a small volume inside the brain.

Multiple beams of gamma radiation converge on the target. From the Radiosurgery Wikipedia page.

Each individual beam is of low enough intensity that it doesn’t affect brain tissue. It is only when multiple beams intersect at one point that the combined intensity of the radiation has an impact.

Every day inside of your system, there are things that are happening (or not happening(!)) that could potentially enable an incident. You can think of each of these as a low-level beam of gamma radiation going off in a random direction. Somebody pushes a change to production, zap! Somebody makes a configuration change with a typo, zap! Somebody goes on vacation, zap! There’s an on-call shift change, zap! A particular service hasn’t been deployed in weeks, zap!

Most of these zaps are harmless: they have no observable impact on the health of the overall system. Sometimes, though, many of these zaps will happen to go off at the same time and all point to the same location. When that happens, boom, you have an incident on your hands.
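The metaphor can be sketched as a toy simulation. This is purely illustrative: the zap counts, alignment probability, and incident threshold below are all made-up numbers, not measurements of any real system. The point is only to show how many individually harmless events can occasionally coincide.

```python
import random

random.seed(42)

# Toy Monte Carlo sketch of the Gamma Knife metaphor (hypothetical numbers).
# Each day, many independent low-intensity "zaps" fire (deploys, config
# changes, on-call handoffs, ...). Each one independently happens to point
# at the same vulnerable spot with small probability. No single zap causes
# an incident; only when several converge does the combined "intensity"
# cross the threshold.

ZAPS_PER_DAY = 50   # hypothetical: events per day that could contribute
P_ALIGNED = 0.02    # hypothetical: chance a zap points at the vulnerable spot
THRESHOLD = 4       # hypothetical: converging zaps needed for an incident
DAYS = 10_000

incidents = 0
for _ in range(DAYS):
    aligned = sum(random.random() < P_ALIGNED for _ in range(ZAPS_PER_DAY))
    if aligned >= THRESHOLD:
        incidents += 1

print(f"incident days: {incidents} / {DAYS}")
```

With these made-up parameters, the vast majority of days pass without incident even though dozens of zaps fire every day; the rare incident days are the ones where several zaps happened to line up.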

Alas, there’s no way to get rid of all of those little beams of radiation that go off. You can eliminate some of them, but in the process, you’ll invariably create new ones. There are some you can’t avoid, and there are many that you don’t even see, unless you know how to look for them. One of the reasons I am interested in otherwise harmless operational surprises is that they can reveal the existence of previously unknown beams.

In service of the narrative

The most important part of an operational surprise writeup is the narrative description. That section of the writeup tells the story of how the surprise unfolded over time, taking into account the perspectives of the different people who were involved. If you want your readers to learn about how work is done in your organization, you need to write effective narrative descriptions.

Narrative descriptions need to be engaging. The best ones are vivid and textured: they may be quite long, but they keep people reading until the end. A writeup with a boring narrative has no value, because nobody will read through it.

Writing engaging narrative descriptions is hard. Writing is a skill, and like all skills, the only way to get better is through practice. That being said, there are some strategies that I try to keep in mind to make my narrative descriptions more effective. In this blog post, I cover a few of them.

Goal is learning, not truth or completeness

At a high level, it’s important to keep in mind what you’re trying to achieve with your writeup. I’m interested in maximizing how much the reader will learn from the writeup. That goal should drive decisions you make on what to include and how to word things.

I’m not trying to get at the truth, because the truth is inaccessible. I’ll never know what really happened, and that’s ok, because my goal of learning doesn’t require perfect knowledge of the history of the world.

I’m also not trying to be complete; I don’t try to convey every single bit of data in the narrative that I’ve been able to capture in an investigation. For example, I don’t include every single exchange of a chat conversation in a narrative.

Because of my academic background, this is an instinct I have to fight: academics tend towards being as complete as possible in writing things up. However, including an inappropriate level of detail makes the narrative harder to read.

I do include a “raw timeline” section in the appendix with many of the low-level events that have been captured (chat transcripts, metrics data, times when relevant production changes happened). These details don’t all make it into the narrative description, but they’re available if the reader wants to consult them.

Treat the people involved like people

Effective fiction authors create characters that you can empathize with. They convey what the characters see, what they feel, what they have experienced, what motivates them. If a character in a movie or a novel makes a decision that doesn’t seem to make sense to us, we get frustrated. We consider that lousy writing.

In a narrative description, you have to describe actions taken by people. These aren’t fictional characters, they are real people; they are the colleagues that you work alongside every day. However, like the characters in a good piece of fiction, your colleagues also make decisions based on what they see, what they feel, what they have experienced, and what motivates them.

The narrative must answer this question for the reader: How did it make sense for the people involved to come to their conclusions and take their actions? In order for your reader to learn this, you need to convey details such as what they were seeing, what they were thinking, what they knew and what they did not know. You want to try to tell the part of the narrative that describes their actions from their perspective.

One of the challenges is that you won’t have easy access to these details. That’s why an important precursor to doing a writeup is to talk with the people involved to try to get as much information as you can about how the world looked from their eyes as events were unfolding. Doing that well is too big a topic for this post.

Start with some background

I try never to start my narratives with “An alert fired for …”. There’s always a history behind the contributing factors that enabled the surprise. For the purposes of the writeup, that means starting the narrative further back in time, to tell the reader some of the relevant history.

You won’t be able to describe the historical information with the same level of vividness as the unfolding events, because it happened much further back in time, and the tempo of this part of the narrative is different from the part that describes the unfolding events. But that’s ok.

It’s also useful to provide additional context about how the overall system works, to help readers who may not be as familiar with the specific details of the systems involved. For example, you may have to explain what the various services involved actually do. Don’t be shy about adding this detail, since people who already know it will just skim this part. Adding these details also makes these writeups useful for new hires to learn how the system works.

Make explicit how details serve the narrative

If you provide details in your narrative description, it has to be obvious to the reader why you are telling them these details. For example, if you write that an alert fired eight hours before the surprise, you need to make it obvious to the reader why this alert is relevant to the narrative. There may be very different reasons, for example:

  • This alert had important information about the nature of the operational surprise. However, it was an email-only alert, not a paging one. And it was one of many email alerts that had fired, and those alerts are typically not actionable. It was ignored, just like the other ones.
  • The alert was a paging alert, and the on-call who engaged concluded that it was just noise. In fact, it was noise. However, when the real alert fired eight hours later, the symptom was the same, and the on-call assumed it was another example of noise.
  • The alert was a paging alert. The particular alert was unrelated to the surprise that would happen later, but it woke the on-call up in the middle of the night. They were quite tired the next day, when the surprise happened.

If you just say, “an alert fired earlier” without more detail, the reader doesn’t know why they should care about this detail in the writeup, which makes the writing less engaging. See also: The Law of Conservation of Detail.

Write in the present tense

This is just a stylistic choice of mine, but I find that if I write narratives in the present tense (e.g., “When X looks at the Y dashboard, she notices that signal Z has dropped…”), it reinforces the idea that the narrative is about understanding events as they were unfolding.

Use retrospective knowledge for foreshadowing

Unbeknownst to the princess but knownst to us, danger lurks in the stars above…

Opening crawl from the movie “Spaceballs”

When you are writing up a narrative description, you know a lot more about what happened than the people who were directly involved in the operational surprise as it was happening.

You can use this knowledge to make the writing more compelling through foreshadowing. You know about the consequences of actions that the people in the narrative don’t.

To help prevent the reader from falling into the trap of hindsight bias, make it as explicit as possible in your writeup that the knowledge the reader has is not knowledge that the people involved had. For example:

At 11:39, X takes action Y. What X does not know is that, six months earlier, Z had deployed a change to service Q, which changes what happens when action Y is taken.

This type of foreshadowing is helpful for two reasons:

  • It pushes against hindsight bias by calling out explicitly how it came to be that a person involved had a mental model that deviated from reality.
  • It creates “what happened next?” tension in the reader, encouraging them to read on.


We all love stories. We learn best from our own direct experiences, but storytelling provides an opportunity for us to learn from the experiences of others. Writing effective narratives is a kind of superpower because it gives you the ability to convey enormous amounts of detail to a large number of people. It’s a skill worth developing.