In service of the narrative

The most important part of an operational surprise writeup is the narrative description. That section of the writeup tells the story of how the surprise unfolded over time, taking into account the perspectives of the different people who were involved. If you want your readers to learn about how work is done in your organization, you need to write effective narrative descriptions.

Narrative descriptions need to be engaging. The best ones are vivid and textured: they may be quite long, but they keep people reading until the end. A writeup with a boring narrative has no value, because nobody will read through it.

Writing engaging narrative descriptions is hard. Writing is a skill, and like all skills, the only way to get better is through practice. That being said, there are some strategies that I try to keep in mind to make my narrative descriptions more effective. In this blog post, I cover a few of them.

Goal is learning, not truth or completeness

At a high level, it’s important to keep in mind what you’re trying to achieve with your writeup. I’m interested in maximizing how much the reader will learn from the writeup. That goal should drive decisions you make on what to include and how to word things.

I’m not trying to get at the truth, because the truth is inaccessible. I’ll never know what really happened, and that’s ok, because my goal of learning doesn’t require perfect knowledge of the history of the world.

I’m also not trying to be complete; I don’t try to convey every single bit of data in the narrative that I’ve been able to capture in an investigation. For example, I don’t include every single exchange of a chat conversation in a narrative.

Because of my academic background, this is an instinct I have to fight: academics tend towards being as complete as possible when writing things up. However, including an inappropriate level of detail makes the narrative harder to read.

I do include a “raw timeline” section in the appendix with lots of low level events that have been captured (chat transcripts, metrics data, times of when relevant production changes happened). These details don’t all make it into the narrative description, but they’re available if the reader wants to consult them.

Treat the people involved like people

Effective fiction authors create characters that you can empathize with. They convey what the characters see, what they feel, what they have experienced, what motivates them. If a character in a movie or a novel makes a decision that doesn’t seem to make sense to us, we get frustrated. We consider that lousy writing.

In a narrative description, you have to describe actions taken by people. These aren’t fictional characters, they are real people; they are the colleagues that you work alongside every day. However, like the characters in a good piece of fiction, your colleagues also make decisions based on what they see, what they feel, what they have experienced, and what motivates them.

The narrative must answer this question for the reader: How did it make sense for the people involved to come to their conclusions and take their actions? In order for your reader to learn this, you need to convey details such as what they were seeing, what they were thinking, what they knew and what they did not know. You want to try to tell the part of the narrative that describes their actions from their perspective.

One of the challenges is that you won’t have easy access to these details. That’s why an important precursor to doing a writeup is to talk with the people involved to try to get as much information as you can about how the world looked from their eyes as events were unfolding. Doing that well is too big a topic for this post.

Start with some background

I try never to start my narratives with “An alert fired for …”. There’s always a history behind the contributing factors that enabled the surprise. For the purposes of the writeup, that means starting the narrative further back in time, to tell the reader some of the relevant history.

You won’t be able to describe the historical information with the same level of vividness as the unfolding events, because it happened much further back in time, and the tempo of this part of the narrative is different from the part that describes the unfolding events. But that’s ok.

It’s also useful to provide additional context about how the overall system works, to help readers who may not be as familiar with the specific details of the systems involved. For example, you may have to explain what the various services involved actually do. Don’t be shy about adding this detail, since people who already know it will just skim this part. Adding these details also makes these writeups useful for new hires to learn how the system works.

Make explicit how details serve the narrative

If you provide details in your narrative description, it has to be obvious to the reader why you are telling them these details. For example, if you write that an alert fired eight hours before the surprise, you need to make it obvious to the reader why this alert is relevant to the narrative. There may be very different reasons, for example:

  • This alert had important information about the nature of the operational surprise. However, it was an email-only alert, not a paging one, and it was one of many email alerts that had fired; since those alerts are typically not actionable, it was ignored, just like the other ones.
  • The alert was a paging alert, and the on-call who engaged concluded that it was just noise. In fact, it was noise. However, when the real alert fired eight hours later, the symptom was the same, and the on-call assumed it was another example of noise.
  • The alert was a paging alert. The particular alert was unrelated to the surprise that would happen later, but it woke the on-call up in the middle of the night. They were quite tired the next day, when the surprise happened.

If you just say, “an alert fired earlier” without more detail, the reader doesn’t know why they should care about this detail in the writeup, which makes the writing less engaging. See also: The Law of Conservation of Detail.

Write in the present tense

This is just a stylistic choice of mine, but I find that if I write narratives in the present tense (e.g., “When X looks at the Y dashboard, she notices that signal Z has dropped…”), it reinforces the idea that the narrative is about understanding events as they were unfolding.

Use retrospective knowledge for foreshadowing

Unbeknownst to the princess but knownst to us, danger lurks in the stars above…

Opening crawl from the movie “Spaceballs”

When you are writing up a narrative description, you know a lot more about what happened than the people who were directly involved in the operational surprise as it was happening.

You can use this knowledge to make the writing more compelling through foreshadowing. You know about the consequences of actions that the people in the narrative don’t.

To help prevent the reader falling into the trap of hindsight bias, make it as explicit as possible in your writeup that the knowledge the reader has is not knowledge that the people involved had. For example:

At 11:39, X takes action Y. What X does not know is that, six months earlier, Z had deployed a change to service Q, which changes what happens when action Y is taken.

This type of foreshadowing is helpful for two reasons:

  • It pushes against hindsight bias by calling out explicitly how it came to be that a person involved had a mental model that deviated from reality.
  • It creates “what happened next?” tension in the reader, encouraging them to read on.

Conclusion

We all love stories. We learn best from our own direct experiences, but storytelling provides an opportunity for us to learn from the experiences of others. Writing effective narratives is a kind of superpower because it gives you the ability to convey enormous amounts of detail to a large number of people. It’s a skill worth developing.

The problem with counterfactuals

Incidents make us feel uncomfortable. They remind us that we don’t have control, that the system can behave in ways that we didn’t expect. When an incident happens, the world doesn’t make sense.

A natural reaction to an incident is an effort to identify how the incident could have been avoided. The term for this type of effort is counterfactual reasoning. It refers to thinking about how, if the people involved had taken different actions, events would have unfolded differently. Here are two examples of counterfactuals:

  • If the engineer who made the code change had written a test for feature X, then the bug would never have made its way into production.
  • If the team members had paid attention to the email alerts that had fired, they would have diagnosed the problem much sooner.

Counterfactual reasoning is comforting because it restores the feeling that the world makes sense. What felt like a surprise is, in fact, perfectly comprehensible. What’s more, it could even have been avoided, if only we had taken the right actions and paid attention to the right signals.

While counterfactual reasoning helps restore our feeling that the world makes sense, the problem with it is that it doesn’t help us get better at avoiding or dealing with future incidents. The reason it doesn’t help is that counterfactual reasoning gives us an excuse to avoid the messy problem of understanding how we missed those obvious-in-retrospect actions and signals in the first place.

It’s one thing to say “they should have written a test for feature X”. It’s another thing to understand the rationale behind the engineer not writing that test. For example:

  • Did they believe that this functionality was already tested in the existing test suite?
  • Were they not aware of the existence of the feature that failed?
  • Were they under time pressure to get the code pushed into production (possibly to mitigate an ongoing issue)?

Similarly, saying “they should have paid closer attention to the email alerts” means you might miss the fact that the email alert in question isn’t actionable 90% of the time, and so the team has conditioned themselves to ignore it.

To get better at avoiding or mitigating future incidents, you need to understand the conditions that enabled past incidents to occur. Counterfactual reasoning is actively harmful for this, because it circumvents inquiry into those conditions. It replaces “what were the circumstances that led to person X taking action Y” with “person X should have done Z instead of Y”.

Counterfactual reasoning is only useful if you have a time machine and can go back to prevent the incident that just happened. For the rest of us who don’t have time machines, counterfactual reasoning helps us feel better, but it doesn’t make us better at engineering and operating our systems. Instead, it actively prevents us from getting better.

Don’t ask “why didn’t they do Y instead of X?” Instead, ask, “how was it that doing X made sense to them at the time?” You’ll learn a lot more about the world if you ask questions about what did happen instead of focusing on what didn’t.

Experts aren’t good at building shared understanding

If only HP knew what HP knows, we would be three times more productive.

Lew Platt, former CEO of Hewlett-Packard

One pattern that you see over and over again in operational surprises is that a person who was involved in the surprise was missing some critical bit of information. For example, there may be an implicit contract that is violated when someone makes a code change. Or a certain batch job that runs every Tuesday at 4PM might put some additional load on the database.

Almost always, this kind of information is present in the head of someone else within the organization. It just wasn’t in the head of the person who really needed it at that moment.

I think the problem of missing information is well understood enough that you see variants of it crop up in different places.

It turns out that experts are very good at accumulating these critical bits of information and recalling them at the appropriate time. Experts are also very good at communicating efficiently with others who share a lot of that critical information in their heads.

However, what experts are not very good at is transmitting this information to others who don’t yet have it. Experts aren’t explicitly aware of the value of all of this information, and so they tend not to volunteer it without being asked. When a newcomer watches an expert in action, a common refrain is, “how did you know to do that?”

The fact that experts aren’t good at sharing the useful information that they know is one of the challenges that incident investigators face. One of the skills of an investigator is knowing how to elicit these bits of knowledge through interviews.

I think that advancing shared understanding in an organization has the potential to be enormously valuable. One of the things that I hope to accomplish with sharing out writeups of operational surprises is to use them as a vehicle for doing so.

Even if there isn’t a single actionable outcome from a writeup, you never know when that critical bit of knowledge that has been implanted in the heads of the readers will come in handy.

Tuning to the future

In short, the resilience of a system corresponds to its adaptive capacity tuned to the future. [emphasis added]

Branlat, Matthieu & Woods, David. (2010). How do systems manage their adaptive capacity to successfully handle disruptions? A resilience engineering perspective. AAAI Fall Symposium – Technical Report

In simple terms, an incident is a bad, unexpected thing that has happened. This is just the sort of thing that makes people feel uneasy. Instinctively, we want to be able to say “We now understand what has happened, and we are taking the appropriate steps to make sure that this never happens again.”

But here’s the thing. Taking steps to prevent the last incident from recurring doesn’t do anything to help you deal with the next incident, because your steps will have ensured that the next one is going to be completely different. There is, however, one thing that your next incident will have in common with the last one: both of them are surprises.

We can’t predict the future, but we can get better at anticipating surprise, and dealing with surprise when it happens. Getting better at dealing with surprise is what resilience engineering is all about.

The first step is accepting that surprise is inevitable. That’s hard to do. We want to believe that we are in control of our systems, that we’ve plugged all of the holes. Sure, we may have had a problem before, but we fixed that. If we can just take the time to build it right, it’ll work properly.

Accepting that future operational surprises are inevitable isn’t natural for engineers. It’s not the way we think. We design systems to solve problems, and one of the problems is staying up. We aren’t fatalists.

However, once we do accept that operational surprise is inevitable, we can shift our thinking of the system from the computer-based system to the broader socio-technical system that includes both the people and the computers. The solution space here looks very different, because we aren’t used to thinking about designing systems where people are part of the system, especially when we engineers are part of the system we’re building!

But if we want the ability to handle the things the future is going to throw at us, then we need to get better at dealing with surprise. Computers are lousy at this: they can’t adapt to situations they weren’t designed to handle. But people can.

In this frame, accepting that operational surprises are inevitable isn’t fatalism. Building adaptive capacity to deal with future surprises is how we tune to the future.

Contributors, mitigators & risks: Cloudflare 2019-07-02 outage

John Graham-Cumming, Cloudflare’s CTO, wrote a detailed writeup of a Cloudflare incident that happened on 2019-07-02. Here’s a categorization similar to the one I did for the Stripe outage.

Note that Graham-Cumming has a “What went wrong” section in his writeup where he explicitly enumerates 11 different contributing factors; I’ve sliced things a little differently here: I’ve taken some of those verbatim, reworded some of them, and left out some others.

All quotes from the original writeup are in italics.

Contributing factors

Remember not to think of these as “causes” or “mistakes”. They are merely all of the things that had to be true for the incident to manifest, or for it to be as severe as it was.

Regular expression led to catastrophic backtracking

A regular expression used in a firewall engine rule resulted in catastrophic backtracking:

(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))
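The nested quantifiers in that pattern are what make the backtracking catastrophic. As an illustration (using a deliberately tiny pattern of my own, not Cloudflare’s), the sketch below shows how a backtracking engine’s worst-case match time roughly doubles with every extra input character:

```python
import re
import time

# A tiny pathological pattern, standing in for Cloudflare's much larger
# one. The nested quantifiers in (a+)+ give a backtracking engine
# exponentially many ways to partition a run of "a"s, and it tries all
# of them once the trailing "b" fails to match.
PATHOLOGICAL = re.compile(r"(a+)+b")

def time_failing_match(n: int) -> float:
    """Time a worst-case (failing) match against n copies of 'a'."""
    subject = "a" * n + "!"          # no trailing "b": every split fails
    start = time.perf_counter()
    assert PATHOLOGICAL.match(subject) is None
    return time.perf_counter() - start

# Each extra "a" roughly doubles the work, so even short inputs blow up.
for n in (12, 16, 20):
    print(f"n={n:2d}  {time_failing_match(n):.5f}s")
```

Regex engines with linear-time guarantees (RE2-style automata) avoid this class of failure entirely, at the cost of dropping features like backreferences.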

Simulated rules run on same nodes as enforced rules

This particular change was to be deployed in “simulate” mode where real customer traffic passes through the rule but nothing is blocked. We use that mode to test the effectiveness of a rule and measure its false positive and false negative rate. But even in the simulate mode the rules actually need to execute and in this case the rule contained a regular expression that consumed excessive CPU.

Failure mode prevented access to internal services

But getting to the global WAF [web application firewall] kill was another story. Things stood in our way. We use our own products and with our Access service down we couldn’t authenticate to our internal control panel … And we couldn’t get to other internal services like Jira or the build system.

Security feature disables credentials for infrequent use for an operator interface

[O]nce we were back we’d discover that some members of the team had lost access because of a security feature that disables their credentials if they don’t use the internal control panel frequently

Bypass mechanisms not frequently used

And we couldn’t get to other internal services like Jira or the build system. To get to them we had to use a bypass mechanism that wasn’t frequently used (another thing to drill on after the event). 

WAF changes are deployed globally

The diversity of Cloudflare’s network and customers allows us to test code thoroughly before a release is pushed to all our customers globally. But, by design, the WAF doesn’t use this process because of the need to respond rapidly to threats … Because WAF rules are required to address emergent threats they are deployed using our Quicksilver distributed key-value (KV) store that can push changes globally in seconds

The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.

The fact that WAF changes can only be done globally exacerbated the incident by increasing the size of the blast radius.

WAF implemented in Lua, which uses PCRE

Cloudflare makes use of Lua extensively in production … The Lua WAF uses PCRE internally and it uses backtracking for matching and has no mechanism to protect against a runaway expression.

The regular expression engine being used didn’t have complexity guarantees.

Based on the writeup, it sounds like they used the PCRE regular expression library because PCRE is the regex library that ships with Lua, and Lua is the language they use to implement the WAF.
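The missing protection can be made concrete. As a rough sketch (invented for illustration, not Cloudflare’s actual mechanism): run the match under a deadline in a child process, and kill it if the engine is still backtracking when time runs out. Production engines expose far cheaper knobs (PCRE, for example, has a match limit setting), but the heavyweight version below shows the idea:

```python
import multiprocessing
import re

# Hypothetical runaway-expression guard, for illustration only. Uses the
# Unix "fork" start method, so this sketch won't run on Windows.
_ctx = multiprocessing.get_context("fork")

def _match_worker(pattern: str, subject: str, out) -> None:
    out.put(re.match(pattern, subject) is not None)

def safe_match(pattern: str, subject: str, timeout: float = 0.2):
    """Return True/False if the match finished, or None if the engine
    was still backtracking when the deadline expired."""
    out = _ctx.Queue()
    proc = _ctx.Process(target=_match_worker, args=(pattern, subject, out))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():            # deadline hit: the regex ran away
        proc.kill()
        proc.join()
        return None
    return out.get()

print(safe_match(r"(a+)+b", "aaab"))          # completes instantly
print(safe_match(r"(a+)+b", "a" * 40 + "!"))  # killed mid-backtrack
```

A per-match child process is far too expensive for a real WAF; in practice you would want the limit inside the engine itself, which is exactly what complexity guarantees or match limits provide.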

Protection accidentally removed by a performance improvement refactor

A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.

Cloudflare dashboard and API are fronted by the WAF

Our customers were unable to access the Cloudflare Dashboard or API because they pass through the Cloudflare edge.

Mitigators

Paging alert quickly identified a problem

At 13:42 an engineer working on the firewall team deployed a minor change to the rules for XSS detection via an automatic process. Three minutes later the first PagerDuty page went out indicating a fault with the WAF. This was a synthetic test that checks the functionality of the WAF (we have hundreds of such tests) from outside Cloudflare to ensure that it is working correctly. This was rapidly followed by pages indicating many other end-to-end tests of Cloudflare services failing, a global traffic drop alert, widespread 502 errors and then many reports from our points-of-presence (PoPs) in cities worldwide indicating there was CPU exhaustion.

Engineers recognized high severity based on alert pattern

This pattern of pages and alerts, however, indicated that something gravely serious had happened, and SRE immediately declared a P0 incident and escalated to engineering leadership and systems engineering.

Existence of a kill switch

At 14:02 the entire team looked at me when it was proposed that we use a ‘global kill’, a mechanism built into Cloudflare to disable a single component worldwide.

Risks

Declarative program performance is hard to reason about

Regular expressions are examples of declarative programs (SQL is another good example of a declarative programming language). Declarative programs are elegant because you can specify what the computation should do without needing to specify how the computation can be done.

The downside is that it’s impossible to look at a declarative program and understand the performance implications, because there isn’t enough information in a declarative program to let you know how it will be executed! You have to be familiar with how the interpreter/compiler works to understand the performance implications of a declarative program. Most programmers probably don’t know how regex libraries are implemented.

Simulating in production environment

For rule-based systems, it’s enormously valuable for an engineer to be able to simulate what effect the rules will have before they’re put into effect, as it is generally impossible to reason about their impacts without doing simulation.

The more realistic the simulation is, the more confidence we have that the results of the simulation will correspond to the actual results when the rules are enabled in production.

However, there is always a risk of doing the simulation in the production environment, because the simulation is a type of change, and all types of change carry some risk.
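The key point from the contributing factors above is that a simulated rule still executes against real traffic; it just never blocks. That can be sketched as a toy rule engine (all names here are invented, not Cloudflare’s):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    matches: Callable[[str], bool]   # the (possibly expensive) predicate
    simulate: bool = False           # log-only: observe, don't enforce

def evaluate(rules: list[Rule], request: str) -> bool:
    """Return True if the request should be blocked."""
    blocked = False
    for rule in rules:
        if rule.matches(request):        # runs even in simulate mode,
            if rule.simulate:            # so it still consumes CPU
                print(f"[simulate] {rule.name} would block: {request!r}")
            else:
                blocked = True
    return blocked

rules = [
    Rule("xss-basic", lambda r: "<script>" in r),
    Rule("xss-candidate", lambda r: "onerror=" in r, simulate=True),
]
print(evaluate(rules, "GET /?q=<script>alert(1)</script>"))  # True
print(evaluate(rules, "GET /?img=x onerror=alert(1)"))       # False
```

Note that `rule.matches(request)` is evaluated regardless of the `simulate` flag, which is why a pathological predicate exhausts CPU even when nothing is being blocked.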

Severe outages happen infrequently

[S]ome members of the team had lost access because of a security feature that disables their credentials if they don’t use the internal control panel frequently.

To get to [internal services] we had to use a bypass mechanism that wasn’t frequently used (another thing to drill on after the event). 

The irony is that when we only encounter severe outages infrequently, we don’t have the opportunity to exercise the muscles we need to use when these outages do happen.

Large blast radius

The SOP allowed a non-emergency rule change to go globally into production without a staged rollout.

In the future, it sounds like non-emergency rule changes will be staged at Cloudflare. But the functionality for global changes will still exist, because it needs to be there for emergency rule changes. They can reduce the number of changes that need to get pushed globally, but they can’t drive it down to zero. This is an inevitable risk tradeoff.

Questions

Why hasn’t this happened before?

You’re not generally supposed to ask “why” questions, but I can’t resist this one. Why did it take so long for this failure mode to manifest? Hadn’t any of the engineers at Cloudflare previously written a rule that used a regex with pathological backtracking behavior? Or was it that refactor that removed their protection from excessive CPU load in the case of regex backtracking?

What was the motivation for the refactor?

A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.

What was the reason they were trying to make the WAF use less CPU? Were they trying to reduce the cost by running on fewer nodes? Were they just trying to run cooler to reduce the risk of running out of CPU? Was there some other rationale?

What’s the history of WAF rule deployment implementation?

The SOP allowed a non-emergency rule change to go globally into production without a staged rollout. [emphasis added]

The WAF rule system is designed to support quickly pushing out rules globally to protect against new attacks. However, not all of the rules require quick global deployment. Yet, this was the only mechanism that the WAF system supported, even though code changes support staged rollout.

The writeup simply mentions this as a contributing factor, but I’m curious as to how the system came to be that way. For example, was it originally designed with only quick rule deployment in mind? Were staged code deploys introduced into Cloudflare only after the WAF system was built?

Other interesting notes

The WAF rule update was normal work

At 13:42 an engineer working on the firewall team deployed a minor change to the rules for XSS detection via an automatic process. 

Based on the writeup, it sounds like this was a routine change. It’s important to keep in mind that incidents often occur as a result of normal work.

Multiple sources of evidence to diagnose CPU issue

The Performance Team pulled live CPU data from a machine that clearly showed the WAF was responsible. Another team member used strace to confirm. Another team saw error logs indicating the WAF was in trouble.

It was interesting to read how they triangulated on high CPU usage using multiple data sources.

Normative language in the writeup

Emphasis added in bold.

We know how much this hurt our customers. We’re ashamed it happened.

The rollback plan required running the complete WAF build twice taking too long.

The first alert for the global traffic drop took too long to fire.

We didn’t update our status page quickly enough.

Normative language is one of the three analytical traps in accident investigation. If this were an internal writeup, I would avoid the language criticizing the rollback plan, the alert configuration, and the status page decision, and instead I’d ask questions about how these came to be, such as:

Was this the first time the rollback plan was carried out? (If so, that may explain the reason why it wasn’t known how long it would take).

Is the global traffic drop alert configuration typical of the other alerts, or different? If it’s different (do other alerts fire faster?), what led to it being different? If it’s similar to other alert configurations, that would explain why it was configured to be “too long”.

Work to reduce CPU usage contributed to excessive CPU usage

A protection that would have helped prevent excessive CPU use by a regular expression was removed by mistake during a refactoring of the WAF weeks prior—a refactoring that was part of making the WAF use less CPU.

These sorts of unintended consequences are endemic when making changes within complex systems. It’s an important reminder that the interventions we implement to prevent yesterday’s incidents from recurring may contribute to completely new failure modes tomorrow.