The Equifax breach report

The House Oversight and Government Reform Committee released a report on the big Equifax data breach that happened last year. In a nutshell, a legacy application called ACIS contained a known vulnerability that attackers used to gain access to internal Equifax databases.

A very brief timeline is:

  • Day 0: (3/7/17) Apache Struts vulnerability CVE-2017-5638 is publicly announced
  • Day 1: (3/8/17) US-CERT sends an alert to Equifax about the vulnerability
  • Day 2: (3/9/17) Equifax’s Global Threat and Vulnerability Management (GTVM) team posts to an internal mailing list about the vulnerability and requests that app owners patch within 48 hours
  • Day 37: (4/13/17) Attackers exploit the vulnerability in the ACIS app

The report itself is… frustrating. There is some good content here. The report lays out multiple factors that enabled the breach, including:

  • A scanner that was run but missed the vulnerable app because of the directory that the scan ran in
  • An expired SSL certificate that prevented Equifax from detecting malicious activity
  • The legacy nature of the vulnerable application (originally implemented in the 1970s)
  • A complex IT environment that was the product of multiple acquisitions
  • An organizational structure where the chief security officer and the chief information officer were in separate reporting structures

The last bullet, about the unconventional reporting structure for the chief security officer, along with the history of that structure, was particularly insightful. It would have been easy to leave out this sort of detail in a report like this.

On the other hand, the report exhibits some weapons-grade hindsight bias. To wit:

 Equifax, however, failed to implement an adequate security program to protect this sensitive data. As a result, Equifax allowed one of the largest data breaches in U.S. history. Such a breach was entirely preventable.

Equifax failed to fully appreciate and mitigate its cybersecurity risks. Had the company taken action to address its observable security issues prior to this cyberattack, the data breach could have been prevented.

Page 4

Equifax knew its patch management process was ineffective. The 2015 Patch Management Audit concluded “vulnerabilities were not remediated in a timely manner,” and “systems were not patched in a timely manner.” In short, Equifax recognized the patching process was not being properly implemented, but failed to take timely corrective action.

Page 80

The report highlights a number of issues that, if they had been addressed, would have prevented or mitigated the breach, including:

Lack of a clear owner of the vulnerable application. An email went out announcing the vulnerability, but nobody took action to patch the vulnerable app.

Lack of a comprehensive asset inventory. The company did not have a database they could query to check whether any published vulnerability applied to the applications in use.

Lack of network segmentation in the environment where the vulnerable app ran. The vulnerable app ran on a network that was not segmented from unrelated databases. Once the app was compromised, it was used as a vector to reach these other databases.

Lack of file integrity monitoring (FIM). FIM could have detected malicious activity, but it wasn’t in place.

Not prioritizing retiring the legacy system. This one is my favorite. From the report: “Equifax knew about the security risks inherent in its legacy IT systems, but failed to prioritize security and modernization for the ACIS environment”.

Use of NFS. The vulnerable system had an NFS mount that allowed the attackers to access a number of files.

Frustratingly, the report does not go into any detail about how the system got into this state. It simply lays these deficiencies out like an indictment for criminal negligence. Look at all of these deficiencies! They should have known better! Even worse, they did know better and didn’t act!

In particular, the report doesn’t dig enough into the communication breakdown that resulted in ACIS not being patched. Here’s an exchange that gets close:

To determine who was responsible for applying the Apache Struts patch to the ACIS system, the Committee asked [former Senior Vice President and Chief Information Officer for Global Corporate Platforms Graeme] Payne to identify employees by the roles listed within the Patch Management Policy. Specifically, the Committee asked him to identify the business owner, system owner, and application owner responsible for the ACIS system. Payne testified:

Q. So the application owner for ACIS would have been who or what organization?
A. So I don’t believe there was any explicit designation of application owners. If you ask me who I think the application owner would be, I can probably answer that.

Q. That would be good.
A. So I believe – in my view, the application owner for ACIS – for the online dispute portal component because that was a component – was [Equifax IT Employee 1] and probably also [Equifax IT Employee 2]. So again, I don’t believe there were any specific designations, so these would be – if someone asked me, “Who do you think they would be?” that would probably be the two people I would look at.

**

Q. So would they have been the people that should have received the GTVM email saying you need to patch?
A. Yes, as well as the system owner.

Q. Okay. Who’s the system owner?
A. So again, those people weren’t designated. So I can –

Q. Tell me who you think?
A. My guess would be that the system owner would be someone in the infrastructure group probably under [Equifax IT Employee 3], since…as part of the global platform services group, his team ran the sort of the server operations

***

Q. If you look at the definition . . . it says: System owner is responsible for applying patch to electronic assets.

So would it be the case that [Equifax IT Employee 3] would have been the one responsible for actually applying the patch to ACIS?
A. Possibly. Again, we are talking at a level that I wasn’t involved in, so I can’t talk specifically about…who actually

Pages 65-66

Alas, the committee doesn’t seem to have interviewed any of the ICs referenced as Equifax IT Employees 1-3. How did they understand ownership to work? There’s also no context here about how ownership generally works inside Equifax. Was ACIS a special case, or was it typical?

There was also a theme that anyone who has worked on a software project would recognize:

[Former Chief Security Officer Susan] Mauldin stated Equifax was in the process of making the ACIS application Payment Card Industry (PCI) Data Security Standard (DSS) compliant when the data breach occurred.

Mauldin testified the PCI DSS implementation “plan fell behind and these items did not get addressed.” She stated:

A. The PCI preparation started about a year before, but it’s very complex. It was a very complex – very complex environment.

Q. A year before, you mean August 2016?

A. Yes, in that timeframe.

Q. And it was scheduled to be complete by August 2017?

A. Right.

Q. But it fell behind?

A. It fell behind.

Q. Do you know why?

A. Well, what I recall from the application team is that it was very complicated, and they were having – it just took a lot longer to make the changes than they thought. And so they just were not able to get everything ready in time.

Pages 80-81

And, along the same lines:

So there were definitely risks associated with the ACIS environment that we were trying to remediate and that’s why we were doing the CCMS upgrade.

It was just – it was time consuming, it was risky . . . and also we were lucky that we still had the original developers of the system on staff.


So all of those were risks that I was concerned about when I came into this role. And security was probably also a risk, but it wasn’t the primary driver. The primary driver was to get off the old system because it was just hard to manage and maintain.

Graeme Payne, former Senior Vice President and Chief Information Officer for Global Corporate Platforms, page 82

Good luck finding a successful company that doesn’t face similar issues.

Finally, in a beautiful example of scapegoating, there’s the Senior VP that Equifax fired, ostensibly for failing to forward an email that had already been sent to an internal mailing list. In the scapegoat’s own words:

To assert that a senior vice president in the organization should be forwarding vulnerability alert information to people . . . sort of three or four layers down in the organization on every alert just doesn’t hold water, doesn’t make any sense. If that’s the process that the company has to rely on, then that’s a problem.

Graeme Payne, former Senior Vice President and Chief Information Officer for Global Corporate Platforms, page 51

The Tortoise and the Hare in TLA+

Problem: Determine if a linked list contains a cycle, using O(1) space.

Robert Floyd invented an algorithm for solving this problem in the late 60s, which is called “The Tortoise and the Hare”[1].

(This is supposedly a popular question to ask in technical interviews. I’m not a fan of expecting interviewees to re-invent the algorithms of famous computer scientists on the spot).

As an exercise, I implemented the algorithm in TLA+, using PlusCal.

The algorithm itself is very simple: the hardest part was deciding how to model linked lists in TLA+. We want to model the set of all linked lists, to express that the algorithm should work for any element of the set. However, the model checker can only work with finite sets. The typical approach is to do something like “the set of all linked lists which contain up to N nodes”, and then run the checker against different values of N.

What I ended up doing was generating N nodes, giving each node a successor (a next element, which could be the terminal value NIL), and then selecting the start of the list from the set of nodes:

CONSTANTS N

ASSUME N \in Nat

Nodes == 1..N

NIL == CHOOSE NIL : NIL \notin Nodes

start \in Nodes
succ \in [Nodes -> Nodes \union {NIL}]

(Note: I originally defined NIL as a constant, but this is incorrect, see the comment by Ron Pressler for more details).

This is not the most efficient approach from the model checker’s point of view, because it will also generate nodes that are irrelevant: nodes that aren’t reachable from the start of the list. However, it does generate all possible linked lists up to length N.
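
To put a number on that (my own back-of-the-envelope arithmetic, not from the post): with N = 3 there are 3 choices of start and 4^3 = 64 possible succ functions, so the model checker starts from 3 × 64 = 192 initial states, even though many of them differ only in nodes the list never reaches.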

Superficially, this doesn’t look like the pointer-and-structure linked list you’d see in C, but it behaves the same way at a high level. It’s possible to model a memory address space and pointers in TLA+, but I chose not to do so.

In addition, the nodes of a linked list typically have an associated value, but Floyd’s algorithm doesn’t use this value, so I didn’t model it.

Here’s my implementation of the algorithm:

EXTENDS Naturals

CONSTANTS N

ASSUME N \in Nat

Nodes == 1..N

NIL == CHOOSE NIL : NIL \notin Nodes

(*
--fair algorithm TortoiseAndHare

variables
    start \in Nodes,
    succ \in [Nodes -> Nodes \union {NIL}],
    cycle, tortoise, hare, done;
begin
h0: tortoise := start;
    hare := start;
    done := FALSE;
h1: while ~done do
        h2: tortoise := succ[tortoise];  \* the tortoise moves one step
            hare := LET hare1 == succ[hare] IN  \* the hare moves two steps,
                    IF hare1 \in DOMAIN succ THEN succ[hare1] ELSE NIL;  \* stopping at NIL if it runs off the list
        h3: if tortoise = NIL \/ hare = NIL then  \* reached the end of the list: no cycle
                cycle := FALSE;
                done := TRUE;
            elsif tortoise = hare then  \* the hare caught the tortoise: cycle detected
                cycle := TRUE;
                done := TRUE;
            end if;
    end while;

end algorithm
*)

I wanted to use the model checker to verify that the implementation was correct:

PartialCorrectness == pc="Done" => (cycle <=> HasCycle(start))

(See the Correctness wikipedia page for why this is called “partial correctness”).
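
This property can be checked as an invariant in TLC. A minimal configuration file might look like the following sketch (my own, not from the post; it assumes the PlusCal translation produces the usual Spec formula and sets N to a small value):

SPECIFICATION Spec
CONSTANT N = 3
INVARIANT PartialCorrectness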

To check correctness, I needed to implement my HasCycle operator (without resorting to Floyd’s algorithm). I used the transitive closure of the successor function for this, which is called TC here. If the transitive closure contains the pair (node, NIL), then the list that starts with node does not contain a cycle:

HasCycle(node) == LET R == {<<s, t>> \in Nodes \X (Nodes \union {NIL}): succ[s] = t }
                  IN <<node, NIL>> \notin TC(R)

The above definition is “cheating” a bit, since we’ve defined a cycle by what it isn’t. It also wouldn’t work if we allowed infinite lists, since those have neither cycles nor NIL nodes.

Here’s a better definition: a linked list that starts with node contains a cycle if there exists some node n that is reachable from node and from itself.

HasCycle2(node) ==
  LET R == {<<s, t>> \in Nodes \X (Nodes \union {NIL}): succ[s] = t }
  IN \E n \in Nodes : /\ <<node, n>> \in TC(R) 
                      /\ <<n, n>> \in TC(R)
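
As a concrete illustration (my own, not part of the spec), take N = 3 and start = 1:

succ = [n \in Nodes |-> IF n = 3 THEN 1 ELSE n + 1]    \* 1 -> 2 -> 3 -> 1, so HasCycle2(1) = TRUE
succ = [n \in Nodes |-> IF n = 3 THEN NIL ELSE n + 1]  \* 1 -> 2 -> 3 -> NIL, so HasCycle2(1) = FALSE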

To implement the transitive closure in TLA+, I used an existing implementation from the TLA+ repository itself:

TC(R) ==
  LET Support(X) == {r[1] : r \in X} \cup {r[2] : r \in X}
      S == Support(R)
      RECURSIVE TCR(_)
      TCR(T) == IF T = {} 
                  THEN R
                  ELSE LET r == CHOOSE s \in T : TRUE
                           RR == TCR(T \ {r})
                       IN  RR \cup {<<s, t>> \in S \X S : 
                                      <<s, r>> \in RR /\ <<r, t>> \in RR}
  IN  TCR(S)
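
As a quick sanity check of my own (not from the post), the closure of a two-edge relation picks up the implied edge:

TC({<<1, 2>>, <<2, 3>>}) = {<<1, 2>>, <<2, 3>>, <<1, 3>>}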

The full model is in the lorin/tla-tortoise-hare repo on GitHub.


  1. Thanks to Reginald Braithwaite for the reference in his excellent book JavaScript Allongé.

Death of a pledge as systems failure

Caitlin Flanagan has written a fantastic and disturbing piece for the Atlantic entitled Death at a Penn State Fraternity.

This line really jumped out at me:

Fraternities do have a zero-tolerance policy regarding hazing. And that’s probably one of the reasons Tim Piazza is dead.

The official policy of the fraternities is that hazing is forbidden. Because this is the official policy, it is the individuals in a particular frat house that are held responsible if hazing happens, not the national fraternity organization.

This policy has had the effect of insulating the organizations from being liable, but it hasn’t stopped hazing from being widespread: according to Flanagan, 80% of fraternity members report being hazed.

Because individual fraternity members are the ones on the hook if something goes wrong during hazing, reporting an injury carries risk, which means the member must make a decision involving a tradeoff. In the case documented above, that tradeoff led to a nineteen-year-old dying of his injuries.

This example really reinforces ideas around systems thinking: the introduction of the zero-tolerance policy did not have the intended effect. Because the culture of hazing remains, the policy ended up making things worse.

Assumption of rationality

Matthew Reed wrote a post about Lisa Servon’s book “The Unbanking of America”. This comment stood out for me (emphasis mine):

By treating her various sources as intelligent people responding rationally to their circumstances, rather than as helpless victims of evil predators, [Servon] was able to stitch together a pretty good argument for why people make the choices they make.

In its approach, it reminded me a little of Tressie McMillan Cottom’s “Lower Ed” or Matthew Desmond’s “Evicted.”  In their different ways, each book addresses a policy question that is usually framed in terms of smart, crafty, evil people taking advantage of clueless, ignorant, poor people, and blows up the assumption.  In no case are predators let off the hook, but the “prey” are actually (mostly) capable and intelligent people doing the best they can.  Understanding why this is the best they can do, and what would give them better options, leads to a very different set of prescriptions.

Sidney Dekker calls this perspective the local rationality principle. It assumes that people make decisions that are reasonable given the constraints that they are working within, even though from the outside those decisions appear misguided.

I find this assumption of rationality to be a useful frame for explaining individual behavior. It’s worth putting in the effort to identify why a particular decision would have seemed rational within the context in which it was made.

A conjecture on why reliable systems fail

Even highly reliable systems go down occasionally. After reading over the details of several incidents, I’ve started to notice a pattern, which has led me to the following conjecture:

Once a system reaches a certain level of reliability, most major incidents will involve:

  • A manual intervention that was intended to mitigate a minor incident, or
  • Unexpected behavior of a subsystem whose primary purpose was to improve reliability

Here are three examples from Amazon’s post-mortem write-ups of major AWS outages:

The S3 outage on February 28, 2017 involved a manual intervention to debug an issue that was causing the S3 billing system to progress more slowly than expected.

The DynamoDB outage on September 20, 2015 (which also affected SQS, auto scaling, and CloudWatch) involved healthy storage servers taking themselves out of service by executing a distributed protocol that was (presumably) designed that way for fault tolerance.

The EBS outage on October 22, 2012 (which also affected EC2, RDS, and ELBs) involved a memory leak bug in an agent that monitors the health of EBS servers.

Engineering research reveals wrongdoing

The New York Times has a story today, Inside VW’s Campaign of Trickery, about how Volkswagen conspired to hide its excessive diesel emissions from California regulators.

What was fascinating to me was that the emission violations were discovered by mechanical engineering researchers at West Virginia University, Dan Carder, Hemanth Kappanna, and Marc Besch (Kappanna and Besch were graduate students at the time).

The presence of high levels of lead in Flint, Michigan drinking water was also discovered by an engineering researcher: Marc Edwards, a civil engineering professor at Virginia Tech.

It’s a reminder that regulators alone aren’t sufficient to ensure safety, and that academic engineering research can have a real impact on society.