ISSUE #8 - Mar 26, 2017
This issue focuses on the how and why of postmortems. We touched on catastrophic failures in an earlier newsletter:
Issue #5 - March 5, 2017 // Topic: Catastrophic Failures
What is one to do if the catastrophe has already occurred? Learn from it by running a postmortem analysis.
Do not look for a single root cause. A catastrophe is never caused by one person or one software bug, but by a combination of factors, including your organizational processes, that jointly contribute to the disastrous outcome:
If you’re not familiar with the Swiss Cheese model of accident causation, it’s worth a look:
This great slide deck covers accident causality, RCA (Root Cause Analysis) principles such as “what you look for is what you find” (beware!), and types of accident models that go beyond Swiss Cheese, including non-linear ones. It also poses a question that was new to me: do we learn better from rare accidents or from daily safe operation, and should we aim to make the system “more safe” or “less unsafe”?
To quote from the deck above, “The purpose of learning (from accidents, etc.) is to change behaviour so that certain outcomes become more likely and other outcomes less likely”. This should be the ultimate goal of your postmortem analysis.
Many advocate “The 5 Whys” approach to running a postmortem, with the aim of getting from proximate causes to root causes of the incident. The 5 Whys comes from a Toyota tradition:
Eric Ries of “The Lean Startup” picked it up and ran with it as a tool for finding human causes of technical problems:
If you prefer Eric on video, HBR has one:
Example of using The 5 Whys in a startup postmortem, by Buffer:
More recently, GitLab used The 5 Whys in the postmortem of their data loss outage:
The danger of a naive application of The 5 Whys method lies in digging deeply to a single root cause rather than building a broad understanding of multiple contributing factors. An alternative is a debriefing that asks “how?” rather than “why?” (the post links to tutorials at the end).
Note that the above critique is different from “5 whys and 5 hows” where the hows are just an inversion of the whys: “Why did the system fail? - Because the database was overloaded. How do we address that? - Let’s add more DB capacity”. That type of “hows” is simply a way to express action items from the postmortem. You will find it in some posts that I’m not quoting here as they tend to be surface level.
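To make the contrast concrete, here is a minimal Python sketch of a postmortem record that keeps several contributing factors side by side, each with its own action items, instead of one chain of whys ending at a single root cause. Everything in it is an assumption made for illustration: the class and field names (Postmortem, ContributingFactor, ActionItem) are hypothetical and do not come from any of the linked posts or tools, and the sample incident only extends the overloaded-database example above with invented factors.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str = "unassigned"


@dataclass
class ContributingFactor:
    category: str            # hypothetical labels, e.g. "capacity", "process"
    how_it_contributed: str  # answers "how?", not "who?"
    action_items: List[ActionItem] = field(default_factory=list)


@dataclass
class Postmortem:
    title: str
    factors: List[ContributingFactor] = field(default_factory=list)

    def categories(self) -> List[str]:
        """Distinct factor categories, in first-seen order."""
        seen: List[str] = []
        for factor in self.factors:
            if factor.category not in seen:
                seen.append(factor.category)
        return seen

    def all_action_items(self) -> List[ActionItem]:
        """Flatten the action items of every factor into one list."""
        return [item for factor in self.factors for item in factor.action_items]

    def is_single_cause(self) -> bool:
        """True when the analysis stopped at one factor (or none)."""
        return len(self.factors) <= 1


def build_example() -> Postmortem:
    # Invented sample data; only the overloaded database comes from the text.
    return Postmortem(
        title="Site outage (illustrative)",
        factors=[
            ContributingFactor(
                category="capacity",
                how_it_contributed="The database was overloaded.",
                action_items=[ActionItem("Add more DB capacity")],
            ),
            ContributingFactor(
                category="monitoring",
                how_it_contributed="No alert fired before the database saturated.",
                action_items=[ActionItem("Alert on database load")],
            ),
            ContributingFactor(
                category="process",
                how_it_contributed="The release skipped load testing.",
                action_items=[ActionItem("Add a load test to the release checklist")],
            ),
        ],
    )


if __name__ == "__main__":
    pm = build_example()
    print(pm.title)
    for category in pm.categories():
        print("factor:", category)
    for item in pm.all_action_items():
        print("action:", item.description, "/", item.owner)
```

A record where is_single_cause() returns True is a hint that the analysis may have stopped too early. The check is deliberately crude; it counts factors and says nothing about their quality.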
SRECon presentation that cautions against homing in on “bad software” (or hardware) in your root cause analysis:
Good tips on structuring your postmortems:
And if you’d like a ready-made tool to manage postmortems, Etsy’s got Morgue:
What happened to the GitLab engineer who served as the proximate cause of their data loss incident? He received lots of support from the company and the wider community, and is doing well. A great reminder to fix your process and not blame your people after an outage/catastrophe:
A most amusing time waster: “The Daily WTF is your how-not-to guide for developing software” and the premier place to post funny bugs you encounter when using software. Now I wish I had kept a screenshot of my kids’ public school enrollment form with the question “Are you Hispanic or Latino?” and the non-binary answer choices “Yes / No / Other”…
Emoji URL shortener! Here it is, linking back to Tester’s Digest. Those emoji are :bridge_at_night: :no_bicycles: :chipmunk:
If you received this email directly, then you’re already signed up, thanks! If this newsletter issue was forwarded to you and you’d like to get one weekly, you can subscribe at http://testersdigest.mehras.net
If you come across content worth sharing, please send me a link at email@example.com