ISSUE #5 - Mar 5, 2017
This issue will focus on the nature of catastrophe and its prevention.
A great “short treatise on the Nature of Failure” that you’d swear was written by an engineer for engineers. It’s actually by a medical doctor, published nearly 20 years ago in the context of patient diagnosis. If this link ever goes dead, just google “How Complex Systems Fail” by Richard I. Cook, MD.
If you’d like a discourse more targeted at software, here is an overview of “Why Systems Fail” with industry examples (also older, circa 2000):
Related line of thought in a recent post on “Black Swans”: the really bad things that seem improbable but have an outsized impact and get rationalized after the fact…
Can we find the critical bugs that trigger real-life catastrophe? Yes, says this review of a few distributed systems (such as Cassandra and HDFS). 92% of catastrophic failures resulted from incorrect handling of simple error conditions. A third of those could have been found by a static code checker, and another third by code inspection with a little knowledge of the system. Also, 77% of production failures can be reproduced by a unit test, and 98% on a system with at most 3 nodes. Critical bugs are not that hard to uncover…
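To make the static checker point concrete, here is a minimal sketch in Python of what such a check might look like. It is not the tool used in the review; the two patterns it flags (an empty handler, and a handler still carrying a TODO/FIXME note) and the sample functions are assumptions picked for illustration.

```python
import ast

# Hypothetical code under review: three functions, two of which mishandle errors.
SAMPLE = '''
def replicate(block, peers):
    for peer in peers:
        try:
            peer.send(block)
        except IOError:
            pass

def flush(log):
    try:
        log.sync()
    except Exception:
        # TODO: handle this properly
        log.close()

def read(path):
    try:
        return open(path).read()
    except FileNotFoundError as err:
        raise RuntimeError("missing " + path) from err
'''


def find_swallowed_errors(source):
    """Return sorted (line, reason) pairs for suspicious exception handlers."""
    lines = source.splitlines()
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # Pattern 1: the handler body is nothing but `pass`.
        if all(isinstance(stmt, ast.Pass) for stmt in node.body):
            findings.append((node.lineno, "empty handler"))
            continue
        # Pattern 2: comments are not in the AST, so scan the handler's raw text.
        # (end_lineno needs Python 3.8 or newer.)
        text = "\n".join(lines[node.lineno - 1:node.end_lineno])
        if "TODO" in text or "FIXME" in text:
            findings.append((node.lineno, "TODO/FIXME in handler"))
    return sorted(findings)


if __name__ == "__main__":
    for line, reason in find_swallowed_errors(SAMPLE):
        print(f"line {line}: {reason}")
```

Running it flags the first two sample functions and leaves the third alone, since that one re-raises the error with context. A real checker would need many more patterns, but even a crude pass like this is cheap enough to run on every commit.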
How bad can it get due to software bugs? Catastrophe porn:
The Role of Software in Recent Catastrophic Accidents, from the 2009 IEEE Reliability Society report, covers airport shutdowns, plane crashes, NASA disasters, the US/Canadian blackout of 2003 (hey, I was there!), and more:
20 famous software disasters, covering 1962 to 2005 and starting with the Mariner 1 rocket, with cost / summary / cause for each:
11 “epic failures”, overlapping with the above but described in more detail: Mariner 1 again (punctuation matters), Mars Climate Orbiter 1998 (watch your units), the Pentium floating point flaw of 1994 (why it matters to fix a rare bug reported by a single user), and more:
Another angle on the famous older incidents / accidents, including Therac-25:
Award-worthy software failures of 2016. These guys also have older blog entries for prior years, and a detailed report available for download, which they apparently compile by searching English language news for verbiage on “software bugs”. They cover magnitude of the incidents, but don’t go into root causes.
What caused the AWS S3 outage on Feb 28th, Amazon’s own summary:
If you prefer a less dry retelling, The Verge will tell you it was due to a typo. Keep in mind, however, that catastrophe never has a single cause…
It’s not all doom and gloom. For those who still remember the Mad Gadget vulnerability in Apache Commons Collections back in late 2015, good news! Google employees organized a volunteer initiative, called Operation Rosehub, to find open source projects that were still affected, and submitted PRs to fix all 2600 of them. The world is more secure thanks to those folks.
When you are frustrated over this bug or that outage, head for the anger rooms!
If you didn’t get your fill of software disasters, Reddit will happily provide a daily dose:
If you received this email directly then you’re already signed up, thanks! If this newsletter issue was forwarded to you instead and you’d like to get one weekly, you can subscribe at http://testersdigest.mehras.net
If you come across content worth sharing, please send me a link at email@example.com