Friday, April 16, 2010

Seven consecutive errors = A Catastrophe (Improving testing)

"A typical accident takes seven consecutive errors," writes Malcolm Gladwell in his book "Outliers". As always, Gladwell's books are a fascinating read. In the chapter "The Theory of Plane Crashes", he analyses airplane disasters and argues that a catastrophe results from a series of small errors. "Plane crashes are much more likely to be a result of an accumulation of minor difficulties and seemingly trivial malfunctions," says Gladwell. The other example he cites is the famous Three Mile Island accident (the nuclear power station disaster of 1979). The plant came close to meltdown as the result of seven consecutive errors: (1) a blockage in a giant water filter (2) caused moisture to leak into the plant's air system, which (3) inadvertently tripped two valves and (4) shut down the flow of cold water into the generator; (5) the backup system's cooling valves were closed, a human mistake; (6) the indicator in the control room showing that they were closed was blocked by a repair tag; and (7) another backup system, a relief valve, was not working.

The same notion appears in the book "Ubiquity" by Mark Buchanan. He states that systems have a natural tendency to organize themselves into what is called the "critical state", which Buchanan describes as the "knife-edge of stability". Once a system reaches the critical state, all it takes is a small nudge to create a catastrophe.

Now, as a person interested in breaking software and uncovering defects, I am curious to understand how I can test better. When we test a piece of software or a system, we look forward to uncovering "critical" defects. When an error is injected into a system and propagates all the way through, it results in a failure. Not all failures are critical: some are irritating deviations, while others can be catastrophic. If a simple injected error results in a critical failure, we are lucky! But how the heck do we know that similar catastrophes will not surface over time? Should we not reason along the lines above, i.e. consider consecutive errors, each resulting in only a minor failure, ultimately culminating in a critical failure? Should we not have a different strategy for uncovering such "potential catastrophes"? Do we outline a strategy that is indeed different for simple failures versus potential critical failures? And when can we apply it? Not necessarily only when testing. The same thought process can be applied at the earlier stages of design and coding, by thinking in terms of sequences of errors and understanding what can happen.
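To make the idea a little more concrete, here is a minimal sketch in Python of what such a strategy could look like. Everything in it is an assumption made for illustration: the fault names, the toy `severity` rule and the `critical_chains` helper are hypothetical and only loosely echo the Three Mile Island chain above. It also tries unordered combinations of faults rather than strictly ordered sequences, which is a simplification. The point is only that injecting one fault at a time never reveals the catastrophe, while injecting faults together does.

```python
from itertools import combinations

# Hypothetical minor faults in a toy cooling system (illustrative names only).
FAULTS = (
    "filter_blocked",
    "valve_tripped",
    "backup_valve_closed",
    "indicator_hidden",
    "relief_valve_stuck",
)


def severity(faults):
    """Toy rule: return 'ok', 'minor' or 'critical' for a set of injected faults."""
    faults = set(faults)
    if not faults:
        return "ok"
    # Main cooling is lost only if the filter blocks AND the valves trip.
    main_cooling_lost = {"filter_blocked", "valve_tripped"} <= faults
    # The backup is lost only if its valves are closed AND nobody can see that.
    backup_lost = {"backup_valve_closed", "indicator_hidden"} <= faults
    relief_lost = "relief_valve_stuck" in faults
    if main_cooling_lost and backup_lost and relief_lost:
        return "critical"
    return "minor"


def critical_chains(max_len):
    """Inject every combination of up to max_len faults; collect the critical ones."""
    found = []
    for n in range(1, max_len + 1):
        for combo in combinations(FAULTS, n):
            if severity(combo) == "critical":
                found.append(combo)
    return found


if __name__ == "__main__":
    print("single faults:", critical_chains(1))
    print("all combinations:", critical_chains(len(FAULTS)))
```

In this toy model every single fault is merely "minor", so a test plan that injects one error at a time would report nothing alarming. Only when the search is widened to combinations does the critical chain show up. In a real system the number of combinations grows quickly, so one would probably have to limit the chain length or pick the faults to combine based on design knowledge.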

If you drive in India you know what I mean ... the potential accident due to a dog chasing a cow, which is charging at the guy on the motorbike, who is talking on his cell phone while driving on the wrong side of the road, and who then encounters a "speed bump", and screech *@^%... You avoid him if you are a defensive driver. Alas, we do not apply the same defensive logic often enough in other disciplines like software engineering...

Happy weekend. Be safe.
