Asking Why in IT

Thanks to reader Alex for sending me a detailed article on how one IT system administrator used the five whys to solve a network connectivity problem.

At 3:30 in the morning of January 10th, 2008, a shrill chirping woke up our system administrator, Michael Gorsuch, asleep at home in Brooklyn. It was a text message from Nagios, our network monitoring software, warning him that something was wrong. Michael logged onto his computer in the other room and discovered that one of the three data centers he runs, in downtown Manhattan, was unreachable from the Internet.

After a couple more occurrences, the culprit was identified.

The problem was something with the network switch. Michael temporarily took the switch out of the loop, connecting our router directly to Peer 1's router, and lo and behold, we were back on the Internet. Michael spent some time doing a post-mortem, and discovered that the problem was a simple configuration problem on the switch. The switch that failed had been set to autonegotiate. This usually works, but not always, and on the morning of January 10th, it didn't.

After this experience, he got to thinking about "uptime" in general, and the problem of outlier events.

Internet providers like Peer 1 like to guarantee the uptime of their services in terms of a Service Level Agreement, otherwise known as an SLA. A typical SLA might state something like "99.99% uptime." When you do the math, let's see, there are 525,949 minutes in a year, so that allows them 52.59 minutes of downtime per year. If they have any more downtime than that, the SLA usually provides for some kind of penalty.
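The arithmetic in that passage is easy to check and to repeat for other SLA levels. Here is a minimal sketch; the function name `allowed_downtime_minutes` is my own, and the minutes-per-year figure is the one used in the quote.

```python
# Minutes in a year, as quoted in the article (roughly 365.24 days).
MINUTES_PER_YEAR = 525_949


def allowed_downtime_minutes(uptime_percent, minutes_in_period=MINUTES_PER_YEAR):
    """Downtime budget, in minutes, implied by an uptime percentage."""
    return minutes_in_period * (100 - uptime_percent) / 100


# Compare a few common SLA levels.
for pct in (99.0, 99.9, 99.99):
    budget = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime allows {budget:.2f} minutes of downtime per year")
```

For 99.99% this gives the 52.59 minutes mentioned above. Each extra "nine" cuts the budget by a factor of ten, which is why the last few minutes of uptime get so expensive.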

Keeping internet services online suffers from the problem of black swans. Nassim Taleb, who invented the term, defines it thus: "A black swan is an outlier, an event that lies beyond the realm of normal expectations." Almost all internet outages are unexpected unexpecteds: extremely low-probability outlying surprises. They're the kind of things that happen so rarely it doesn't even make sense to use normal statistical methods like "mean time between failure."

There must be a better way to deal with such events... and he discovered the five whys.

Somewhere between the "extremely unreliable" level of service, where it feels like stupid outages occur again and again and again, and the "extremely reliable" level of service, where you spend millions and millions of dollars getting an extra minute of uptime a year, there's a sweet spot, where all the expected unexpecteds have been taken care of.

To reach this sweet spot, we borrowed an idea from Sakichi Toyoda, the founder of Toyota. He calls it Five Whys. When something goes wrong, you ask why, again and again, until you ferret out the root cause. Then you fix the root cause, not the symptoms.

Applying that methodology, he identified a preventive approach.

Our link to Peer1 NY went down
Why? – Our switch appears to have put the port in a failed state
Why? – After some discussion with the Peer1 NOC, we speculate that it was quite possibly caused by an Ethernet speed / duplex mismatch
Why? – The switch interface was set to auto-negotiate instead of being manually configured
Why? – We were fully aware of problems like this, and have been for many years. But - we do not have a written standard and verification process for production switch configurations.
Why? – Documentation is often thought of as an aid for when the sysadmin isn’t around or for other members of the operations team, whereas, it should really be thought of as a checklist.
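The last two whys point at a written standard that works as a checklist. One way to picture that is a small script that compares each port's settings against the standard. This is only a sketch under my own assumptions: the `STANDARD` values, the port names, and the data layout are all hypothetical, and nothing here is taken from the article or from any real switch's command set.

```python
# Hypothetical written standard for production uplink ports. The specific
# values are assumptions for illustration, not taken from the article.
STANDARD = {"speed": "1000", "duplex": "full"}


def check_port(name, settings, standard=STANDARD):
    """Return a list of human-readable violations for one port."""
    violations = []
    for key, required in standard.items():
        # A missing setting is treated as "auto", i.e. left to negotiation.
        actual = settings.get(key, "auto")
        if actual != required:
            violations.append(
                f"{name}: {key} is {actual!r}, standard requires {required!r}"
            )
    return violations


def check_switch(ports):
    """Check every port in a {port_name: settings} mapping."""
    violations = []
    for name, settings in ports.items():
        violations.extend(check_port(name, settings))
    return violations


# Hypothetical port data, entered by hand or parsed from a config dump.
ports = {
    "uplink-1": {"speed": "auto", "duplex": "auto"},
    "uplink-2": {"speed": "1000", "duplex": "full"},
}

for line in check_switch(ports):
    print(line)
```

The point is less the script than the habit: the standard lives in one place, and every deployment gets reviewed against it before it goes into production.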

"Had we produced a written standard prior to deploying the switch and subsequently reviewed our work to match the standard, this outage would not have occurred," Michael wrote. "Or, it would occur once, and the standard would get updated as appropriate."

Not only are they fixing the root cause, they are telling their customers about the problem and solutions. That creates value through increased confidence.

Instead of setting up a SLA for our customers, we set up a blog where we would document every outage in real time, provide complete post-mortems, ask the five whys, get to the root cause, and tell our customers what we're doing to prevent that problem in the future.

Wouldn't you appreciate a supplier that did this instead of simply filling out corrective action forms, probably documenting the corrective action for a problem that has occurred over and over again?
