Getting to the Rootiest Cause.

June 22, 2010

Filed under: SQLServerPedia Syndication,Technical Articles — Tags: root cause analysis — Michael J. Swart @ 5:00 am

Well it’s self-evaluation time at work and there’s a lot of writing and professional development talk floating around. One of the terms that gets used extremely often is “Root Cause Analysis”. Everyone talks about it and everyone believes they’re on the same page. I’m going to give my take.

First, listen to this conversation between a toddler and a parent on the topic of bedtime…

Toddler: Why can’t you read more?
Parent: Because it’s bedtime.
Toddler: Why?
Parent: Because you need your sleep.
Toddler: Why?
Parent: Otherwise you get cranky tomorrow morning.
Toddler: Why?
etc., etc., etc….

@JoeWebb follows the advice of “asking ‘why’ at least 5 times before you begin thinking about a solution.” By asking why enough times, we try to get to the rootiest cause of the problem.

The trick is finding the reasons for each “why” and knowing when to stop asking “why”. This is a human problem and each situation is different. I don’t think there is any rule of thumb or flowchart that will work for every situation (including the 5 whys). But I can offer a few things I’ve learned through experience.

There May Be More Than One “Because” to a “Why”

There is often more than one answer to a why and we shouldn’t let one reason stop us from looking at other reasons:

Why did the app crash?
Because there was no foreign key relationship between ObjectTable and ContainerTable.
Why?
Because the design wasn’t reviewed.
etc…

What’s missing here are other possible reasons, such as “Because QA didn’t catch it” or “Because the app is sending data it shouldn’t”.

Don’t Kid Yourself

Richard Feynman gave an amazing commencement address in 1974 called Cargo Cult Science. It’s a great read and I recommend it to anyone who has 10 minutes to spare.

In it, he uses the term to criticize certain schools of thought that appear to be doing science by following the same methods, but are missing something fundamental. Feynman says we can avoid cargo cult science by adhering to a kind of “utter honesty” or “scientific integrity”.

Now I think the exact same thing applies to IT, and in particular to troubleshooting problems. The danger is that we settle on a root cause without enough data. Or we might make the unwarranted assumption that, once a fix gets rolled out, the problem will not recur. It sounds like a no-brainer, but the danger is real, especially when egos are at stake.

The sad thing is that people (including myself, sometimes especially myself) who are kidding themselves don’t know they’re kidding themselves.

It’s Not Turtles All the Way Down

Like the toddler playing the why game, it’s easy to keep playing forever. But at some point you have to stop. It’s not turtles all the way down. The trick is knowing when to stop.

Common sense rules here. I think you can stop asking why when the problem moves out of the technical domain or your business domain. If you get to a cause like the following, you can probably stop:
