Metrics That Matter - Part 1 - Mean Time to Repair

I'm writing a series of articles on metrics related to Visible Ops. I'm posting the intro to each, with a link to the full article.

Please leave comments.  Regards.  Gene.

High performers know that 80% of all outages are due to a change, and that 80% of mean time to repair (MTTR) is spent trying to figure out what changed. Therefore, the first question that high performers ask when a system outage occurs is "what changed?" In Visible Ops, this process of ruling out change early in the repair cycle guides the repair activities of high performers.

Contrast this behavior to how low performers work. When a system goes down, the first thing they do is reboot the server in question. If that doesn’t work, they’ll reboot the server next to it. That didn’t work? Reboot all the servers! Still not working? Reboot the firewall.

These two extremes of diagnosing and resolving outages have a dramatic impact on how quickly the problem is found and how long the outage lasts. Which approach an organization takes is also an incredibly accurate predictor of the processes, procedures and controls it will have in place.
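
To make the metric concrete, here is a minimal sketch of how MTTR and the share of change-related outages might be computed from an incident log. The records, field layout and function names are all hypothetical and made up for illustration; they are not taken from Visible Ops or the ITPI study.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records (illustrative only):
# (outage start, service restored, was the root cause a change?)
INCIDENTS = [
    (datetime(2007, 4, 2, 9, 0), datetime(2007, 4, 2, 9, 45), True),
    (datetime(2007, 4, 5, 14, 0), datetime(2007, 4, 5, 17, 0), True),
    (datetime(2007, 4, 9, 1, 30), datetime(2007, 4, 9, 2, 0), False),
]


def mttr_minutes(incidents):
    """Mean time to repair: average minutes from outage start to restoration."""
    return mean((end - start).total_seconds() / 60 for start, end, _ in incidents)


def change_related_share(incidents):
    """Fraction of incidents whose recorded root cause was a change."""
    return sum(1 for _, _, by_change in incidents if by_change) / len(incidents)


if __name__ == "__main__":
    print(f"MTTR: {mttr_minutes(INCIDENTS):.0f} minutes")
    print(f"Change-related outages: {change_related_share(INCIDENTS):.0%}")
```

With these three made-up incidents (45, 180 and 30 minutes), the MTTR works out to 85 minutes. Tracking both numbers over time is one way to see whether asking "what changed?" first is actually shortening outages.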

Wouldn’t it be great if we had some empirical data to show how MTTR can be improved? Well, we now have it, thanks to the IT Controls Performance Study published in April 2006 by the IT Process Institute (ITPI). It surveyed 98 IT organizations on IT controls and operational performance and found the existence of foundational controls and three distinct sets of high, medium and low performers.

http://www.tripwire.com/resources/articles/index.cfm?aid=17

Published Thursday, April 12, 2007 11:59 AM by Gene Kim

Comments

# re: Metrics That Matter - Part 1 - Mean Time to Repair

Seems like you might get some cultural resistance to this. People don't like being held accountable, and especially don't like the idea of being diagnosed as a "low performer". You'd have to drive this type of initiative from the top down.
Thursday, April 12, 2007 2:34 PM by Matt

# re: Metrics That Matter - Part 1 - Mean Time to Repair

Gene's observation about the reboot approach is excellent. In part this stems from the indiscipline around PC software as a whole, where rebooting is the most commonly accepted way to recover systems.
Wednesday, May 16, 2007 12:40 PM by GR

# re: Metrics That Matter - Part 1 - Mean Time to Repair

I am looking for published literature on the problem resolution process (in addition to the ITPI study). In this post, Gene contrasts high-performance resolution (find out what changed) with low-performance resolution (the reboot approach).

Such published literature will help to inform operations groups about which approach to adopt.

Thursday, July 19, 2007 10:19 AM by Gaurav Rampal
