Metrics That Matter - Part 1 - Mean Time to Repair
I'm writing a series of articles on metrics related to Visible Ops. I'm posting the intro to each, with a link to the full article.
Please leave comments. Regards. Gene.
High performers know that 80% of all outages are due to a change, and that 80% of mean time to repair (MTTR) is spent trying to figure out what changed. Therefore, the first question that high performers ask when a system outage occurs is "what changed?" In Visible Ops, this process of ruling out change early in the repair cycle guides the repair activities of high performers.
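As a rough illustration of asking "what changed?" first, here is a minimal Python sketch that lists the changes recorded shortly before an outage began, newest first. The function name, the record fields (when, system, what), the 24-hour window and the sample data are all hypothetical; they are not taken from Visible Ops or from any particular change management tool.

```python
from datetime import datetime, timedelta


def recent_changes(changes, outage_start, window=timedelta(hours=24)):
    """Return the changes made within `window` before the outage, newest first."""
    earliest = outage_start - window
    hits = [c for c in changes if earliest <= c["when"] <= outage_start]
    return sorted(hits, key=lambda c: c["when"], reverse=True)


# Hypothetical change log entries, for illustration only.
change_log = [
    {"when": datetime(2006, 4, 3, 9, 0), "system": "db01", "what": "kernel patch"},
    {"when": datetime(2006, 4, 4, 22, 30), "system": "web02", "what": "config edit"},
    {"when": datetime(2006, 4, 5, 1, 15), "system": "fw01", "what": "rule update"},
]
outage_start = datetime(2006, 4, 5, 2, 0)

# The most recent changes are the first candidates to rule out.
suspects = recent_changes(change_log, outage_start)
for change in suspects:
    print(change["when"], change["system"], change["what"])
```

The point of the sketch is only the ordering of work: check the most recent recorded changes before touching anything else. It assumes every change is actually logged, which is exactly what the controls discussed in this series are meant to ensure.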
Contrast this behavior with how low performers work. When a system goes down, the first thing they do is reboot the server in question. If that doesn’t work, they’ll reboot the server next to it. That didn’t work? Reboot all the servers! Still not working? Reboot the firewall.
These two extremes of diagnosing and resolving outages have a dramatic impact on how quickly the problem is found and how long the outage lasts. Which approach an IT organization takes is also an incredibly accurate predictor of the processes, procedures and controls it has in place.
Wouldn’t it be great if we had some empirical data to show how MTTR can be improved? Well, we now have it, thanks to the IT Controls Performance Study published in April 2006 by the IT Process Institute (ITPI). It surveyed 98 IT organizations on IT controls and operational performance, and identified a set of foundational controls and three distinct groups of high, medium and low performers.
http://www.tripwire.com/resources/articles/index.cfm?aid=17