Talk at SREcon
Value of incident reports:
- communication
- accountability
- coordination
- (hopefully) learning
—
Hypothesis: after-hours incidents will have higher MTTR and complexity.
- benefits: falsifiable
- built on “shaky ground” of MTTR
- even shakier ground: measuring complexity
Learnings:
- most incidents don’t happen at night
- MTTR isn’t useful in aggregate.
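Rough sketch of how the after-hours hypothesis could be checked: bucket incidents by start time, then compare the share and mean time to resolve per bucket. The records, the tuple layout, and the Mon-Fri 09:00-17:00 definition of business hours are all assumptions for illustration, not data or definitions from the talk.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records as (start, resolved). Not data from the talk.
INCIDENTS = [
    (datetime(2023, 3, 6, 10, 15), datetime(2023, 3, 6, 11, 0)),    # Monday, daytime
    (datetime(2023, 3, 7, 14, 30), datetime(2023, 3, 7, 14, 50)),   # Tuesday, daytime
    (datetime(2023, 3, 8, 2, 5), datetime(2023, 3, 8, 4, 5)),       # Wednesday, night
    (datetime(2023, 3, 11, 13, 0), datetime(2023, 3, 11, 13, 30)),  # Saturday
]


def is_business_hours(ts):
    # Assumption: business hours are Mon-Fri, 09:00-17:00 local time.
    return ts.weekday() < 5 and 9 <= ts.hour < 17


def summarize(incidents):
    """Return share of incidents and mean time to resolve (minutes) per bucket."""
    buckets = {"business": [], "after_hours": []}
    for start, resolved in incidents:
        minutes = (resolved - start).total_seconds() / 60
        key = "business" if is_business_hours(start) else "after_hours"
        buckets[key].append(minutes)
    total = len(incidents)
    return {
        key: {
            "share": len(vals) / total,
            "mean_ttr_min": mean(vals) if vals else None,
        }
        for key, vals in buckets.items()
    }


if __name__ == "__main__":
    for bucket, stats in summarize(INCIDENTS).items():
        print(bucket, stats)
```

With only a handful of incidents per bucket, one long incident moves the mean a lot, so treat the per-bucket numbers as descriptive only.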
Complexity: “How hard was this to fix? Did it have a clear and obvious resolution? Were senior engineers required to fix it? Graded on a 1-5 scale, 1 being fairly simple, 5 being the hardest known solution” <- shenanigans metric.
They tracked how often postmortems got written.
- only 55% of incidents got a postmortem
- started a program to increase postmortem writing
—
Hypothesis: postmortems are the norm
- barely falsifiable (define “norm”)
- easy to measure from a binary perspective
Finding: postmortem rate went from 55% to 62%.
They engaged user researchers to study internal developers.
- developers’ feedback: “ambiguous, hard, and I haven’t been trained”
Then they built training.
Lower-severity incidents didn’t get postmortems.
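Sketch of the binary postmortem measurement, broken down by severity so a gap at lower severities shows up. The records and the convention that severity 1 is the most severe are assumptions for illustration; the talk's actual severity scheme isn't in these notes.

```python
from collections import defaultdict

# Hypothetical records as (severity, has_postmortem).
# Assumption: severity 1 is the most severe.
INCIDENTS = [
    (1, True), (1, True),
    (2, True), (2, False),
    (3, False), (3, False), (3, True),
]


def postmortem_rate_by_severity(incidents):
    """Return {severity: fraction of incidents with a postmortem}."""
    counts = defaultdict(lambda: [0, 0])  # severity -> [with_postmortem, total]
    for severity, has_pm in incidents:
        counts[severity][0] += int(has_pm)
        counts[severity][1] += 1
    return {sev: with_pm / total for sev, (with_pm, total) in sorted(counts.items())}


def overall_rate(incidents):
    """Fraction of all incidents that have a postmortem."""
    return sum(1 for _, has_pm in incidents if has_pm) / len(incidents)


if __name__ == "__main__":
    print("overall:", round(overall_rate(INCIDENTS), 2))
    for sev, rate in postmortem_rate_by_severity(INCIDENTS).items():
        print(f"sev {sev}: {rate:.0%}")
```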
—
How to do it:
- find some artifacts
- time-box studying the artifacts
- hypothesize
- make a methodology which fits in your time box
- run it by a data scientist
- break up the artifacts into a list and study each one
- analyze
- write up the results, learn, share, rejoice.
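The time-boxing step can be made concrete with simple arithmetic: reserve part of the box for analysis and write-up, then split the rest evenly across artifacts. A minimal sketch; `plan_review`, the artifact names, and the 25% reserve are hypothetical choices, not anything prescribed in the talk.

```python
def plan_review(artifacts, time_box_hours, overhead_fraction=0.25):
    """Split a time box evenly across artifacts.

    overhead_fraction is the share of the time box held back for analysis
    and write-up (assumed value, tune to taste).
    Returns a list of (artifact, minutes_to_spend) pairs.
    """
    if not artifacts:
        raise ValueError("need at least one artifact")
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in [0, 1)")
    study_hours = time_box_hours * (1 - overhead_fraction)
    per_artifact_min = study_hours * 60 / len(artifacts)
    return [(name, per_artifact_min) for name in artifacts]


if __name__ == "__main__":
    # Hypothetical artifact names.
    reports = ["report-a", "report-b", "report-c", "report-d"]
    for name, minutes in plan_review(reports, time_box_hours=8):
        print(f"{name}: {minutes:.0f} min")
```

If the per-artifact budget comes out too small to read one artifact properly, that is the signal to shrink the methodology or the artifact list before starting.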
Notes:
- correlations are problematic: the sample size is small, the population is unknown, and sane p-values are hard to come by. It’s more of a census.
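To see why small samples are a problem, a Wilson score interval for a proportion shows how wide the uncertainty gets. This is a standard technique swapped in for illustration; the notes don't say what statistics the speaker used, and the counts below are made up. If the data really is a census, the interval only makes sense as a statement about some wider hypothetical population.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical counts: the same 55% rate at two very different sample sizes.
    for successes, n in [(11, 20), (1100, 2000)]:
        low, high = wilson_interval(successes, n)
        print(f"{successes}/{n}: {low:.2f} to {high:.2f}")
```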
Things they found:
- “Nobody knows what the ‘start’ or ‘end’ of an incident is”. Detection? Impact?
- 80% of incidents happen during business hours
- uptime success can hide big problems with productivity (e.g. burning people out)
- only 30% of declared incidents are local change failures (mostly it’s the environment changing)
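The start/end ambiguity is easy to show with one incident: the "duration" depends entirely on which timestamps are picked. Sketch only; the timeline and the field names (`impact_start`, `detected`, `declared`, `mitigated`, `resolved`) are hypothetical, not a schema from the talk.

```python
from datetime import datetime

# Hypothetical timeline for a single incident.
INCIDENT = {
    "impact_start": datetime(2023, 5, 2, 9, 40),
    "detected": datetime(2023, 5, 2, 10, 5),
    "declared": datetime(2023, 5, 2, 10, 20),
    "mitigated": datetime(2023, 5, 2, 11, 0),
    "resolved": datetime(2023, 5, 2, 13, 30),
}


def duration_minutes(incident, start_key, end_key):
    """Minutes between two named timestamps of one incident."""
    return (incident[end_key] - incident[start_key]).total_seconds() / 60


def all_durations(
    incident,
    start_keys=("impact_start", "detected", "declared"),
    end_keys=("mitigated", "resolved"),
):
    """Duration for every combination of candidate start and end definitions."""
    return {
        (start, end): duration_minutes(incident, start, end)
        for start in start_keys
        for end in end_keys
    }


if __name__ == "__main__":
    for (start, end), minutes in all_durations(INCIDENT).items():
        print(f"{start} -> {end}: {minutes:.0f} min")
```

For this made-up timeline the answer ranges from 40 to 230 minutes, which is one reason aggregate MTTR numbers are hard to compare unless the definitions are pinned down first.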
—
Q: something you mentioned suggested the oncall report might be useful mid-event?