Talk at SREcon

Value of incident reports:

  1. communication
  2. accountability
  3. coordination
  4. (hopefully) learning

Hypothesis: after-hours incidents will have high MTTR and complexity.

  • benefits: falsifiable
  • built on “shaky ground” of MTTR
  • even shakier ground: measuring complexity

Learnings:

  • most incidents don’t happen at night
  • MTTR isn’t useful in aggregate.
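
A minimal sketch of one common reason an aggregate MTTR can mislead (not necessarily the speaker's reasoning): a single long incident dominates the mean. The durations below are invented for illustration and are not data from the talk.

```python
from statistics import mean, median

# Hypothetical time-to-resolve values in minutes (invented for illustration,
# not data from the talk). One long outage sits among many short incidents.
durations_min = [12, 15, 18, 20, 25, 30, 45, 600]


def summarize(durations):
    """Return the mean (the usual "MTTR") next to the median and the max."""
    return {
        "mean": mean(durations),
        "median": median(durations),
        "max": max(durations),
    }


summary = summarize(durations_min)
print(summary)
```

With this made-up data the mean lands far above the median, so the single number describes almost none of the individual incidents.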

Complexity: “How hard was this to fix? Did it have a clear and obvious resolution? Were senior engineers required to fix it? Graded on a 1-5 scale, 1 being fairly simple, 5 being the hardest known solution” <- shenanigans metric.

They tracked the prevalence of postmortems.

  • only 55% of incidents got a postmortem
  • started program to increase postmortem writing

Hypothesis: postmortems are the norm

  • barely falsifiable (define “norm”)
  • easy to measure from a binary perspective

Finding: the postmortem rate went from 55% to 62%.

They engaged user researchers to study internal developers.

  • “ambiguous, hard, and I haven’t been trained”

Then they made training.

Lower-severity incidents didn’t have postmortems.
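
A rough sketch of the binary "was a postmortem written?" measurement, overall and by severity. The record layout (`severity`, `has_postmortem`) and the sample incidents are hypothetical, and nothing here reflects the speaker's actual tooling or numbers.

```python
from collections import defaultdict

# Hypothetical incident records; field names and values are invented.
incidents = [
    {"id": "INC-1", "severity": 1, "has_postmortem": True},
    {"id": "INC-2", "severity": 1, "has_postmortem": True},
    {"id": "INC-3", "severity": 2, "has_postmortem": True},
    {"id": "INC-4", "severity": 2, "has_postmortem": False},
    {"id": "INC-5", "severity": 3, "has_postmortem": False},
    {"id": "INC-6", "severity": 3, "has_postmortem": False},
    {"id": "INC-7", "severity": 3, "has_postmortem": False},
    {"id": "INC-8", "severity": 3, "has_postmortem": False},
]


def postmortem_rate(records):
    """Fraction of records that have a postmortem (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(r["has_postmortem"] for r in records) / len(records)


def rate_by_severity(records):
    """Postmortem rate per severity level, keyed by severity."""
    groups = defaultdict(list)
    for r in records:
        groups[r["severity"]].append(r)
    return {sev: postmortem_rate(rs) for sev, rs in sorted(groups.items())}


overall = postmortem_rate(incidents)
by_severity = rate_by_severity(incidents)
print(overall, by_severity)
```

Splitting by severity shows whether a low overall rate comes from the lower-severity incidents.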

How to do it:

  1. find some artifacts
  2. time-box studying the artifacts
  3. hypothesize
  4. make a methodology which fits in your time box
  5. run it by a data scientist
  6. break up the artifacts into a list and study each one
  7. analyze
  8. write up the results, learn, share, rejoice.

Notes:

  • correlations are problematic. Sample size is small, population is unknown, sane p-values are hard to come by. It’s more of a census.
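
A sketch of why small samples make statistics shaky: if one did treat the incidents as a sample (the note above argues this is dubious), a Wilson score interval on a proportion stays very wide at small n. The counts (11 of 20) are invented, and the choice of the Wilson interval is an assumption for illustration, not something from the talk.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts: 11 of 20 incidents had some property of interest.
low, high = wilson_interval(11, 20)
print(f"{low:.2f} to {high:.2f}")
```

With 20 observations the interval spans roughly 40 percentage points, which is one way to see why the notes call this more of a census than a study.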

Things they found:

  • “Nobody knows what the ‘start’ or ‘end’ of an incident is”. Detection? Impact?
  • 80% of incidents happen during business hours
  • uptime success can hide big problems with productivity (e.g. burning people out)
  • 30% of declared incidents are local change failures (it’s mostly the environment changing)
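
A sketch of how a business-hours share could be computed from incident timestamps. The 09:00-17:00 Monday-to-Friday window, the naive local timestamps, and the sample data are all assumptions. Which timestamp counts as the "start" is itself one of the open questions above.

```python
from datetime import datetime

# Assumed business-hours window; the talk does not define one.
BUSINESS_START_HOUR = 9
BUSINESS_END_HOUR = 17


def is_business_hours(ts):
    """True if ts falls Monday-Friday between the assumed start and end hours."""
    is_weekday = ts.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and BUSINESS_START_HOUR <= ts.hour < BUSINESS_END_HOUR


def business_hours_share(timestamps):
    """Fraction of timestamps inside business hours (0.0 for an empty list)."""
    if not timestamps:
        return 0.0
    return sum(is_business_hours(t) for t in timestamps) / len(timestamps)


# Hypothetical incident start times (naive local time, invented).
starts = [
    datetime(2024, 1, 8, 10, 30),   # Monday morning
    datetime(2024, 1, 8, 3, 0),     # Monday, middle of the night
    datetime(2024, 1, 9, 16, 59),   # Tuesday, just before close
    datetime(2024, 1, 13, 11, 0),   # Saturday
]

share = business_hours_share(starts)
print(share)
```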

Q: something you mentioned suggested the oncall report might be useful mid-event?