Talk at SREcon
Value of incident reports:
- communication
- accountability
- coordination
- (hopefully) learning
—
Hypothesis: after-hours incidents will have higher MTTR and complexity.
- benefits: falsifiable
- built on “shaky ground” of MTTR
- even shakier ground: measuring complexity
Learnings:
- most incidents don’t happen at night
- MTTR isn’t useful in aggregate (sketch below).
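One common reason, sketched with hypothetical numbers (not from the talk): time-to-resolve distributions are heavy-tailed, so a single long incident drags the mean far from anything typical.

```python
import statistics

# Hypothetical incident durations in minutes; heavy-tailed, as is typical.
durations = [12, 15, 18, 20, 25, 30, 35, 40, 55, 900]

mean_ttr = statistics.mean(durations)      # the number "MTTR" usually reports
median_ttr = statistics.median(durations)

print(f"mean TTR:   {mean_ttr:.1f} min")   # 115.0, dominated by the 900-min outlier
print(f"median TTR: {median_ttr:.1f} min") # 27.5, closer to a typical incident
```

Neither number predicts the next incident, which is the “shaky ground” above.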
Complexity: “How hard was this to fix? Did it have a clear and obvious resolution? Were senior engineers required to fix it? Graded on a 1-5 scale, 1 being fairly simple, 5 being the hardest known solution” <- shenanigans metric.
They tracked the prevalence of postmortems.
- only managed to hit a 55% postmortem completion rate
- started program to increase postmortem writing
—
Hypothesis: postmortems are the norm
- barely falsifiable (define “norm”)
- easy to measure from a binary perspective
Finding: postmortem completion went from 55% -> 62%.
User researchers were brought in to study internal developers.
- “ambiguous, hard, and I haven’t been trained”
Then they built training.
Lower-severity incidents didn’t get postmortems.
—
How to do it:
- find some artifacts
- timebox studying the artifacts
- hypothesize
- make a methodology that fits in your timebox
- run it by a data scientist
- break up the artifacts into a list and study each one (sketched after this list)
- analyze
- write up the results, learn, share, rejoice.
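A hypothetical skeleton of that loop (the fields, rubric, and business-hours window are my assumptions, not the speaker’s schema):

```python
import random
from collections import Counter

# Stand-in for the real artifact list; in practice, export your incident
# reports and load them here. Fields are illustrative.
artifacts = [
    {"id": 1, "start_hour": 14, "has_postmortem": True,  "complexity": 2},
    {"id": 2, "start_hour": 3,  "has_postmortem": False, "complexity": 4},
    {"id": 3, "start_hour": 10, "has_postmortem": True,  "complexity": 1},
]

# Timebox by capping the sample size rather than rushing each artifact.
random.seed(0)
sample = random.sample(artifacts, k=min(50, len(artifacts)))

business_hours = range(9, 18)  # assumption: 09:00-17:59 local time
tally = Counter()
for a in sample:
    tally["business_hours"] += a["start_hour"] in business_hours
    tally["has_postmortem"] += a["has_postmortem"]

n = len(sample)
print(f"{tally['business_hours'] / n:.0%} during business hours")
print(f"{tally['has_postmortem'] / n:.0%} have a postmortem")
```

Grading every artifact against one fixed rubric is what makes the final tallies comparable.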
Notes:
- correlations are problematic: the sample size is small, the underlying population is unknown, and sane p-values are hard to come by. It’s more of a census than a sample (sketch below).
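To make that concrete, a quick synthetic check (not from the talk): Pearson r computed on 20 points of pure noise swings widely, so a correlation spotted across a few dozen incident reports is weak evidence by itself.

```python
import random
import statistics

def pearson_r(xs, ys):
    # Sample Pearson correlation, computed by hand to stay dependency-free.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    n = len(xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))

random.seed(0)
rs = []
for _ in range(1000):
    xs = [random.gauss(0, 1) for _ in range(20)]
    ys = [random.gauss(0, 1) for _ in range(20)]  # independent of xs by construction
    rs.append(pearson_r(xs, ys))

rs.sort()
# Middle 95% of observed r on data with zero true correlation at n = 20:
print(f"r ranges from {rs[25]:.2f} to {rs[975]:.2f} on pure noise")
```

At n = 20 the noise band is wide (roughly ±0.4), which is why treating the numbers as a census description rather than statistical inference is the safer framing.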
Things they found:
- “Nobody knows what the ‘start’ or ‘end’ of an incident is”. Detection? Impact?
- 80% of incidents happen during business hours
- uptime success can hide big problems with productivity (e.g. burning people out)
- 30% of declared incidents are local change failures (it’s mostly the environment changing)
—
Q: Something you mentioned suggested the on-call report might be useful mid-incident?