SREcon

https://thevoid.community

Shallow data

duration

  • your data is skewed (i.e. not normally distributed), which makes averages misleading (right-skewed distribution: most incidents are short, with a long tail of very long ones)
  • this ^^ means the “mean” of it isn’t a representative summary.
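
A minimal sketch of why the mean misleads on skewed durations. The numbers below are made up for illustration (they are not from the talk or from the VOID dataset), and `summarize` is a hypothetical helper name:

```python
import statistics


def summarize(durations):
    """Return (mean, median) for a list of incident durations in minutes."""
    return statistics.mean(durations), statistics.median(durations)


# Hypothetical incident durations in minutes: mostly short, a few very long.
durations = [12, 15, 18, 20, 22, 25, 30, 35, 40, 45, 60, 90, 480, 1440]

mean, median = summarize(durations)

# Count how many incidents were actually shorter than the "average" incident.
below_mean = sum(d < mean for d in durations)

print(f"mean   = {mean:.1f} min")
print(f"median = {median:.1f} min")
print(f"{below_mean} of {len(durations)} incidents are shorter than the mean")
```

With a long tail like this, a couple of outliers pull the mean far above what a typical incident looks like, so most incidents end up shorter than the "average" one. The median is less sensitive to the tail, though it is still only a single number.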

mttr

  • what we hope it tells us:
    • how reliable your software/systems are
    • how agile/effective the team/org is
    • how your trajectory of improvement is
    • how long the next incident will last
    • how “bad” any given incident is
  • In actuality, it tells you none of those.
  • “recovery” isn’t quite definitive (mitigation? or rollback? Or actual fix?)

severity

  • “severity is negotiable”
    • customer impact?
    • effort to fix?
    • urgency?
  • sometimes automated
  • sometimes updated
  • sometimes subjectively assigned
  • often gamed (to get assistance / avoid post-incident review / etc)

Severity and duration aren’t statistically correlated.
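
One way to check this against your own incident data is a rank correlation, since severity is ordinal and duration is skewed. The sketch below computes Spearman's rho with only the standard library. It is not necessarily the analysis behind the claim above, and the `incidents` list is invented purely to show the mechanics:

```python
from statistics import mean


def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical (severity, duration_minutes) pairs; severity 1 = most severe.
incidents = [(1, 45), (1, 600), (2, 20), (2, 240),
             (3, 15), (3, 480), (4, 30), (4, 90)]
severities = [s for s, _ in incidents]
durations = [d for _, d in incidents]

rho = spearman(severities, durations)
print(f"Spearman rho = {rho:.2f}")
```

A rho near 0 suggests severity says little about how long an incident lasts; values near -1 or 1 would suggest a strong monotonic relationship. With a sample this small the result means nothing by itself, so treat it as a template for real data.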

root cause

  • simplifies causality in complex systems
  • fails to identify upstream & other systemic factors
  • over-indexes on human decisions/actions

Depths

Rich details: Incident stories

  • contain rich sociotechnical detail compared to linear accounts focused only on technical elements
  • convey multiple, different perspectives
  • reveal themes & patterns
  • zoom in and out to paint a picture of the whole system

Incident stories let you better understand the “vibe” of people during the incident, which can uncover unease about our systems.

safety boundaries

Rasmussen, 1997

Safety boundaries (graph of an “operating point” within the bounds of the following graphed lines):

  • acceptable performance boundary ()
  • economic failure boundary (keeping the lights on)
  • unacceptable workload boundary (not burning people out)

Q: That’s interesting. ^^ Are the boundaries always the same?

near misses

they reveal

  • gaps in knowledge
  • communication breakdowns
  • pockets of expertise
  • misaligned mental models
  • cultural/political forces
  • range of assumptions that practitioners have about their systems

communities

Incident analysis is hard work.

Sharing these things:

  • builds a community of practice
  • shines a light on the importance of the work
  • supports/educates/shows caring for each other