alarm severity

Alarms (aka alerts) come in different levels of bad. Here are examples of how I think about priorities:

P1/Critical

The house is on fire. Major customer-facing outage, data corruption in progress, security breach, or complete service failure affecting revenue. These are “wake everyone up” incidents that trigger immediate response regardless of time. They necessarily involve cross-team and cross-function coordination.

Example: Checkout is completely dead. Nobody can complete a purchase. The checkout service is returning 500s or timing out entirely. This is literally stopping all revenue.

P2/High

Something’s broken but the world isn’t ending. Degraded performance, partial outages, or features that aren’t working but have workarounds. During off-hours, the oncall follows the runbook and escalates if standard fixes don’t work. These might become P1s if left untreated, so we handle them, just without the five-alarm response.

Example: Search is returning no results for 30% of queries. The Elasticsearch cluster is partially degraded. People can browse categories fine, but search is broken for common terms like “coconut oil” or “protein powder.”

P3/Medium

This needs fixing but can wait until tomorrow. Non-critical bugs, performance issues in non-essential features, or problems affecting internal tools. Oncalls document these for handoff and feature teams address them during core hours. Nobody’s getting paged at 2am for a P3.

Example: Admin UI is timing out. Our internal team can’t update items. This impacts internal operations but zero members. Oncall notes it in the handoff, maybe kills some long-running queries if the runbook says to, but nobody’s getting paged.

P4/Low

P4s shouldn’t exist in the alerting system. If something is P4, it shouldn’t page anyone, anywhere, ever. These are things that should be tracked in error logs, show up in dashboards, or get caught in weekly metrics reviews.

Example: Performance degradation that doesn’t impact SLOs. If a page loads in 2.5 seconds instead of 2 seconds, and our SLO is 3 seconds, that’s monitoring data for performance optimization, not an incident.
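To make the four levels concrete, here is a minimal sketch of how they might map to a response. This is an illustration, not a real alerting config: the names `Severity`, `Route`, `ROUTES`, `pages_someone`, and `within_slo` are all hypothetical, and the mapping is only my reading of the descriptions above.

```python
from enum import Enum


class Severity(Enum):
    P1 = "critical"
    P2 = "high"
    P3 = "medium"
    P4 = "low"


class Route(Enum):
    PAGE_EVERYONE = "page everyone, any time of day"
    PAGE_ONCALL = "page the oncall, who follows the runbook and escalates"
    HANDOFF_NOTE = "document for handoff, fix during core hours"
    DASHBOARD_ONLY = "logs, dashboards, weekly metrics review"


# Hypothetical routing table: one response per severity level.
ROUTES = {
    Severity.P1: Route.PAGE_EVERYONE,
    Severity.P2: Route.PAGE_ONCALL,
    Severity.P3: Route.HANDOFF_NOTE,
    Severity.P4: Route.DASHBOARD_ONLY,
}


def route_for(severity: Severity) -> Route:
    """Look up the response for a given severity."""
    return ROUTES[severity]


def pages_someone(severity: Severity) -> bool:
    """Only P1 and P2 ever interrupt a human."""
    return route_for(severity) in {Route.PAGE_EVERYONE, Route.PAGE_ONCALL}


def within_slo(observed_seconds: float, slo_seconds: float) -> bool:
    """A slowdown that stays inside the SLO is monitoring data, not an incident."""
    return observed_seconds <= slo_seconds


if __name__ == "__main__":
    for severity in Severity:
        print(severity.name, "->", route_for(severity).value)
    # The page-load example: 2.5s observed against a 3s SLO.
    print("2.5s within 3s SLO:", within_slo(2.5, 3.0))
```

The point of keeping this as a plain lookup table is that the severity decides the response up front, so nobody has to debate at 2am whether something deserves a page.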

The notes of Justin Abrahms
