At a high level, production alerts serve to communicate what’s happening, why it’s bad, and what you should do about it. It’s important to remember that these alerts may be going to someone handling their very first alert, or to someone on their 10,000th alert who is now reading it at 4am.

To that end, an ideal structure for an alert looks like this:

CRITICAL alert: kafka lag exceeding 20s on order_sync
---
Observing 20s delay (threshold: 2s) for processing kafka messages on
order_sync topic.
 
Impact:
- Increased likelihood of database deadlocks
- Customer checkout is impacted
 
Affected systems:
- Checkout
- Fulfillment
 
Dashboard: <link>
Runbook: <link>

The title here tells you that it’s critical, which is super helpful for knowing whether this is advisory or a pants-on-fire kind of situation. The body contextualizes the observed value against the threshold, gives us a sense of which systems are affected and what the impact of this delay can be, and finally links off to useful resources to help the on-call engineer diagnose quickly.
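To make the structure concrete, here’s a minimal sketch in Python of a helper that renders an alert in this shape before it goes out to your paging or chat tool. The Alert fields and render_alert function are illustrative names for this sketch, not part of any particular alerting product.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Alert:
    severity: str                 # e.g. "CRITICAL", "WARNING", "INFO"
    title: str                    # what is happening, in plain language
    observed: str                 # the value that tripped the alert
    threshold: str                # the threshold it was compared against
    impact: List[str]             # why this is bad
    affected_systems: List[str]   # who feels it
    dashboard_url: str
    runbook_url: str

def render_alert(a: Alert) -> str:
    """Render an alert body in the severity / impact / links structure above."""
    lines = [
        f"{a.severity} alert: {a.title}",
        "---",
        f"Observing {a.observed} (threshold: {a.threshold}).",
        "",
        "Impact:",
        *[f"- {i}" for i in a.impact],
        "",
        "Affected systems:",
        *[f"- {s}" for s in a.affected_systems],
        "",
        f"Dashboard: {a.dashboard_url}",
        f"Runbook: {a.runbook_url}",
    ]
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_alert(Alert(
        severity="CRITICAL",
        title="kafka lag exceeding 20s on order_sync",
        observed="20s delay for processing kafka messages on order_sync topic",
        threshold="2s",
        impact=["Increased likelihood of database deadlocks",
                "Customer checkout is impacted"],
        affected_systems=["Checkout", "Fulfillment"],
        dashboard_url="<link>",
        runbook_url="<link>",
    )))
```

The exact transport (PagerDuty, Slack, email) matters less than the fact that every alert carries the same fields in the same order, so the reader always knows where to look.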

Examples

DB connections

Bad

#20861: [New Relic] tm-prd-db-20180725215753706800000001-1 query result is > 200.0 for 5 minutes on 'tm-prd-magento-connection-count'
Policy Name : prod-DB
Target Name : DatastoreSample query
Target Type : Query
Target Product : NRQL
Target Link : <link>

This example isn’t telling us enough: it doesn’t indicate criticality, and we don’t know what the downstream impacts are. One thing it does get right is linking us directly to a dashboard showing the issue.

Good

This is an improved version:

INFO: DB connections have hit more than 10% of max
---
Database: tm-prd-db-20180725215753706800000001-1
 
This probably means a traffic spike and not a problem, but validate anyway.
 
Affected systems:
- Primary Web App
 
Potential concerns:
- DDoS attack?
- Rogue process making lots of DB connections?
 
Dashboard: <link>
Runbook: <link>
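As a rough illustration, here’s how the check behind this alert might look if you wired it up yourself. The 10% threshold, the db_connection_alert function, and the way the metric is sampled are assumptions for this sketch, not what New Relic evaluates for you.

```python
from typing import Optional

def db_connection_alert(current: int, maximum: int, database: str) -> Optional[str]:
    """Return an INFO alert body once connection usage crosses 10% of max, else None."""
    usage = current / maximum
    if usage <= 0.10:
        return None
    return "\n".join([
        "INFO: DB connections have hit more than 10% of max",
        "---",
        f"Database: {database}",
        f"Currently {current} of {maximum} connections ({usage:.0%}).",
        "",
        "This probably means a traffic spike and not a problem, but validate anyway.",
        "",
        "Affected systems:",
        "- Primary Web App",
        "",
        "Potential concerns:",
        "- DDoS attack?",
        "- Rogue process making lots of DB connections?",
        "",
        "Dashboard: <link>",
        "Runbook: <link>",
    ])
```

Even at INFO severity, the alert still carries the observed value, the likely cause, the affected systems, and the links, so whoever reads it at 4am doesn’t have to go digging for context.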