At a high level, production alerts serve to communicate what’s happening, why it’s bad, and what you should do about it. It’s important to note that these alerts may be going to people who are handling their very first alert, or to people who are on their 10,000th alert but it’s now 4am.
To that end, an ideal structure for an alert looks like this:
The title here tells you that the alert is critical, which is super helpful for knowing whether this is advisory or a pants-on-fire kind of situation. The alert puts the observed value in context against the threshold value. We have a sense of which systems are affected and what the impact of this delay can be. Lastly, it links off to useful resources to help the oncall diagnose quickly.
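As a rough illustration of that structure, here is a minimal Python sketch that renders an alert with those pieces: severity in the title, observed value next to the threshold, affected systems, impact, and links. This is not the original example alert; the `Alert` class, the `render` function, the field names, the sample values, and the example.com URLs are all hypothetical and only meant to show the shape.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical container for the pieces an alert should carry."""
    severity: str                 # e.g. "critical" or "advisory"
    summary: str                  # what is happening
    observed: str                 # the value that triggered the alert
    threshold: str                # the value it was compared against
    affected_systems: list[str]   # which systems are affected
    impact: str                   # why it is bad
    links: dict[str, str] = field(default_factory=dict)  # label -> URL


def render(alert: Alert) -> str:
    """Build the alert text: title first, then context, then links."""
    lines = [
        f"[{alert.severity.upper()}] {alert.summary}",
        f"Observed: {alert.observed} (threshold: {alert.threshold})",
        f"Affected: {', '.join(alert.affected_systems)}",
        f"Impact: {alert.impact}",
    ]
    lines += [f"{label}: {url}" for label, url in alert.links.items()]
    return "\n".join(lines)


# All values below are made up for illustration.
example = Alert(
    severity="critical",
    summary="Queue processing delay",
    observed="12 min",
    threshold="5 min",
    affected_systems=["order-worker"],
    impact="Customer order confirmations are delayed",
    links={
        "Dashboard": "https://example.com/dashboards/queue-delay",
        "Runbook": "https://example.com/runbooks/queue-delay",
    },
)

print(render(example))
```

Keeping the pieces as separate fields makes it harder to ship an alert that is missing its severity or its impact, which is exactly the kind of gap that hurts at 4am.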
Examples
DB connections
Bad
This example isn’t telling us enough. It doesn’t indicate criticality, and we don’t know what the downstream impacts are. One thing it does get right is that it links us directly to a dashboard showing the issue.
Good
This is an improved version