Designing alerts nobody ignores

Every alert that fires without a human needing to do anything is quietly training your team to ignore the next one. Do that for a few weeks and the page that actually matters arrives in a channel nobody reads, or gets swiped away at 03:00 next to the seventeen that didn't. The expensive failure isn't the outage you missed - it's the on-call engineer who has learned, rationally, that alerts are background noise.

So the goal here isn't 'more monitoring'. It's the opposite: a small set of alerts that each mean a human must act now, wired to how fast you are actually consuming your error budget, with everything else routed somewhere that wakes nobody. I'll derive the maths, give you rules you can paste into Prometheus or Datadog, and cover the process that keeps it honest.

If a human doesn't need to take intelligent action, it should not page a human. Apply that one rule without mercy and most of the noise on most teams disappears.

Noise is its own reliability problem

Treat on-call like any other system with an SLO. A sane target is no more than one or two actionable pages per shift, and effectively zero false pages. Cross that and mean time to acknowledge climbs, people mute channels, and trust in the whole alerting system collapses - which is far more dangerous than any single dashboard going red.

Every false page spends something you can't easily get back: the assumption that a page is real. Protect that assumption the way you'd protect production data.

Most on-call pain is a routing-and-noise problem: dozens of cause-based alerts collapse to a handful of pages that actually need a human.

Three destinations, only one wakes a human

Before tuning a single threshold, decide where signals go. Most alerting messes are really routing messes - everything was pointed at the pager.

Page - a human must act within minutes. Reserved for user-facing harm that is sustained and burning budget fast.
Ticket - a human must act, but it can wait for business hours: slow budget leaks, capacity trends, a certificate expiring next week.
Dashboard / record - awareness only. Nobody is woken; it informs the on-call view and the weekly review.

Your one-second error spike that recovered on its own belongs in the third bucket: visible on a dashboard, recorded against the budget, reviewed later. It is real, it cost you something, and it is absolutely not worth a phone call.

Decide where each signal goes before you touch a threshold. Only the page wakes someone; the recovered one-second spike belongs on a dashboard.

Alert on symptoms, not causes

Cause-based alerts - 'CPU > 80%', 'disk 90% full', 'pod restarted' - fire constantly and rarely line up with a user in pain. Modern systems run hot and restart things on purpose; that's healthy, not an incident. Symptom-based alerts page on what users actually feel:

Are requests failing? (error ratio)
Are they slow past the point users notice? (latency SLI)
Is the work getting through? (throughput, queue age, data freshness)

Alert on the symptom and you can delete most of the cause alerts: you'll learn the disk filled up because requests started failing - from one page instead of nine. Causes belong on dashboards and in runbooks, where they help you diagnose, not on the pager, where they just compete for attention.

Page on what users feel; keep the causes on dashboards and in runbooks, where they help you diagnose instead of competing for attention.

Why a single threshold can never win

The naive SLO alert is 'page me when we're burning budget'. Pick one window and you're forced into a losing trade-off. Make it short and it's twitchy - a 30-second blip wakes someone for a problem that already fixed itself. Make it long and it's sluggish - you can be in a hard outage for 20 minutes before anything fires.

This is exactly the one-second-spike problem. A brief spike consumes a sliver of budget (true, and worth recording) but it must never page - and a single threshold can't tell 'brief and recovered' from 'sustained and getting worse'. You need two timescales at once.

Burn rate, from first principles

Start from the budget. A 99.9% availability SLO over 30 days allows 0.1% of requests to fail - that is your error budget, about 43 minutes of full-outage equivalent over the month. Spend it slowly and you're fine; spend it in an afternoon and you have an incident.
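To check the 43-minute figure, here is a minimal Python sketch. The function name `error_budget_minutes` is my own, and it treats the budget as full-outage time, which assumes roughly even traffic across the month.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Full-outage-equivalent minutes allowed by an availability SLO."""
    period_minutes = period_days * 24 * 60
    return (1.0 - slo) * period_minutes


# Compare a few common targets over the same 30-day period.
for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} over 30d -> {error_budget_minutes(slo):.1f} min")
```

Each extra nine cuts the budget by a factor of ten, which is why the same alert thresholds feel very different at 99% and at 99.99%.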

Burn rate is how fast you're spending relative to 'exactly on budget'. A burn rate of 1 spends precisely 100% of the budget across the SLO window - sustainable. A burn rate of 14.4 spends 14.4× too fast: you'd torch the whole 30-day budget in about two days, and you've already burned 2% of it in the last hour.

burn_rate = (errors / total) / (1 - SLO)

# 99.9% SLO  ->  1 - SLO = 0.001
# if 1.44% of requests are failing right now:
#   burn_rate = 0.0144 / 0.001 = 14.4
#
# budget spent over a window = burn_rate * (window / SLO_period)
#   14.4 * (1h  / 720h) =  2%   of the 30-day budget, in one hour
#    6   * (6h  / 720h) =  5%   in six hours
#    1   * (72h / 720h) = 10%   in three days

Those three rows aren't arbitrary - they are the alert tiers. Decide how much budget you're willing to lose before a human gets involved, and the burn rate and window fall straight out of the maths.
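The same arithmetic can be run in reverse: pick the share of budget you will tolerate losing and the window, and solve for the burn rate. A small Python sketch, assuming the 30-day SLO period used here; the helper names are illustrative, not from any monitoring library.

```python
SLO_PERIOD_HOURS = 30 * 24  # 720h, the 30-day SLO period


def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the budget is being spent relative to 'exactly on budget'."""
    return error_ratio / (1.0 - slo)


def threshold_for(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * SLO_PERIOD_HOURS / window_hours


def days_to_exhaustion(rate: float) -> float:
    """Days until the whole budget is gone at a constant burn rate."""
    return SLO_PERIOD_HOURS / rate / 24


for fraction, window in ((0.02, 1), (0.05, 6), (0.10, 72)):
    rate = threshold_for(fraction, window)
    print(f"{fraction:.0%} in {window}h -> {rate:.1f}x, "
          f"budget gone in {days_to_exhaustion(rate):.1f} days")
```

If you would rather page at 1% of budget in an hour, `threshold_for(0.01, 1)` gives the stricter threshold directly.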

Every burn-rate tier is just a slope. 14.4x empties a 30-day budget in about two days, 6x in five, 1x lands exactly at day 30 - which is where the alert thresholds come from.

Multi-window, multi-burn-rate

The fix for the single-threshold trap is to watch a long window and a short window together, at more than one burn rate. The long window confirms the problem is real and sustained; the short window - roughly one twelfth of the long one - makes the alert both quick to fire and, crucially, quick to reset once you've recovered. An alert fires only when both windows exceed the threshold.

A one-second spike barely registers in the long window, so it never pages - a sustained burn trips both windows and does. The short window also makes the alert reset within minutes of recovery, which is what ends the flapping.
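To see the two-window rule behave, here is a toy simulation in Python. It is a sketch under loud assumptions: per-second buckets, a flat 100 requests per second, and made-up failure shapes. The function names are mine; a real system would evaluate this in the monitoring backend, not in application code.

```python
def window_ratio(errors, totals, end, window):
    """Error ratio over the `window` seconds ending at index `end` (exclusive)."""
    start = max(0, end - window)
    total = sum(totals[start:end])
    return sum(errors[start:end]) / total if total else 0.0


def fires(errors, totals, end, slo=0.999, burn=14.4, long_w=3600, short_w=300):
    """True only when BOTH the long and the short window exceed the burn rate."""
    limit = burn * (1.0 - slo)
    return (window_ratio(errors, totals, end, long_w) > limit
            and window_ratio(errors, totals, end, short_w) > limit)


RPS = 100    # assumed constant traffic
HOUR = 3600
totals = [RPS] * HOUR

# A 30-second total failure that ended a minute ago.
blip = [0] * HOUR
for t in range(HOUR - 90, HOUR - 60):
    blip[t] = RPS

# 5% of requests failing for the last 20 minutes, still ongoing.
sustained = [0] * HOUR
for t in range(HOUR - 1200, HOUR):
    sustained[t] = 5

print("blip pages:     ", fires(blip, totals, HOUR))
print("sustained pages:", fires(sustained, totals, HOUR))
```

The blip pushes the 5-minute window well over the threshold but leaves the 1-hour window under it, so nothing fires; the sustained failure is over the line in both.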

Severity comes from the burn rate, not from stacking same-size windows. Fast, steep burn = wake someone now. Slow, shallow burn = a ticket for tomorrow:

SEVERITY    BURN    LONG WIN  SHORT WIN   BUDGET BURNED    ROUTE
-----------------------------------------------------------------
critical    14.4x   1h        5m          2%  in 1 hour    page
high         6x     6h        30m         5%  in 6 hours   page
warning      1x     3d        6h          10% in 3 days    ticket

Read it the way you'd describe it out loud: a steep burn over a short horizon is a critical page; a gentle burn that only shows up over days is a warning ticket for the slow leak the fast tier would miss. Run all three at once - they don't conflict, they cover different failure shapes.

The short window is the part that kills the noise. Because the alert needs the short window hot too, it stops firing within minutes of recovery instead of hanging on for the length of the long window. That is what ends the flapping - and the 3am page for something that fixed itself 90 seconds ago.
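A rough way to put numbers on that reset time, as a Python sketch. It assumes constant traffic and a single outage with a fixed error ratio, and the function name is hypothetical. A window stays over the threshold until too little of the outage is left inside it.

```python
def seconds_hot_after_recovery(window_s, outage_s, outage_error_ratio=1.0,
                               slo=0.999, burn=14.4):
    """Seconds a single window stays above the burn threshold after recovery."""
    limit = burn * (1.0 - slo)
    # Seconds of outage the window must still contain to stay above the limit.
    needed = limit * window_s / outage_error_ratio
    if outage_s <= needed:
        return 0.0  # the window never crossed the threshold at all
    return window_s - needed


outage = 600  # a ten-minute full outage
long_hot = seconds_hot_after_recovery(3600, outage)
short_hot = seconds_hot_after_recovery(300, outage)
print(f"1h window alone stays hot for {long_hot / 60:.1f} min")
print(f"5m window alone stays hot for {short_hot / 60:.1f} min")
# The alert needs both windows hot, so it clears when the first one cools.
print(f"combined alert clears after {min(long_hot, short_hot) / 60:.1f} min")
```

On these assumptions the long window alone would keep the alert red for most of an hour, while the pair clears in about five minutes.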

The queries

Compute the SLI once

Define the symptom once as an error ratio. In Prometheus you pre-compute it per window with recording rules - cheap, readable, and reused by every dashboard. In Datadog you define the SLI once in a metric-based SLO and it computes the windows for you.

groups:
- name: payments-slo
  rules:
  # symptom SLI: fraction of bad requests (5xx only here; too-slow requests
  # need a separate latency-based rule)
  - record: job:sli_err:ratio5m
    expr: |
      sum(rate(http_requests_total{job="payments",code=~"5.."}[5m]))
      / sum(rate(http_requests_total{job="payments"}[5m]))
  - record: job:sli_err:ratio1h
    expr: |
      sum(rate(http_requests_total{job="payments",code=~"5.."}[1h]))
      / sum(rate(http_requests_total{job="payments"}[1h]))
  # ...repeat for the 30m, 6h and 3d windows you alert on
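Writing five near-identical rules by hand invites typos, so one option is to generate them. The Python sketch below is illustrative only: it reuses the metric and rule names from the rules above, the helper names are mine, and it renders plain text rather than using a YAML library.

```python
WINDOWS = ("5m", "30m", "1h", "6h", "3d")  # every window the tiers alert on


def recording_rule(window: str, job: str = "payments") -> dict:
    """One error-ratio recording rule for a single window."""
    bad = f'sum(rate(http_requests_total{{job="{job}",code=~"5.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{job="{job}"}}[{window}]))'
    return {"record": f"job:sli_err:ratio{window}", "expr": f"{bad}\n/ {total}"}


def render(rules) -> str:
    """Render rules as the indented text that sits under `rules:`."""
    lines = []
    for rule in rules:
        lines.append(f"  - record: {rule['record']}")
        lines.append("    expr: |")
        lines.extend(f"      {part}" for part in rule["expr"].splitlines())
    return "\n".join(lines)


print(render(recording_rule(w) for w in WINDOWS))
```

Paste the output under the `rules:` key and validate the file with your usual rule checks before loading it.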

# Datadog computes the windows for you: define the SLI once as a
# metric-based SLO (good events / total events). The raw error ratio,
# as a percentage:
100 * (
  sum:trace.http.request.errors{service:payments}.as_count()
  / sum:trace.http.request.hits{service:payments}.as_count()
)

Alert rules - multi-window, multi-burn-rate

The SLO is 99.9%, so the budget denominator is 0.001. Each tier needs its long and short window to both exceed the burn rate before it fires - written out explicitly in Prometheus, and a built-in parameter of Datadog's burn-rate monitor.

- alert: PaymentsBudgetFastBurn
  expr: |
    job:sli_err:ratio1h / 0.001 > 14.4
    and
    job:sli_err:ratio5m / 0.001 > 14.4
  labels: { severity: page }
  annotations:
    summary: "Payments burning budget 14.4x (2%/h)"
    runbook: "https://runbooks/payments/error-budget"

- alert: PaymentsBudgetSlowBurn
  expr: |
    job:sli_err:ratio3d / 0.001 > 1
    and
    job:sli_err:ratio6h / 0.001 > 1
  labels: { severity: ticket }
  annotations:
    summary: "Payments slow budget leak (10%/3d)"
    runbook: "https://runbooks/payments/error-budget"

# Datadog does multi-window natively - a burn-rate monitor on the SLO,
# with the long and short windows as parameters:

# fast burn -> page  (2% of the 30-day budget in 1h)
burn_rate("payments-availability").over("30d").long_window("1h").short_window("5m") > 14.4

# slow burn -> ticket  (10% in 3 days)
burn_rate("payments-availability").over("30d").long_window("3d").short_window("6h") > 1

Add the 6× / 6h / 30m tier between those two and you've covered fast outages, medium degradations and slow leaks with three alerts per service - not thirty.
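The middle tier isn't written out in the rules above. One way to keep all three consistent is to generate the expressions from the tier table, sketched here in Python. The alert name `PaymentsBudgetMediumBurn` is my invention; the other two names and the recording-rule names come from the rules shown earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    burn: float
    long_window: str
    short_window: str
    route: str


TIERS = (
    Tier("PaymentsBudgetFastBurn", 14.4, "1h", "5m", "page"),
    Tier("PaymentsBudgetMediumBurn", 6, "6h", "30m", "page"),  # hypothetical name
    Tier("PaymentsBudgetSlowBurn", 1, "3d", "6h", "ticket"),
)


def alert_expr(tier: Tier, slo: float = 0.999) -> str:
    """PromQL that fires only when both windows exceed the tier's burn rate."""
    budget = round(1.0 - slo, 10)  # 0.001 for a 99.9% SLO
    return (f"job:sli_err:ratio{tier.long_window} / {budget} > {tier.burn}\n"
            f"and\n"
            f"job:sli_err:ratio{tier.short_window} / {budget} > {tier.burn}")


for tier in TIERS:
    print(f"# {tier.name} -> {tier.route}")
    print(alert_expr(tier))
```

Changing the SLO or a window then means editing one table rather than six hand-written expressions.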

Every page has to earn its place

The maths gets you a clean signal; process keeps it clean. Hold every page to four tests:

Actionable - if on-call can't do something about it right now, it's a ticket or it's deleted.
Has a runbook - the page links to 'what's wrong and where to start', not a wiki search at 3am.
Owned - exactly one team receives it, and they can change it.
Novel - it says something the previous page didn't; collapse the duplicates.
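The four tests are easy to skip under deadline, so it can help to check them mechanically. A hedged Python sketch: the field names (`first_action`, `runbook`, `owners`, `service`, `symptom`) are a schema I made up for illustration, so map them onto however your alert definitions are actually stored.

```python
from collections import Counter


def page_problems(alert: dict) -> list:
    """Reasons this alert should not be allowed to page, empty if it passes."""
    problems = []
    if not alert.get("first_action"):
        problems.append("not actionable: make it a ticket or delete it")
    if not alert.get("runbook"):
        problems.append("no runbook link")
    owners = alert.get("owners", [])
    if len(owners) != 1:
        problems.append(f"expected exactly one owning team, found {len(owners)}")
    return problems


def duplicate_symptoms(alerts: list) -> list:
    """(service, symptom) pairs covered by more than one page."""
    counts = Counter((a.get("service"), a.get("symptom")) for a in alerts)
    return [key for key, n in counts.items() if n > 1]


alerts = [
    {"service": "payments", "symptom": "error_ratio", "owners": ["payments"],
     "runbook": "https://runbooks/payments/error-budget",
     "first_action": "check the latest deploy"},
    {"service": "payments", "symptom": "error_ratio", "owners": []},
]
for alert in alerts:
    print(page_problems(alert))
print(duplicate_symptoms(alerts))
```

Run something like this in CI against the alert definitions and a page that fails any test never reaches the pager.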

Then make two habits cheap and regular. A weekly error-budget review: how much did we spend, on what, are we on track. And an alert review: walk every page that fired and, for each, keep it, fix it, or delete it. Deleting alerts is the most underrated reliability work there is.

Anti-patterns to delete this week

Static thresholds on spiky metrics ('CPU > 80%') - they fire on healthy load.
Per-host disk or memory pages - alert on the user-facing symptom; capacity-plan the rest as tickets.
'Informational' pages - a contradiction in terms; send them to a dashboard.
Alerting on a cause whose symptom you already alert on - you'll get paged twice for one incident.
Any alert with no runbook and no owner - nobody can act on it, so it is pure noise.

How to get there without flying blind

Define one SLI and SLO per user-facing journey (checkout, login, payments) - symptoms, not hosts.
Add recording rules for the windows you'll alert on.
Deploy the three burn-rate tiers in ticket-only mode first - no paging yet.
Watch for two or three weeks; compare what would have paged against what was genuinely an incident.
Promote the 14.4× and 6× tiers to paging; keep the 1× tier as tickets.
Delete the cause-based pages it now makes redundant - one at a time, watching the budget.
Stand up the weekly budget review and the alert review, and keep pruning.
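For the two-or-three-week comparison, a simple tally is usually enough. The Python sketch below assumes you can label each would-be page and each real incident with a shared identifier, which is a simplification. The identifiers in the example are invented.

```python
def shadow_report(would_have_paged: set, real_incidents: set) -> dict:
    """Compare ticket-only alerts against what was genuinely an incident."""
    true_pages = would_have_paged & real_incidents
    return {
        "false_pages": sorted(would_have_paged - real_incidents),
        "missed_incidents": sorted(real_incidents - would_have_paged),
        # Share of would-be pages that were real incidents.
        "precision": (len(true_pages) / len(would_have_paged)
                      if would_have_paged else 1.0),
        # Share of real incidents the alerts would have caught.
        "recall": (len(true_pages) / len(real_incidents)
                   if real_incidents else 1.0),
    }


# Invented identifiers, for illustration only.
report = shadow_report(
    would_have_paged={"inc-1", "inc-2", "noise-1"},
    real_incidents={"inc-1", "inc-2", "inc-3"},
)
print(report)
```

A false page points at a threshold or window to revisit before promoting the tier; a missed incident points at a symptom the SLI does not yet measure.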

The aim was never 'fewer alerts' as a vanity metric. It's that when the phone rings it's real, it's actionable, and a runbook is waiting - so people trust it, answer it, and the system stays reliable because the humans defending it aren't exhausted.