SLOs that don't lie: measuring what users actually feel
Most reliability dashboards are green while users are swearing at their screens. The SLO says 99.95%, the CPU graphs are calm — and checkout has been failing for ten minutes. That gap is the tell: the SLO is measuring the system, not the person using it. An SLO only tells the truth when it measures what the user actually feels.
Measure what the user feels, not the machine
CPU, memory and host uptime are not user experience. A box can sit at 30% CPU while every request times out; it can run at 95% and serve everyone perfectly. The only judge of whether your service works is the user — so the indicator has to come from their side of the wire. Build SLOs on the journey (did the request succeed, was it fast enough, was the answer correct and recent), never on the infrastructure underneath.
A service level indicator (SLI) is one such number: the proportion of good events over total events. Most services need only three:
- Availability — the share of valid requests that succeeded.
- Latency — the share served faster than a threshold users notice, e.g. 95% under 300 ms.
- Quality or freshness — for data and async work, how correct or how recent the result is.
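As a minimal sketch of the good-over-total definition, the snippet below computes an availability and a latency SLI from a handful of request records. The `Request` type and its `status` and `latency_ms` fields are hypothetical, and two choices are assumptions rather than anything the definitions above dictate: 4xx responses are treated as invalid requests, and an empty window counts as fully good.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (hypothetical field name)
    latency_ms: float  # time taken to serve the request (hypothetical field name)


def sli(good: int, total: int) -> float:
    """Proportion of good events over total events.

    Assumption: a window with no traffic is reported as 1.0 (nothing failed).
    """
    return good / total if total else 1.0


def availability_sli(requests: list[Request]) -> float:
    # Assumption: client errors (4xx) are not "valid" requests, so they are
    # excluded from the denominator; server errors (5xx) count as failures.
    valid = [r for r in requests if not 400 <= r.status < 500]
    good = sum(1 for r in valid if r.status < 500)
    return sli(good, len(valid))


def latency_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    # Share of requests served faster than the threshold.
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return sli(good, len(requests))


requests = [
    Request(200, 120.0),
    Request(200, 280.0),
    Request(500, 90.0),
    Request(404, 40.0),   # client error: left out of the availability denominator
    Request(200, 900.0),
]
print(availability_sli(requests))  # 0.75 (3 good out of 4 valid)
print(latency_sli(requests))       # 0.8  (4 of 5 under 300 ms)
```

Note that the failed request was fast and the slow request succeeded: the two SLIs catch different kinds of bad experience, which is why most services track both.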
Two habits keep these honest. Use percentiles, not averages — a healthy mean hides the slow tail that real users land in, so quote p95/p99. And measure as close to the user as you can — the load balancer, the edge, real-user monitoring — because server-side metrics never see DNS failures, CDN problems, or the request that never arrived.
If percentiles feel fuzzy: p95 latency is the response time that 95% of requests come in under, so 1 in 20 requests waits longer; p99 is the time that 99% come in under, so it marks where the slowest 1% begins. An average blends the fast majority and the slow few into a single number nobody actually experiences; a percentile keeps that tail in view.
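A small worked example may make the difference concrete. The latency sample is invented for illustration, and the `percentile` helper uses the simple nearest-rank method, which is only one of several common definitions; monitoring systems usually estimate percentiles from histograms instead.

```python
import math
import statistics


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 90 fast requests, 8 slow ones, 2 very slow ones.
latencies_ms = [100.0] * 90 + [800.0] * 8 + [5000.0] * 2

print(statistics.mean(latencies_ms))  # 254.0
print(percentile(latencies_ms, 50))   # 100.0
print(percentile(latencies_ms, 95))   # 800.0
print(percentile(latencies_ms, 99))   # 5000.0
```

The mean of 254 ms describes no request in the sample: most took 100 ms, and the unlucky ones took 800 ms or five seconds. Only the percentiles show that.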
Averages hide broken users as easily as slow ones. A 99.5% success rate reads like a rounding error away from perfect — but across real traffic that fraction is a steady stream of people hitting failures, usually piled into one segment the global number can't see. Always be able to break an SLI down by journey, platform and region.
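To illustrate how a healthy global number can hide one broken segment, here is a sketch over made-up traffic. The platform names and counts are hypothetical; the same grouping would apply to journey or region.

```python
from collections import defaultdict

# Hypothetical (platform, succeeded) events: 1,000 requests in total.
events = (
    [("web", True)] * 700
    + [("ios", True)] * 250
    + [("android", True)] * 45
    + [("android", False)] * 5
)


def success_rate_by_segment(events):
    """Return {segment: success rate} for events shaped like (segment, ok)."""
    good = defaultdict(int)
    total = defaultdict(int)
    for segment, ok in events:
        total[segment] += 1
        good[segment] += ok  # True counts as 1, False as 0
    return {segment: good[segment] / total[segment] for segment in total}


overall = sum(ok for _, ok in events) / len(events)
print(overall)                          # 0.995
print(success_rate_by_segment(events))  # {'web': 1.0, 'ios': 1.0, 'android': 0.9}
```

The global SLI reads 99.5%, yet one in ten Android requests is failing. The breakdown is what turns a reassuring number into an actionable one.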
One SLO per journey — owned and tagged
SLOs follow user journeys, not your org chart and not individual services. Define one per journey — checkout, login, search — and resist rolling them into a single site-wide number: a global SLO stays green while checkout is broken, because healthy static assets drown out the failures that matter. And give each journey the target it deserves; a payment path and a recommendations widget do not need the same reliability.
The error budget — the gap between your target and 100% — then becomes the owning team's currency. Teams are largely isolated, and some surfaces absorb far more failure than others, so let each team hold and spend its own budget: it buys velocity when there's room and forces a focus on reliability when it's gone. None of this works without disciplined tagging — every metric, resource and alert labelled with service, team and journey — so each SLI rolls up to the right owner and a burning budget points straight at the team who can fix it.
Set a target you can defend
Don't reach for 99.99% by reflex. Each extra nine cuts the allowed failure by a factor of ten and typically costs far more to deliver than the last, and a target you won't actually fund is just another lie on the dashboard. Pick the number the journey genuinely needs, and keep your internal SLO stricter than any SLA you've signed, so you find out before the customer does.
An SLO of 99.9% over 30 days allows roughly 43 minutes of 'bad' per month. That number is the whole point: it turns 'are we reliable enough?' from an argument into arithmetic.
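The arithmetic is short enough to write down. This sketch assumes a time-based SLO over a fixed window; the function name and the list of targets are illustrative only.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of 'bad' allowed over the window for a time-based SLO."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


for target in (0.99, 0.999, 0.9995, 0.9999):
    print(f"{target:.2%} -> {error_budget_minutes(target):.1f} min")
# 99.00% -> 432.0 min
# 99.90% -> 43.2 min
# 99.95% -> 21.6 min
# 99.99% -> 4.3 min
```

Each extra nine divides the budget by ten: 99.99% leaves a little over four minutes a month, less than most teams need just to notice an incident.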
And don't try to nail the perfect number on day one — you can't. Set it too high and you live permanently out of budget; too low and it means nothing. Start by measuring where you actually are, set the target just above today's reality, and treat it as a moving floor: at each weekly or monthly review the SRE team looks at what it held and, if the service has improved, raises the bar a notch. The SLO ratchets upward sprint by sprint — every step a level you genuinely sustained, not one you wished for.
Make the budget a decision rule
An error budget only earns its keep when it changes behaviour, and the rule has to be agreed in advance. Budget left over: ship, take the risky migration, run the load test in prod. Budget spent: freeze risky changes and put the next sprint into reliability until you're back in the black. Reviewed weekly, it turns reliability from a feeling into a shared, owned decision.
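One way to make the rule mechanical is to compute the remaining budget from request counts and map it to a decision. This is a sketch under assumptions: the function names, the request-based budget, and the two-outcome policy are illustrative, and a real team might add intermediate states such as "ship with extra review".

```python
def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget left: 1.0 is untouched, 0.0 or below is spent."""
    allowed_bad = (1 - slo) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # Either no traffic or a 100% target: any failure overspends the budget.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1 - actual_bad / allowed_bad


def release_policy(remaining: float) -> str:
    # The agreed rule: budget left means ship, budget spent means freeze.
    return "ship" if remaining > 0 else "freeze risky changes"


# Hypothetical 30-day counts for a journey with a 99.9% SLO.
healthy = budget_remaining(0.999, good=999_400, total=1_000_000)
overspent = budget_remaining(0.999, good=998_500, total=1_000_000)

print(round(healthy, 2), release_policy(healthy))      # 0.4 ship
print(round(overspent, 2), release_policy(overspent))  # -0.5 freeze risky changes
```

Because the rule is a pure function of numbers everyone can see, the weekly review argues about the inputs, not about the outcome.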
Alert on the journey, navigate to the cause
Because the SLO measures the journey, that is the only thing worth paging on. The alert fires on a user-facing budget burn and opens the journey dashboard — success rate, p95, budget remaining. From there you navigate down: to the service dashboards for the components on the path, then to the dependent resources — database, cache, queue, upstream API. Seeing those dependencies is invaluable for diagnosis; it is never a reason to page. Consistent tags are what make that drill-down possible — they let one templated dashboard link to the next, instead of leaving you with a pile of disconnected screens.
The SLO is also the input to your alerting: feed the budget burn rate into multi-window alerts, so a brief blip is a footnote and a sustained burn is a page (I cover that mechanism in 'Designing alerts nobody ignores'). The dashboard chain then carries you from the page to the cause in three clicks, not thirty.
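For orientation, the core of a multi-window burn-rate check fits in a few lines. This is a generic sketch, not necessarily the mechanism described in the other article: burn rate is taken as the observed error ratio divided by the budgeted error ratio, and the 14.4 threshold is an assumption, borrowed from a commonly cited pairing of a one-hour window with a five-minute window.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return error_ratio / (1 - slo)


def should_page(long_window_errors: float, short_window_errors: float,
                slo: float, threshold: float = 14.4) -> bool:
    # Page only when both windows burn fast: the long window shows the burn
    # is sustained, the short window shows it is still happening right now.
    return (burn_rate(long_window_errors, slo) >= threshold
            and burn_rate(short_window_errors, slo) >= threshold)


SLO = 0.999
print(should_page(0.02, 0.03, SLO))    # True: sustained and still burning
print(should_page(0.001, 0.05, SLO))   # False: a brief blip
print(should_page(0.02, 0.0005, SLO))  # False: already recovered
```

The second and third cases are the ones a single-window alert gets wrong: it either pages on the blip or keeps paging after recovery.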
Compute it automatically
None of this should be hand-maintained. Compute SLIs from the telemetry you already have — Prometheus, Datadog, load-balancer logs — as recording rules, and surface one panel per journey: current SLI, target, budget remaining, burn rate. If producing the number is manual it will rot; if it's automatic it becomes the thing everyone checks first.
Under the hood the SLI is one small query: count the requests, count the errors, divide. Define it once as a recording rule and every dashboard and alert reuses the same number.
# A: requests over the window
- record: checkout:requests:rate5m
  expr: sum(rate(http_requests_total{job="checkout"}[5m]))
# B: errors (HTTP 5xx) over the same window
- record: checkout:errors:rate5m
  expr: sum(rate(http_requests_total{job="checkout", code=~"5.."}[5m]))
# C: error percentage = B / A (the SLI's "bad" share)
- record: checkout:error_pct5m
  expr: 100 * checkout:errors:rate5m / checkout:requests:rate5m
Review and evolve
SLOs are living promises, not a one-off spreadsheet. Revisit the targets as the product and users' expectations change. A budget you never burn means the target is too low — or you're over-investing in reliability nobody asked for; a budget you blow every month means the target is unrealistic or the service needs real work. The right SLO sits where it occasionally, usefully, hurts.
An SLO that measures the user, set to a number you'll actually fund, owned by the team that can move it, and wired to alerts that drill straight to the cause — that is an SLO that tells the truth. Everything else is a green light over a burning building.