Observability

Turn opaque systems into ones you can question and get answers from.

When something's slow or broken, the only question that matters is 'why?' - and dashboards full of CPU graphs rarely answer it. I instrument your systems with metrics, structured logs and distributed tracing so you can ask new questions of production without shipping new code.

Done well, observability turns a multi-hour, multi-team incident hunt into a few minutes of following the evidence to the root cause.

What's included

Metrics pipelines (Prometheus, Datadog)
Structured, queryable logging
Distributed tracing across services
Dashboards that surface what matters
Correlation IDs & end-to-end visibility
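
To make the logging and correlation-ID items concrete, here is a minimal sketch using only Python's standard library. The logger name `checkout`, the `fields` key and the `latency_ms` value are hypothetical, chosen for illustration; a real service would usually take the ID from an incoming request header and pass it on to downstream calls.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the correlation ID for the request being handled (hypothetical name).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so log lines can be queried."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        # Anything passed as extra={"fields": {...}} becomes a top-level key.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id=None) -> str:
    """Reuse the caller's correlation ID if there is one, otherwise mint one."""
    cid = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("payment authorised", extra={"fields": {"latency_ms": 182}})
    finally:
        correlation_id.reset(token)
    return cid
```

Every line emitted while a request is in flight carries the same `correlation_id`, so a single query on that field pulls together the whole journey across services.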

6 Jun 2026

SLOs that don't lie: measuring what users actually feel

Most SLOs are green while users suffer - they measure the system, not the person. How to build SLIs from real user journeys, give each journey the target it deserves, turn the gap into a team-owned error budget, and wire alerts that drill straight to the cause.
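
As a rough illustration of the error-budget arithmetic, the sketch below converts an SLO target into an allowance of "bad" minutes and reports how much of the budget a journey has left. The function names and the 30-day window are assumptions for the example, not taken from the article.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of failure the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad
```

A 99.9% target over 30 days allows roughly 43 minutes of failure. With 1,000,000 checkout attempts and 500 failures, about half of that budget is still available.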

5 Jun 2026

Designing alerts nobody ignores

Noisy alerts train your team to ignore the real one. A deep, practical guide to symptom-based, multi-window multi-burn-rate SLO alerting - the burn-rate maths, copy-pasteable PromQL, and the on-call process that makes pages trustworthy again.
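
The burn-rate idea can be sketched in a few lines. Burn rate is the observed error ratio divided by the ratio the SLO allows; a multi-window rule pages only when both a long and a short window exceed the threshold, so the page clears soon after the problem stops. The default threshold of 14.4 is the figure commonly quoted for a one-hour window against a 30-day SLO (about 2% of the budget spent in that hour); treat it and the function names as assumptions, not as the article's exact rules.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def budget_spent(rate: float, window_hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the whole error budget consumed at this rate over the window."""
    return rate * window_hours / (slo_window_days * 24)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,
) -> bool:
    """Page only if both windows burn faster than the threshold."""
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

With a 99.9% SLO, a 2% error ratio in both the one-hour and five-minute windows is a burn rate of about 20 and would page. If the five-minute window has already recovered to 0.05%, the same rule stays quiet.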

Site Reliability Engineering

Incident Management & On-Call
SLOs, SLIs & Error Budgets
Monitoring & Alerting
Performance & Load Engineering
Resilience & Disaster Recovery
Production Readiness Reviews
Toil Reduction & Automation

Let's talk about your project.

Tell me about your system and what you're trying to achieve - I'll tell you honestly how I can help.

Start a conversation