Designing SLIs That Actually Matter
Most teams track the wrong signals. Here's how to choose SLIs that reflect user experience and drive meaningful SLO conversations.
Not all metrics deserve SLO treatment. The difference between a useful SLI and dashboard noise comes down to one question: does this metric reflect a user-facing experience?
Start with the user journey
Before opening your metrics tooling, map the critical paths users take through your system. For an API platform, that might be authentication, core CRUD operations, and async job completion. Each path gets its own SLI candidates — but only if a degradation directly impacts the user.
Good SLI candidates:
- Request success rate: the ratio of successful responses to valid requests
- Latency at the edge: p99 measured from the client's perspective rather than across internal service-to-service hops
- Freshness: the age of the data served, judged against how stale it can get before users notice
Avoid treating infrastructure metrics such as CPU utilization or pod count as SLIs unless they correlate directly with user pain.
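The three candidates above can all be derived from request-level records. Here is a minimal sketch, assuming hypothetical field names (`status`, `client_latency_ms`), a nearest-rank percentile, and the convention that 4xx responses do not count as valid requests. None of these choices come from a specific tool, so adjust them to whatever your edge telemetry actually records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int               # HTTP status code returned to the client
    client_latency_ms: float  # latency observed at the edge (assumed field)


def success_rate(requests: list[Request]) -> float:
    """Successful responses divided by valid requests.

    Assumption: 4xx responses are client errors and are excluded as "not valid".
    """
    valid = [r for r in requests if not 400 <= r.status < 500]
    if not valid:
        return 1.0  # no valid traffic; treating that as healthy is a policy choice
    return sum(1 for r in valid if r.status < 500) / len(valid)


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[rank - 1]


def fresh_ratio(ages_s: list[float], max_age_s: float) -> float:
    """Share of reads that served data no older than max_age_s."""
    if not ages_s:
        return 1.0
    return sum(1 for age in ages_s if age <= max_age_s) / len(ages_s)
```

Calling `percentile(latencies, 99)` on client-side timings gives the edge p99. Running the same function on internal hop timings would answer a different question.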
The three SLI types
Google's SRE workbook lists three SLI types for request-driven services, which cover most user-facing systems:
- Availability — Is the system responding successfully?
- Latency — Is it responding fast enough?
- Quality — Is the response correct and complete?
Pick one primary SLI per user journey. Adding more tends to create alert fatigue without improving reliability decisions.
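All three types can be written the same way: good events divided by total events, with only the definition of "good" changing. The sketch below assumes a hypothetical `Event` record and an illustrative 300 ms latency threshold. It is one way to frame the types, not a prescribed implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Event:
    ok: bool           # the request received a non-error response
    latency_ms: float  # time taken to respond
    degraded: bool     # response came from a fallback or was incomplete


def sli(events: Iterable[Event], is_good: Callable[[Event], bool]) -> float:
    """Fraction of events that satisfy the given definition of 'good'."""
    events = list(events)
    if not events:
        return 1.0  # no traffic; treating that as healthy is a policy choice
    return sum(1 for e in events if is_good(e)) / len(events)


events = [
    Event(ok=True, latency_ms=120, degraded=False),
    Event(ok=True, latency_ms=450, degraded=False),
    Event(ok=True, latency_ms=90, degraded=True),
    Event(ok=False, latency_ms=30, degraded=False),
]

availability = sli(events, lambda e: e.ok)                     # 3 of 4 good
latency = sli(events, lambda e: e.ok and e.latency_ms <= 300)  # 2 of 4 good
quality = sli(events, lambda e: e.ok and not e.degraded)       # 2 of 4 good
```

Keeping one shared `sli` function makes it cheap to compute all three for a journey and then decide which one deserves to be the primary.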
Setting error budgets that stick
An error budget is only useful if leadership treats it as a real constraint. When the budget is exhausted, feature work stops and reliability work begins — no exceptions.
# Example SLO definition
slo:
  target: 99.9%
  window: 30d
  slis:
    - name: api_availability
      good_events: http_requests{status!~"5.."}
      total_events: http_requests
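It helps to translate the target into an allowance people can reason about. This is a sketch of the arithmetic for the 99.9% over 30 days example. The function names are illustrative, and the request volume in the usage note is made up.

```python
def error_budget_fraction(target_percent: float) -> float:
    """Share of events (or time) allowed to be bad under the target."""
    return 1 - target_percent / 100


def allowed_bad_minutes(target_percent: float, window_days: float) -> float:
    """Minutes of total outage the window can absorb before the SLO is missed."""
    return error_budget_fraction(target_percent) * window_days * 24 * 60


def allowed_bad_events(target_percent: float, total_events: int) -> float:
    """Number of failed requests the window can absorb at a given volume."""
    return error_budget_fraction(target_percent) * total_events


# 99.9% over 30 days leaves roughly 43.2 minutes of full outage.
minutes = allowed_bad_minutes(99.9, 30)
```

At a hypothetical 10 million requests per window, `allowed_bad_events(99.9, 10_000_000)` works out to about 10,000 failed requests. That number is usually easier to discuss in a planning meeting than a percentage.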
Make error budget burn visible in your team's weekly review. When burn rate spikes, the conversation should shift from "who broke prod" to "what trade-off are we making."
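Burn rate is the observed error rate divided by the error rate the SLO allows. A value of 1 means the budget runs out exactly at the end of the window. The following is a minimal sketch with illustrative function names. The exhaustion estimate assumes a full budget and a constant burn rate, which real incidents rarely respect.

```python
def burn_rate(bad_events: int, total_events: int, target_percent: float) -> float:
    """Observed error rate relative to the error rate the SLO permits."""
    if total_events == 0:
        return 0.0  # no traffic, nothing burned
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1 - target_percent / 100
    return observed_error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: float) -> float:
    """Days until a full budget is gone if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# 50 failures in 10,000 requests against a 99.9% target burns about 5x too fast.
rate = burn_rate(bad_events=50, total_events=10_000, target_percent=99.9)
remaining_days = days_to_exhaustion(rate, window_days=30)  # about 6 days
```

Putting `rate` and `remaining_days` in the weekly review gives the trade-off conversation a concrete starting point.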
Security as an SLI dimension
Availability SLIs often ignore security events. A successful response that leaks data is still a failure. Consider augmenting quality SLIs with:
- Authentication failure rates (detecting brute force)
- Policy violation counts from OPA or Kyverno
- Certificate expiry windows
Reliability and security share the same foundation: knowing when your system is not behaving as intended.
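Two of the signals above reduce to simple arithmetic once the raw counts and dates are available. The sketch below uses hypothetical thresholds (a 20% failure rate and 14 days to expiry) purely for illustration. How you obtain login counts or certificate dates depends on your identity provider and certificate tooling and is not shown.

```python
from datetime import datetime, timedelta, timezone


def auth_failure_rate(failed_logins: int, total_logins: int) -> float:
    """Share of login attempts that failed in the measurement interval."""
    return failed_logins / total_logins if total_logins else 0.0


def days_until_expiry(not_after: datetime, now: datetime) -> float:
    """Days left before a certificate's notAfter timestamp."""
    return (not_after - now).total_seconds() / 86_400


def needs_attention(
    failure_rate: float,
    days_left: float,
    max_failure_rate: float = 0.2,  # hypothetical threshold
    min_days_left: float = 14,      # hypothetical threshold
) -> bool:
    """True when either security signal crosses its threshold."""
    return failure_rate > max_failure_rate or days_left < min_days_left


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
cert_days = days_until_expiry(now + timedelta(days=10), now)  # 10.0
flagged = needs_attention(auth_failure_rate(5, 100), cert_days)
```

Here `flagged` is true because the certificate has fewer than 14 days left, even though the login failure rate looks normal.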
What to do next
Audit your current dashboards. For each metric, ask whether a user would notice if it degraded. If the answer is no, demote it from SLO consideration and keep it for debugging only.