Designing SLIs That Actually Matter

Not all metrics deserve SLO treatment. The difference between a useful SLI and dashboard noise comes down to one question: does this metric reflect a user-facing experience?

Start with the user journey

Before opening your metrics tooling, map the critical paths users take through your system. For an API platform, that might be authentication, core CRUD operations, and async job completion. Each path yields its own SLI candidates, but a metric qualifies only if its degradation directly affects the user.

Good SLI candidates:

Request success rate — ratio of successful responses to valid requests
Latency at the edge — p99 from the client's perspective, not internal service-to-service hops
Freshness — how stale can data be before users notice?

Avoid vanity metrics like CPU utilization or pod count unless they directly correlate with user pain.
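
The first two candidates above can be computed directly from request records captured at the edge. The following is a minimal sketch, not a production implementation. The `Request` record, the choice to treat 4xx responses as invalid requests, and the nearest-rank percentile method are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # latency measured at the edge, in milliseconds


def success_rate(requests):
    """Ratio of successful responses to valid requests.

    Assumption: 4xx responses are invalid requests and are excluded from
    the denominator; 5xx responses count as failures.
    """
    valid = [r for r in requests if not 400 <= r.status < 500]
    if not valid:
        return None  # no valid traffic, so the SLI is undefined
    good = sum(1 for r in valid if r.status < 400)
    return good / len(valid)


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        return None
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic traffic: 95 fast successes, 3 client errors, 2 slow server errors.
requests = (
    [Request(200, 120.0)] * 95
    + [Request(404, 30.0)] * 3
    + [Request(500, 900.0)] * 2
)

rate = success_rate(requests)
p99 = percentile([r.latency_ms for r in requests], 99)
print(f"success rate: {rate:.4f}, p99 latency: {p99} ms")
```

In this synthetic sample the median latency looks healthy, while the p99 exposes the slow failing requests. That gap is why tail latency is usually the better SLI candidate.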

The three SLI types

For request-driven services, Google's Site Reliability Workbook suggests three SLI types that cover most systems:

Availability — Is the system responding successfully?
Latency — Is it responding fast enough?
Quality — Is the response correct and complete?

Pick one primary SLI per user journey. Adding more creates alert fatigue without improving reliability decisions.
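
All three types can be expressed the same way: as the fraction of events that count as good. The sketch below illustrates that shared shape. The `Response` fields, the `complete` flag, and the 300 ms threshold are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int
    latency_ms: float
    complete: bool  # hypothetical flag: payload was correct and not degraded


def sli(events, is_good):
    """Fraction of events for which is_good(event) is true."""
    events = list(events)
    if not events:
        return None  # undefined without traffic
    return sum(1 for e in events if is_good(e)) / len(events)


LATENCY_THRESHOLD_MS = 300  # assumed threshold for "fast enough"

responses = [
    Response(200, 80.0, True),
    Response(200, 450.0, True),   # succeeded, but too slow
    Response(200, 120.0, False),  # succeeded, but incomplete
    Response(503, 20.0, False),   # failed outright
]

availability = sli(responses, lambda r: r.status < 500)
latency = sli(responses, lambda r: r.latency_ms <= LATENCY_THRESHOLD_MS)
quality = sli(responses, lambda r: r.status < 500 and r.complete)

print(availability, latency, quality)
```

Because each SLI is a ratio of good events to total events, any of them can serve as the primary SLI for a journey without changing how the SLO or its error budget is calculated.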

Setting error budgets that stick

An error budget is only useful if leadership treats it as a real constraint. When the budget is exhausted, feature work stops and reliability work begins — no exceptions.

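# Example SLO definition
slo:
  target: 99.9%   # share of events that must be good
  window: 30d     # evaluation window
  slis:
    - name: api_availability
      # Good events: every response except 5xx server errors
      good_events: 'http_requests{status!~"5.."}'
      # Total events: all requests received
      total_events: 'http_requests'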
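
To make the budget concrete, it helps to translate the target into numbers. The sketch below works through the arithmetic for the 99.9% target and 30-day window above. The function names and the monthly request volume are assumptions for illustration, and `Fraction` is used only to avoid floating-point rounding noise.

```python
from fractions import Fraction


def error_budget(target):
    """Fraction of events allowed to fail, e.g. '0.999' -> 1/1000."""
    return 1 - Fraction(target)


def budget_minutes(target, window_days):
    """Time-based view: minutes of full outage the budget would cover."""
    return float(error_budget(target) * window_days * 24 * 60)


def allowed_failures(target, total_requests):
    """Request-based view: how many requests may fail within the window."""
    return int(error_budget(target) * total_requests)


# Assumed volume: 10 million requests over the 30-day window.
print(budget_minutes("0.999", 30))
print(allowed_failures("0.999", 10_000_000))
```

Under these assumptions the budget comes to about 43.2 minutes of full outage, or 10,000 failed requests. Stating the budget in those terms tends to make the trade-off easier to discuss than a bare percentage.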

Make error budget burn visible in your team's weekly review. When burn rate spikes, the conversation should shift from "who broke prod" to "what trade-off are we making."
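
Burn rate is commonly defined as the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 consumes the budget exactly over the window. The following sketch assumes that definition. The function names and sample numbers are hypothetical, and the exhaustion estimate assumes the full budget is still available.

```python
def burn_rate(bad_events, total_events, target):
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events == 0:
        return 0.0  # no traffic, nothing burned
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1 - target
    return observed_error_rate / allowed_error_rate


def days_to_exhaustion(rate, window_days):
    """Days until a full budget is spent if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# Assumed sample: 50 failed requests out of 10,000 in the last hour.
rate = burn_rate(bad_events=50, total_events=10_000, target=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget exhausted in about {days_to_exhaustion(rate, 30):.1f} days")
```

Here a 0.5% error rate against a 0.1% allowance gives a burn rate of roughly 5, which would spend a 30-day budget in about 6 days. A figure like that frames the weekly review around the trade-off instead of the incident.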

Security as an SLI dimension

Availability SLIs often ignore security events. A successful response that leaks data is still a failure. Consider augmenting quality SLIs with:

Authentication failure rates (to detect brute-force attempts)
Policy violation counts from OPA or Kyverno
Certificate expiry windows

Reliability and security share the same foundation: knowing when your system is not behaving as intended.
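
One way to encode "a successful response that leaks data is still a failure" is to tighten the definition of a good event. This is a minimal sketch under stated assumptions: the `Event` record and its `policy_violation` flag are hypothetical, standing in for whatever signal a policy engine or audit log provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    status: int
    policy_violation: bool  # hypothetical flag derived from policy or audit logs


def is_available(event):
    """Availability view: anything that is not a server error is good."""
    return event.status < 500


def is_good_quality(event):
    """Quality view: the response must also be free of policy violations."""
    return is_available(event) and not event.policy_violation


# Synthetic sample: 997 clean successes, 2 successes that violated policy, 1 server error.
events = (
    [Event(200, False)] * 997
    + [Event(200, True)] * 2
    + [Event(500, False)]
)

availability = sum(is_available(e) for e in events) / len(events)
quality = sum(is_good_quality(e) for e in events) / len(events)
print(availability, quality)
```

In this sample availability meets a 99.9% target while the security-aware quality SLI falls to 99.7%, which is the gap an availability-only SLO would hide.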

What to do next

Audit your current dashboards. For each metric, ask whether a user would notice if it degraded. If the answer is no, demote it from SLO consideration and keep it for debugging only.

Need help designing SLOs for your stack? Get in touch — we embed security and reliability metrics into a single operational framework.