Systems Sunday: Applying SRE Principles to AI Agents — Runbooks, Incident Response, and the New Ops Stack

Published March 15, 2026 — 9 min read

TL;DR: Site reliability engineering (SRE) was built to manage systems that are complex but deterministic. AI agents keep the complexity and drop the determinism — they're probabilistic, context-sensitive, and capable of surprising you in production. The good news: SRE principles translate directly to agent ops with some modifications. This post covers how to adapt the SRE playbook — runbooks, incident response, SLOs, and on-call escalation — for the age of autonomous AI agents.

The Problem: Agents Break Differently Than Services

Traditional SRE knows how to handle a crashed service: check the logs, find the error, restart or roll back. The failure is deterministic — the same input produces the same failure, and the fix is reproducible.

AI agents fail differently:

  - Silent degradation: the agent keeps running and returning outputs, but quality has dropped below acceptable thresholds.
  - Prompt injection: the agent behaves unexpectedly after processing adversarial external content.
  - Cost spirals: retry loops consume unbounded budget while every infrastructure health check stays green.
  - Data quality failures: the agent confidently writes low-confidence or wrong outputs into external systems.
  - Non-reproducibility: the same input can produce a different failure on the next run, so "reproduce and fix" often doesn't apply.

This means traditional SRE tooling catches the obvious failures (the agent crashes, the API times out) but misses the subtle ones (the agent writes wrong data confidently, the agent starts routing tasks to the wrong tool).
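The gap between the two failure classes can be made concrete: uptime checks answer "is the agent running?", but silent degradation needs a rolling quality check on outputs. A minimal sketch, assuming a hypothetical `passes_eval` signal from whatever eval your team runs (LLM judge, schema check, regex):

```python
from collections import deque

class QualityMonitor:
    """Tracks a rolling output-quality rate alongside uptime checks.

    The boolean fed to record() is assumed to come from your own eval
    layer -- this class only does the windowed bookkeeping.
    """

    def __init__(self, window: int = 200, threshold: float = 0.90):
        self.results = deque(maxlen=window)  # last N eval outcomes
        self.threshold = threshold

    def record(self, output_passed_eval: bool) -> None:
        self.results.append(output_passed_eval)

    def is_degraded(self) -> bool:
        # The service can be "up" while this returns True: that is the
        # silent-degradation case that uptime checks never see.
        if len(self.results) < self.results.maxlen:
            return False  # not enough samples to judge yet
        quality = sum(self.results) / len(self.results)
        return quality < self.threshold
```

Wiring `is_degraded()` into the same alerting path as your liveness probes is the smallest change that catches the "agent writes wrong data confidently" class of failure.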

The SRE Tooling Shift: AI Is Coming to Ops Too

Three releases this week signal that the ops tooling category is adapting.

Microsoft Azure SRE Agent went GA (Microsoft Community Hub, March 2026). It automates multi-layer root cause analysis across apps, platform, and infrastructure — correlating telemetry, testing hypotheses, and explaining findings. For teams already running agents on Azure, this is the first native tool that brings AI-assisted incident investigation into the platform.

PagerDuty expanded its AI ecosystem to 30+ partners (BusinessWire, March 12), with a "context flywheel" that feeds observability telemetry into automated triage. Their SRE Agent can automatically correlate alerts and accelerate root cause analysis. They're also shipping pre-commit risk scoring directly in IDEs — catching risky agent code changes before they reach production.

Rootly and StackGen are among a growing cluster of AI SRE tools built around reducing mean time to resolution (MTTR) for complex, distributed systems — including systems where AI agents are part of the stack.

The common thread: ops tooling is being rebuilt to handle systems where agents are both the operators and the components being operated.

Adapting SRE for Agent Ops

1. Write Runbooks for Agent Failure Modes, Not Just Service Failures

A traditional runbook covers: "Service X is down. Check these logs. Run this restart command." An agent runbook needs to cover:

  - Silent degradation: outputs look normal but quality has dropped — which signals detect it and which eval confirms it.
  - Prompt injection response: what to do when the agent behaves unexpectedly after processing external content.
  - Cost spiral: how to cap or halt retry loops that are consuming unbounded budget.
  - Data quality failure: how to find and quarantine low-confidence outputs the agent has already written.

Each entry should include detection signals, immediate containment steps, a root cause investigation procedure, and a rollback path.

Write these before you need them. The 2 AM incident is not when you want to figure out your agent's failure taxonomy.
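One way to keep runbooks from rotting in a wiki is to store them as data next to the agent's code, where a prompt change and its runbook update land in the same PR. A sketch, with entry contents illustrative rather than prescriptive:

```python
from dataclasses import dataclass

@dataclass
class RunbookEntry:
    """One page per failure mode, written before the incident."""
    failure_mode: str
    detection_signals: list[str]
    containment: list[str]    # immediate steps, in order
    investigation: list[str]  # root cause procedure
    rollback: str

RUNBOOK = {
    "silent_degradation": RunbookEntry(
        failure_mode="Outputs look normal but eval pass rate has dropped",
        detection_signals=["output quality rate below SLO",
                           "human escalation rate rising"],
        containment=["pause new task intake",
                     "route in-flight outputs to human review"],
        investigation=["diff prompt versions from the last 48 hours",
                       "re-run the eval suite against recent outputs"],
        rollback="revert to the last prompt version that passed evals",
    ),
    # ...entries for prompt injection, cost spiral, data quality failure
}
```

Because the entries are structured, the on-call tooling can render the right page automatically when the corresponding alert fires.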

2. Define Agent SLOs Differently

Traditional SLOs measure uptime, latency, and error rate. For agents, add:

  - Task completion rate: the percentage of tasks that reach a successful terminal state.
  - Output quality rate: the percentage of outputs that pass eval checks.
  - Human escalation rate: the percentage of tasks requiring human review.
  - Tool call accuracy: the percentage of tool calls returning expected response shapes.
  - Cost per successful completion: total spend divided by successful tasks.

3. On-Call for Agent Systems

On-call for an agent system requires different instincts:

  - Don't restart first. The agent may have left partial writes in external systems; restarting can compound the damage.
  - Audit the full tool call trace before taking any action.
  - Check for data corruption in every system the agent touched.
  - Treat prompt changes as potential root causes with the same rigor as code changes.

4. Chaos Engineering for Agents

Traditional chaos engineering: kill a service and see if the system recovers. Agent chaos engineering:

  - Return malformed data from a tool and verify the agent interrupts rather than proceeding on garbage.
  - Simulate context window overflow and verify graceful degradation.
  - Inject adversarial prompt content and verify the guardrails hold.
  - Starve the retrieval system and verify the agent escalates rather than fabricating.

These tests should run in your staging environment before any agent promotion to production.

The Practical Starting Point

If you're running agents in production without an ops framework, here's the minimum viable SRE posture:

  1. Incident taxonomy: write down your agent's known failure modes and what "bad" looks like for each one
  2. Runbooks: one page per failure mode, written before the incident
  3. Alerting on agent-specific metrics: task completion rate, cost/completion, human escalation rate — not just infrastructure metrics
  4. A kill switch: a way to halt the agent and prevent new task starts, separate from infrastructure-level restarts
  5. Post-incident reviews that include prompt history: if something went wrong, the first question should be "was there a prompt change in the 48 hours before this?"
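Item 4, the kill switch, is worth sketching because "separate from infrastructure-level restarts" is the point: halting task intake must not require touching the process. A minimal in-memory sketch — a real deployment would back the flag with a shared store (a feature flag service or a key in Redis), which is an assumption here, not a prescription:

```python
import threading

class KillSwitch:
    """Halts new task starts without touching infrastructure.

    In-flight tasks can finish (or be audited); nothing new begins.
    """

    def __init__(self):
        self._halted = threading.Event()
        self.reason = ""

    def halt(self, reason: str) -> None:
        self.reason = reason
        self._halted.set()

    def guard(self) -> None:
        """Call at the top of every task-start path."""
        if self._halted.is_set():
            raise RuntimeError(f"agent halted: {self.reason}")
```

Keeping the guard at the task-start boundary, rather than inside the agent loop, means a halt never interrupts a write mid-flight — which matters given the partial-write hazards discussed above.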

FAQ

What is an AI SRE agent?

An AI SRE agent is an autonomous system designed to assist with site reliability engineering tasks — specifically root cause analysis, alert correlation, incident investigation, and remediation recommendation. Microsoft's Azure SRE Agent (GA March 2026) and PagerDuty's SRE Agent are current examples. They're designed to accelerate MTTR by automating the investigation steps that traditionally require experienced engineers.

How do you define SLOs for AI agents?

AI agent SLOs should include metrics that traditional SLOs miss: task completion rate (% of tasks that reach a successful terminal state), output quality rate (% passing eval checks), human escalation rate (% requiring human review), tool call accuracy (% returning expected response shapes), and cost per successful completion. Tracking these alongside standard latency and error rate metrics gives a complete picture of agent reliability.

How is AI agent incident response different from traditional incident response?

The key differences: agents may leave partial writes in external systems when they fail, so auditing before restarting is critical; root cause often lies in a prompt change rather than code; "silent degradation" (the agent keeps running but produces wrong outputs) requires quality monitoring to detect, not just uptime checks; and restoring service may require validating data integrity in external systems the agent touched.

What should an AI agent runbook cover?

Agent runbooks should document failure modes specific to agent behavior: silent degradation (outputs look normal but quality has dropped), prompt injection response (agent behaves unexpectedly after processing external content), cost spiral (retry loops consuming unbounded budget), and data quality failure (agent writing low-confidence outputs). Each runbook entry should include detection signals, immediate containment steps, root cause investigation procedure, and rollback path.

What is chaos engineering for AI agents?

Chaos engineering for AI agents involves deliberately injecting failure conditions to verify the system's resilience: returning malformed data from tools to test interrupt logic, simulating context window overflow to test graceful degradation, injecting adversarial prompt content to test guardrail effectiveness, and starving the retrieval system to verify the agent escalates rather than fabricating. These tests should run in staging before any agent is promoted to production.

How do you handle on-call for AI agent systems?

On-call for agent systems requires a different default instinct than traditional service on-call: don't restart first (the agent may have left partial writes in external systems), audit the full tool call trace before taking any action, check for data corruption in systems the agent touched, and treat prompt changes as potential root causes with the same rigor as code changes.

Sources