Systems Sunday: Applying SRE Principles to AI Agents — Runbooks, Incident Response, and the New Ops Stack

Published March 15, 2026 — 9 min read

TL;DR: Site reliability engineering (SRE) was built to manage systems that are complex but deterministic. AI agents keep the complexity and drop the determinism — they're probabilistic, context-sensitive, and capable of surprising you in production. The good news: SRE principles translate directly to agent ops with some modifications. This post covers how to adapt the SRE playbook — runbooks, incident response, SLOs, and on-call escalation — for the age of autonomous AI agents.

The Problem: Agents Break Differently Than Services

Traditional SRE knows how to handle a crashed service: check the logs, find the error, restart or roll back. The failure is deterministic — the same input produces the same failure, and the fix is reproducible.

AI agents fail differently:

  - Silent degradation: the agent keeps running and returning outputs, but quality has dropped below acceptable thresholds.
  - Prompt injection: the agent behaves unexpectedly after processing adversarial external content.
  - Cost spirals: retry loops consume unbounded budget while every infrastructure health check stays green.
  - Data quality failures: the agent confidently writes low-confidence or wrong outputs into external systems.
  - Non-reproducibility: the same input can produce a different failure on the next run, so "reproduce and fix" often doesn't apply.

This means traditional SRE tooling catches the obvious failures (the agent crashes, the API times out) but misses the subtle ones (the agent writes wrong data confidently, the agent starts routing tasks to the wrong tool).
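The gap between the two failure classes can be made concrete: uptime checks answer "is the agent running?", but silent degradation needs a rolling quality check on outputs. A minimal sketch, assuming a hypothetical `passes_eval` signal from whatever eval your team runs (LLM judge, schema check, regex):

```python
from collections import deque

class QualityMonitor:
    """Tracks a rolling output-quality rate alongside uptime checks.

    The boolean fed to record() is assumed to come from your own eval
    layer -- this class only does the windowed bookkeeping.
    """

    def __init__(self, window: int = 200, threshold: float = 0.90):
        self.results = deque(maxlen=window)  # last N eval outcomes
        self.threshold = threshold

    def record(self, output_passed_eval: bool) -> None:
        self.results.append(output_passed_eval)

    def is_degraded(self) -> bool:
        # The service can be "up" while this returns True: that is the
        # silent-degradation case that uptime checks never see.
        if len(self.results) < self.results.maxlen:
            return False  # not enough samples to judge yet
        quality = sum(self.results) / len(self.results)
        return quality < self.threshold
```

Wiring `is_degraded()` into the same alerting path as your liveness probes is the smallest change that catches the "agent writes wrong data confidently" class of failure.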

The SRE Tooling Shift: AI Is Coming to Ops Too

Three releases this week signal that the ops tooling category is adapting.

Microsoft Azure SRE Agent went GA (Microsoft Community Hub, March 2026). It automates multi-layer root cause analysis across apps, platform, and infrastructure — correlating telemetry, testing hypotheses, and explaining findings. For teams already running agents on Azure, this is the first native tool that brings AI-assisted incident investigation into the platform.

PagerDuty expanded its AI ecosystem to 30+ partners (BusinessWire, March 12), with a "context flywheel" that feeds observability telemetry into automated triage. Their SRE Agent can automatically correlate alerts and accelerate root cause analysis. They're also shipping pre-commit risk scoring directly in IDEs — catching risky agent code changes before they reach production.

Rootly and StackGen are among a growing cluster of AI SRE tools built around reducing mean time to resolution (MTTR) for complex, distributed systems — including systems where AI agents are part of the stack.

The common thread: ops tooling is being rebuilt to handle systems where agents are both the operators and the components being operated.

Adapting SRE for Agent Ops

1. Write Runbooks for Agent Failure Modes, Not Just Service Failures

A traditional runbook covers: "Service X is down. Check these logs. Run this restart command." An agent runbook needs to cover:

  - Silent degradation: outputs look normal but quality has dropped — which signals detect it and which eval confirms it.
  - Prompt injection response: what to do when the agent behaves unexpectedly after processing external content.
  - Cost spiral: how to cap or halt retry loops that are consuming unbounded budget.
  - Data quality failure: how to find and quarantine low-confidence outputs the agent has already written.

Each entry should include detection signals, immediate containment steps, a root cause investigation procedure, and a rollback path.

Write these before you need them. The 2 AM incident is not when you want to figure out your agent's failure taxonomy.
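One way to keep runbooks from rotting in a wiki is to store them as data next to the agent's code, where a prompt change and its runbook update land in the same PR. A sketch, with entry contents illustrative rather than prescriptive:

```python
from dataclasses import dataclass

@dataclass
class RunbookEntry:
    """One page per failure mode, written before the incident."""
    failure_mode: str
    detection_signals: list[str]
    containment: list[str]    # immediate steps, in order
    investigation: list[str]  # root cause procedure
    rollback: str

RUNBOOK = {
    "silent_degradation": RunbookEntry(
        failure_mode="Outputs look normal but eval pass rate has dropped",
        detection_signals=["output quality rate below SLO",
                           "human escalation rate rising"],
        containment=["pause new task intake",
                     "route in-flight outputs to human review"],
        investigation=["diff prompt versions from the last 48 hours",
                       "re-run the eval suite against recent outputs"],
        rollback="revert to the last prompt version that passed evals",
    ),
    # ...entries for prompt injection, cost spiral, data quality failure
}
```

Because the entries are structured, the on-call tooling can render the right page automatically when the corresponding alert fires.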

2. Define Agent SLOs Differently

Traditional SLOs measure uptime, latency, and error rate. For agents, add:

  - Task completion rate: the percentage of tasks that reach a successful terminal state.
  - Output quality rate: the percentage of outputs that pass eval checks.
  - Human escalation rate: the percentage of tasks requiring human review.
  - Tool call accuracy: the percentage of tool calls returning expected response shapes.
  - Cost per successful completion: total spend divided by successful tasks.

3. On-Call for Agent Systems

On-call for an agent system requires different instincts:

  - Don't restart first. The agent may have left partial writes in external systems; restarting can compound the damage.
  - Audit the full tool call trace before taking any action.
  - Check for data corruption in every system the agent touched.
  - Treat prompt changes as potential root causes with the same rigor as code changes.

4. Chaos Engineering for Agents

Traditional chaos engineering: kill a service and see if the system recovers. Agent chaos engineering:

  - Return malformed data from a tool and verify the agent interrupts rather than proceeding on garbage.
  - Simulate context window overflow and verify graceful degradation.
  - Inject adversarial prompt content and verify the guardrails hold.
  - Starve the retrieval system and verify the agent escalates rather than fabricating.

These tests should run in your staging environment before any agent promotion to production.

The Practical Starting Point

If you're running agents in production without an ops framework, here's the minimum viable SRE posture:

  1. Incident taxonomy: write down your agent's known failure modes and what "bad" looks like for each one
  2. Runbooks: one page per failure mode, written before the incident
  3. Alerting on agent-specific metrics: task completion rate, cost/completion, human escalation rate — not just infrastructure metrics
  4. A kill switch: a way to halt the agent and prevent new task starts, separate from infrastructure-level restarts
  5. Post-incident reviews that include prompt history: if something went wrong, the first question should be "was there a prompt change in the 48 hours before this?"
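Item 4, the kill switch, is worth sketching because "separate from infrastructure-level restarts" is the point: halting task intake must not require touching the process. A minimal in-memory sketch — a real deployment would back the flag with a shared store (a feature flag service or a key in Redis), which is an assumption here, not a prescription:

```python
import threading

class KillSwitch:
    """Halts new task starts without touching infrastructure.

    In-flight tasks can finish (or be audited); nothing new begins.
    """

    def __init__(self):
        self._halted = threading.Event()
        self.reason = ""

    def halt(self, reason: str) -> None:
        self.reason = reason
        self._halted.set()

    def guard(self) -> None:
        """Call at the top of every task-start path."""
        if self._halted.is_set():
            raise RuntimeError(f"agent halted: {self.reason}")
```

Keeping the guard at the task-start boundary, rather than inside the agent loop, means a halt never interrupts a write mid-flight — which matters given the partial-write hazards discussed above.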

FAQ

What is an AI SRE agent?

An AI SRE agent is an autonomous system designed to assist with site reliability engineering tasks — specifically root cause analysis, alert correlation, incident investigation, and remediation recommendation. Microsoft's Azure SRE Agent (GA March 2026) and PagerDuty's SRE Agent are current examples. They're designed to accelerate MTTR by automating the investigation steps that traditionally require experienced engineers.

How do you define SLOs for AI agents?

AI agent SLOs should include metrics that traditional SLOs miss: task completion rate (% of tasks that reach a successful terminal state), output quality rate (% passing eval checks), human escalation rate (% requiring human review), tool call accuracy (% returning expected response shapes), and cost per successful completion. Tracking these alongside standard latency and error rate metrics gives a complete picture of agent reliability.

How is AI agent incident response different from traditional incident response?

The key differences: agents may leave partial writes in external systems when they fail, so auditing before restarting is critical; root cause often lies in a prompt change rather than code; "silent degradation" (the agent keeps running but produces wrong outputs) requires quality monitoring to detect, not just uptime checks; and restoring service may require validating data integrity in external systems the agent touched.

What should an AI agent runbook cover?

Agent runbooks should document failure modes specific to agent behavior: silent degradation (outputs look normal but quality has dropped), prompt injection response (agent behaves unexpectedly after processing external content), cost spiral (retry loops consuming unbounded budget), and data quality failure (agent writing low-confidence outputs). Each runbook entry should include detection signals, immediate containment steps, root cause investigation procedure, and rollback path.

What is chaos engineering for AI agents?

Chaos engineering for AI agents involves deliberately injecting failure conditions to verify the system's resilience: returning malformed data from tools to test interrupt logic, simulating context window overflow to test graceful degradation, injecting adversarial prompt content to test guardrail effectiveness, and starving the retrieval system to verify the agent escalates rather than fabricating. These tests should run in staging before any agent is promoted to production.

How do you handle on-call for AI agent systems?

On-call for agent systems requires a different default instinct than traditional service on-call: don't restart first (the agent may have left partial writes in external systems), audit the full tool call trace before taking any action, check for data corruption in systems the agent touched, and treat prompt changes as potential root causes with the same rigor as code changes.

Sources