Systems Sunday · Agent Reliability

When Agents Fail: Retry Logic, Circuit Breakers, and Dead Letter Queues for AI Pipelines

Your AI agents will fail. The question is whether your system fails with them. Here's a vendor-neutral guide to the four patterns that keep production agentic pipelines resilient when things go sideways.

Published March 8, 2026 — 10 min read

Here's the scenario: your content enrichment pipeline ran fine in staging. It ran fine in the first two weeks of production. Then on a Tuesday at 3 AM, the external API it depends on returned a 503 for eight minutes. Your agent didn't retry. It didn't log anything useful. It silently dropped 140 records and moved on. You found out four days later when someone noticed the CRM gaps.

This isn't a rare edge case. It's the default behavior of most agent pipelines that were built fast — which is most of them. The question isn't whether your agents will encounter failures. They will. The question is whether your system is designed to survive them.

This post covers the four systems-level patterns every production agentic pipeline should implement: retry logic with backoff, circuit breakers, dead letter queues, and idempotency. All vendor-neutral. All applicable regardless of what orchestration tool or LLM provider you're using.

TL;DR

AI agent failures are not exceptional events — they're expected outcomes in any distributed system, and they compound fast in multi-agent pipelines. A 98% per-agent success rate across five sequential agents produces only ~90% end-to-end reliability without fault tolerance.

The four patterns that fix this are: exponential backoff with jitter (for transient failures), circuit breakers (to stop hammering a degraded service), dead letter queues (to capture failed tasks for human review), and idempotent agent actions (so safe retries don't cause duplicate side effects).

None of these require a specific vendor or framework. You can implement a working version of all four this week using whatever stack you're already running. The teams that do this before their first incident aren't just more reliable — they're significantly cheaper to operate at scale.

The math nobody wants to look at

Before getting into the patterns, it's worth understanding why this matters more for AI pipelines than for traditional software. The short version: probabilities compound.

If each agent in your pipeline has a 98% success rate per task — which is pretty good — a five-agent chain has an end-to-end success rate of roughly 90%. That's not a rounding error. That's one in ten workflows failing in ways that may not be immediately visible. As O'Reilly's recent analysis of multi-agent system reliability puts it: "Once agents are wired together without validation boundaries, risk compounds. Even strong models with a 98% per-agent success rate can quickly degrade overall system success to 90% or lower." (O'Reilly Radar, Feb 2026)
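The compounding math in that quote is easy to check directly:

```python
# End-to-end success of a sequential agent chain is the product
# of the per-agent success rates.
def chain_success(per_agent: float, n_agents: int) -> float:
    return per_agent ** n_agents

print(round(chain_success(0.98, 5), 3))  # five agents at 98% each → 0.904
print(round(chain_success(0.98, 7), 3))  # two more agents → 0.868
```

Every agent you append to the chain multiplies in another failure probability, which is why the fix has to live at the system level rather than in any single agent.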

The typical response to this observation is "our agents are more reliable than 98%." Maybe. But that calculation also ignores the transient infrastructure failures that affect every agent: rate limit hits, network blips, upstream API timeouts, and malformed outputs that break downstream schema expectations. These failures don't show up in model benchmarks. They show up in production at 3 AM.

The good news is that all of these failure modes are solvable — not by making agents smarter, but by building systems around them that absorb and recover from failure gracefully. That's what the following four patterns do.

Pattern 1: Retry with exponential backoff and jitter

The most common agent failure type is transient: a rate limit, a timeout, a momentary service hiccup. The right response to a transient failure is to wait and try again — but not immediately, and not in a way that hammers the failing service and makes things worse.

Exponential backoff means each retry waits progressively longer than the last. A common starting configuration for AI API calls in 2026 is a base delay of 0.5 seconds, a multiplier of 2.0, and a maximum wait of 30 seconds — so retries happen at roughly 0.5s, 1s, 2s, 4s, 8s, 16s, 30s before giving up. (dasroot.net, Feb 2026)

Jitter adds a small random offset to each delay. This is critical when multiple agent instances are running in parallel: without jitter, they all retry at the exact same moment, which is a thundering herd problem that makes the service failure worse. With jitter, retries spread out across a window and the load distributes naturally.

What to configure per retry attempt:

  - Maximum attempts before giving up (the example sequence above makes seven tries).
  - Base delay and multiplier (0.5 seconds and 2.0 in the example).
  - A maximum delay cap so backoff doesn't grow without bound (30 seconds here).
  - Jitter, typically a random offset of around ±25% of each computed delay.
  - Which errors are retryable: rate limits and timeouts yes; authentication and validation errors no, because retrying won't change the outcome.

Most popular orchestration frameworks and HTTP clients have retry middleware you can configure. The important thing isn't which library you use — it's that you configure it explicitly rather than accepting the defaults, which are often either too aggressive (immediate retry) or too conservative (no retry at all).
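If you'd rather see the mechanics than configure middleware, here's a minimal sketch of backoff-with-jitter using the configuration values above. The retryable exception tuple is an assumption — substitute whatever your HTTP client or LLM SDK actually raises for rate limits and timeouts:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=7, base=0.5, multiplier=2.0,
                       max_delay=30.0, jitter=0.25,
                       retryable=(TimeoutError, ConnectionError)):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; the caller routes this to the DLQ
            delay = min(base * multiplier ** attempt, max_delay)
            # jitter spreads parallel agents' retries across a window
            delay *= 1 + random.uniform(-jitter, jitter)
            time.sleep(delay)
```

Note the final re-raise: exhausted retries should surface to the caller (and ultimately the dead letter queue), never be swallowed.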

Pattern 2: Circuit breakers

Retry logic handles isolated transient failures. Circuit breakers handle the scarier scenario: a dependency that's genuinely degraded and isn't going to recover in the next thirty seconds.

The circuit breaker pattern (originally from Michael Nygard's Release It!) works by tracking failure rates over a rolling window. It operates in three states:

  - Closed: normal operation. Requests pass through while the breaker counts failures.
  - Open: the failure threshold was exceeded. Requests fail immediately, without touching the degraded dependency.
  - Half-open: after a cooldown, a small number of test requests are let through. Success closes the breaker; failure reopens it.

For AI agent pipelines, this matters in a specific way: when an LLM provider is having a degraded incident, continuing to send requests doesn't just waste money — it fills your pipeline with failures that corrupt downstream state. A circuit breaker lets you fail fast and cleanly, route to a fallback (a cheaper model, a cached response, a human-in-the-loop queue), and recover automatically when the service returns.

The NIST AI Risk Management Framework update in 2025 specifically calls out circuit breakers as a mandatory control for agentic systems: organizations should "implement circuit breakers that automatically cut off an agent's access if it exceeds token budgets or attempts unauthorized API calls." (CSO Online, Feb 2026) The framing there is security, but the operational benefits are identical.

Practical threshold to start with: Open the circuit after 5 consecutive failures or a 50% error rate over a 60-second window. Set the half-open cooldown to 30–60 seconds. Adjust based on your observed failure patterns after the first month in production — the right numbers depend on your workload and providers.
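A consecutive-failures breaker matching those starting thresholds can be sketched in a few lines. This is a deliberately minimal version: a production breaker would also track the 50% error rate over a rolling window, which is omitted here for brevity:

```python
import time

class CircuitBreaker:
    """Minimal breaker: consecutive-failure counting only."""

    def __init__(self, failure_threshold=5, cooldown=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            # cooldown elapsed: half-open, let this test request through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open (or reopen)
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

The fail-fast RuntimeError is where you'd hook in a fallback: a cheaper model, a cached response, or a human-in-the-loop queue.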

Pattern 3: Dead letter queues for failed agent tasks

Some failures aren't transient and aren't recoverable with retries. The task might require data that doesn't exist. The upstream output might be malformed in a way the retry won't fix. The agent might have hit a structural limitation in the prompt or context window.

When retries are exhausted and the circuit is open, what happens to the task? In most naive implementations: it disappears. The agent logs an error, moves on, and the task is lost. A dead letter queue (DLQ) is the fix.

A DLQ is simply a separate queue or store where failed tasks land after all retry attempts are exhausted. The key properties:

  - It preserves the original task payload, so the task can be replayed once the root cause is fixed.
  - It records the error that caused the final failure and the retry history, so triage doesn't start from zero.
  - It makes failures visible: queue depth becomes a metric you can alert on, instead of an invisible loss.

Most message queue systems (AWS SQS, RabbitMQ, Azure Service Bus, Google Pub/Sub) have native DLQ support. If you're not using a queue-based architecture, you can implement a lightweight version with a database table: write failed tasks to a failed_jobs table with status, payload, error, and retry count. A nightly review process (automated or human) triages items in the table.
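The lightweight database-table version might look like the sketch below, using an in-memory SQLite connection as a stand-in for whatever store you already run. Table and column names are illustrative, not a standard:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use your real database in production
conn.execute("""
    CREATE TABLE IF NOT EXISTS failed_jobs (
        task_id     TEXT PRIMARY KEY,
        payload     TEXT NOT NULL,     -- original task input, for replay
        error       TEXT NOT NULL,     -- what finally killed it
        retry_count INTEGER NOT NULL,
        status      TEXT NOT NULL DEFAULT 'pending_review'
    )
""")

def send_to_dlq(task_id, payload, error, retry_count):
    """Called by the retry handler once all attempts are exhausted."""
    conn.execute(
        "INSERT OR REPLACE INTO failed_jobs VALUES (?, ?, ?, ?, 'pending_review')",
        (task_id, json.dumps(payload), str(error), retry_count),
    )
    conn.commit()

def dlq_depth():
    """The number to alert on when it rises above zero."""
    return conn.execute(
        "SELECT COUNT(*) FROM failed_jobs WHERE status = 'pending_review'"
    ).fetchone()[0]
```

Storing the payload as JSON keeps replay trivial: fix the root cause, deserialize, and resubmit the task to the front of the pipeline.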

The human review workflow is the part most teams skip. A DLQ without a triage process is just a graveyard. The right model is: failed tasks in DLQ → alert triggers → on-call review → either fix-and-replay or acknowledge-and-skip with documented reason. This closes the loop and turns silent failures into visible, actionable incidents.

Pattern 4: Idempotent agent actions

Retry logic assumes it's safe to run a task more than once. For many agent actions, that assumption is wrong by default.

If your agent sends an email, creates a CRM record, or posts a social update — and then retries due to a network timeout that actually succeeded on the first attempt — you have a duplicate. Multiply that across a pipeline that retries aggressively and you have a mess that's expensive to clean up and embarrassing if customers see it.

Idempotency means designing your agent actions so that running them multiple times with the same input produces the same result as running them once. The standard approach:

  1. Generate a unique task ID when the task is created — before any retry can happen.
  2. Pass that ID as an idempotency key to every downstream API that supports one.
  3. For systems without native deduplication, check for an existing record before creating a new one.

Not every action needs to be idempotent — read-only operations are inherently safe to retry. The ones that matter are anything that writes state: API POST calls, database writes, notification sends, file creation, webhook triggers. Audit your agent's tool calls and classify each one: safe to retry freely, safe to retry with deduplication, or requires human review before retry.

Putting it together: a minimal resilience stack

You don't have to implement all four patterns simultaneously. Here's a practical sequencing based on impact-to-effort ratio:

  1. Start with retry + backoff. This is the highest-leverage change and can usually be added to existing pipelines in an afternoon. Configure it for every external API call your agents make, including LLM provider calls. Add jitter. Set explicit timeouts.
  2. Add a DLQ. Even a simple database table is dramatically better than silent task loss. Wire your retry exhaustion handler to write failures there. Set up a daily alert if DLQ depth exceeds zero.
  3. Audit for idempotency. Walk through every agent action and classify it. Add idempotency keys or check-before-write logic to any action that creates external state. This is a refactor, not a configuration change — schedule it deliberately.
  4. Add circuit breakers. Once you have production data on which dependencies fail most often, implement circuit breakers for the top two or three. Start with the LLM provider calls and any external enrichment APIs.

Once all four are in place, you'll also want a resilience dashboard: DLQ depth, circuit breaker state per dependency, retry rate by agent and error type, and mean time to recovery when failures occur. The logging infrastructure from earlier posts (AgentOps, Langfuse, or a homegrown setup) gives you the data — these metrics are just the aggregated view of it.

What about multi-agent pipelines specifically?

Each of the above patterns applies at the individual agent level. In multi-agent systems, there's an additional concern: error propagation. When Agent A produces a malformed output and Agent B consumes it without validation, the failure mode is a silent corruption that may not surface until Agent D or E — at which point the blast radius is large and debugging is painful.

The O'Reilly analysis describes using a "judge agent" at pipeline boundaries to verify and compare outputs before passing them downstream — effectively a circuit breaker for semantic quality, not just infrastructure availability. (O'Reilly Radar, Feb 2026) CIO.com's recent piece on agentic AI in 2026 echoes this: robust architectures require "circuit breakers and comprehensive audit trails from the ground up" specifically because the failure modes compound at each agent boundary. (CIO, Feb 2026)

The practical version of this doesn't require a dedicated judge agent. A validation step at each agent boundary — a schema check, a confidence score threshold, a sanity check on output length or format — catches most corruption before it propagates. The interrupt pattern (covered in a previous post) is the design-level solution; the circuit breaker is the runtime enforcement mechanism.
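A boundary validation step can be as small as the sketch below. The field names and thresholds are illustrative — the point is that each check is cheap and deterministic:

```python
def validate_boundary(output: dict) -> dict:
    """Gate between Agent A and Agent B: reject malformed output
    before it propagates downstream."""
    required = {"summary": str, "confidence": float}
    for field, expected_type in required.items():          # schema check
        if not isinstance(output.get(field), expected_type):
            raise ValueError(f"boundary check failed: bad {field!r}")
    if not 0.0 <= output["confidence"] <= 1.0:             # threshold check
        raise ValueError("confidence out of range")
    if not 10 <= len(output["summary"]) <= 4000:           # sanity check
        raise ValueError("summary length outside sane bounds")
    return output  # safe to hand to the next agent
```

A rejection here should flow into the same machinery as any other failure: retry the producing agent, and route to the DLQ if retries don't fix it.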

The mindset shift: design for failure, not against it

The common instinct when building agent pipelines is to invest in making them more reliable upfront — better prompts, more capable models, more thorough testing. All of that matters. But it doesn't change the fundamental reality of distributed systems: failures happen, often in ways you didn't anticipate, on schedules you can't control.

The teams running the most operationally stable AI automation in 2026 aren't running the most sophisticated agents. They're running agents with a deliberately low tolerance for ambiguity — agents that fail fast rather than guess — wrapped in robust infrastructure. They expect their agents to fail occasionally, and they've built systems that recover gracefully when they do.

That's not pessimism. That's production engineering.

Frequently Asked Questions

What is exponential backoff and why should I use it for AI agent retries?

Exponential backoff is a retry strategy where each successive retry waits progressively longer than the last — for example, 0.5s, 1s, 2s, 4s — rather than retrying immediately or at fixed intervals. It prevents your agent from hammering a degraded API and making the problem worse. For AI API calls specifically, adding random jitter (a small randomized offset) to each delay is important to prevent multiple parallel agents from retrying in synchronized waves. A recommended 2026 starting configuration is base delay 0.5s, multiplier 2.0, max delay 30s, with ±25% jitter per attempt.

What is a dead letter queue and how does it apply to AI agent pipelines?

A dead letter queue (DLQ) is a separate store where tasks land after all retry attempts are exhausted — rather than being silently dropped. In an AI agent context, a DLQ preserves the original task payload, the error that caused failure, and retry metadata, so the failure is visible and actionable instead of invisible. A DLQ without a triage process is just a graveyard; the real value comes from pairing it with a monitoring alert on DLQ depth and a workflow for human review, root cause analysis, and replay or acknowledgement of each failed item.

Why does reliability degrade so fast in multi-agent pipelines?

In a multi-agent pipeline, each agent's success probability multiplies together to produce an end-to-end success rate. If five agents each succeed 98% of the time independently, the chain succeeds roughly 90% of the time overall — one in ten workflows fails without any individual agent being unreliable. This compound failure effect is why fault tolerance at the system level (retries, circuit breakers, validation boundaries) matters more than marginal improvements to individual agent quality. Adding a sixth or seventh agent to a chain without fault tolerance makes this math substantially worse.

What is idempotency in the context of AI agents, and which actions need it?

An idempotent action is one that can be safely run multiple times with the same input and produce the same result — so retrying it after a network timeout doesn't create duplicate side effects. In agent pipelines, the actions that require idempotency design are any that write external state: API POST calls, database record creation, email sends, webhook triggers, file creation. Read-only operations are inherently safe to retry. The most practical implementation is to generate a unique task ID at creation time, pass it as an idempotency key to downstream APIs, and check for existing records before creating new ones in systems that don't support native deduplication.

How does a circuit breaker work for AI agent infrastructure?

A circuit breaker monitors failure rates for a dependency (an LLM provider, an enrichment API, a database) and "opens" when failures exceed a configured threshold — causing subsequent requests to fail immediately instead of waiting for timeouts. After a cooldown period, it enters a "half-open" state that allows a small number of test requests through; success closes the breaker, failure reopens it. For AI pipelines, circuit breakers prevent cascading failures from a single degraded upstream service from corrupting your entire workflow, and they reduce cost by not burning tokens on requests that will fail anyway. A reasonable starting configuration is: open after 5 consecutive failures or 50% error rate in 60 seconds; 30-second cooldown before half-open.

What monitoring metrics should I track for AI agent pipeline reliability?

The five metrics that give you a complete picture of pipeline health are: retry rate (by agent and error type — a spike signals a dependency problem), DLQ depth (tasks that exhausted retries — anything above zero deserves a look), circuit breaker state (which dependencies are currently open or half-open), end-to-end task completion rate (the pipeline-level success rate, not just per-agent), and mean time to recovery (how long from a failure event to successful retry or human resolution). Logging prompt version IDs and model parameters alongside these metrics makes post-incident root cause analysis significantly faster.

If your AI agent pipeline doesn't have retry logic, a DLQ, or circuit breakers — you're running it on hope. Supergood helps teams build the resilience layer that keeps agentic workflows stable in production: failure recovery, observability, and the systems design that makes the whole thing auditable. Start the conversation at supergood.solutions or reach out on LinkedIn.