We Wired Two AI Agents Together. Here's What Kept Breaking at the Handoff.
A real-world case study on building a two-agent review pipeline: four failure modes we hit at the seam between agents, the typed-state contract that fixed them, and a pre-build checklist for any team designing multi-agent workflows.
Single-agent failures are annoying. You ship, something breaks, you find it in the logs, you fix the prompt, you redeploy. Painful, but traceable.
Multi-agent failures are a different category of problem entirely. When two agents are wired together and something goes wrong, the failure often doesn't happen where either agent actually ran — it happens between them. In the seam. In the thing nobody thought to design carefully because everyone was focused on what the agents would do individually.
This is the story of what we learned building a two-agent pipeline for a client's proposal review workflow. The agents themselves worked fine. The handoff between them was the part that kept breaking.
Details below are drawn from a composite of real client engagements. Specifics are generalized.
We built a two-agent pipeline where Agent 1 drafted enriched proposals and Agent 2 reviewed and scored them. The individual agents performed well in isolation. In production, we hit four distinct failure modes — all at the handoff boundary: format mismatches, context loss, conflicting ground truth assumptions, and silent partial failures that looked like successes. The fix was a typed state schema that both agents agreed to upfront, explicit handoff payloads with required fields, and observability instrumented specifically at the seam. Multi-agent systems fail at the seams, not the centers. Design the contract between agents as carefully as you design the agents themselves.
The setup: a two-agent proposal review pipeline
The client was a professional services firm — about 60 people — that responded to a large volume of RFPs and client briefs. Their proposal process was bottlenecked at review: a senior consultant would draft, then two other people would review for completeness, tone, and alignment with the client's stated criteria. That review cycle averaged two days per proposal and was the single biggest drag on their capacity to respond to new opportunities.
The proposed solution: a two-agent pipeline to handle first-pass review automatically.
- Agent 1 (the Enricher): Takes the raw proposal draft and a brief with client criteria. Enriches the draft — identifies gaps, adds supporting data points, flags thin sections, and produces a structured output with the enriched text plus an annotated gap list.
- Agent 2 (the Reviewer): Takes the Enricher's output and the original client brief. Scores the proposal across five dimensions (completeness, tone, specificity, criteria alignment, competitive differentiation). Produces a structured review with scores, specific improvement notes, and a go/no-go recommendation.
A human reviewer would then look at the Reviewer's output and decide whether to approve, revise, or escalate. The goal was to compress that two-day review cycle to two hours — with the human's attention focused on the Reviewer's scorecard rather than reading every word of the draft.
[Proposal Draft + Client Brief]
        ↓
Agent 1: Enricher (enriched text + gap list)
        ↓ handoff payload
Agent 2: Reviewer (scorecard + recommendation)
        ↓
[Human Review Queue]
In testing, it worked beautifully. We ran it against 30 historical proposals — ones where we already had human review notes — and the Reviewer's scores correlated well with the human reviewers' assessments. The stakeholders were happy. We went to production.
Two weeks later, the human reviewers were confused about what the agents were actually doing. Four failure modes had crept in, and none of them were obvious until we started digging.
Failure Mode #1: Format drift at the handoff boundary
Agent 2 was reading a different shape of output than Agent 1 was producing
Agent 1 was prompted to produce a structured output with two top-level keys: enriched_draft (the full text) and gap_list (an array of gap objects, each with a section, a severity, and a description). Agent 2 was prompted to read that structure and use the gap list as input to its scoring.
What we hadn't accounted for: LLM outputs are not deterministic. On some proposals, Agent 1 returned gaps instead of gap_list. On others, it returned gap_list as a flat string instead of an array. When severity was not immediately obvious from the text, it sometimes omitted the severity field entirely.
Agent 2, reading this variable-shape output, would silently adapt — sometimes reading the gaps key, sometimes failing to find either and defaulting to "no gaps identified." The Reviewer was scoring proposals as complete when the Enricher had flagged significant gaps that simply didn't survive the handoff in a readable format.
This is the single most common multi-agent failure mode we've seen. When Agent A's output is Agent B's input, and that contract is defined only in natural language inside each agent's prompt, drift is inevitable. LLMs are probabilistic. They won't produce the exact same field names, nesting structures, or data types on every run — especially when the content varies.
A typed state schema, defined once and enforced at both ends
We defined a single shared state schema — a Pydantic model — that both agents were required to validate against. Agent 1's output was parsed through the schema before being passed to Agent 2. If the parse failed, the run stopped and an error was logged. Agent 2 received a guaranteed-shape object, not a raw LLM string.
The schema also served as explicit documentation: every field, its type, whether it was required or optional, and a one-line description. Both agent prompts referenced it directly: "Your output must conform to the EnricherOutput schema. Required fields: enriched_draft (string), gap_list (array of GapItem objects). Do not invent field names."
- Schema validation failures go to a separate error queue — not the human review queue
- The Enricher's prompt includes the full schema definition, not just field names
- Schema version is logged with every run; any schema change triggers a re-test pass
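The contract described above can be sketched in a few lines of Pydantic. This is a minimal illustration, not the production schema — the exact fields beyond enriched_draft and gap_list, and the severity levels on GapItem, are assumptions:

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

class GapItem(BaseModel):
    section: str
    severity: Literal["low", "medium", "high"]  # assumed levels, for illustration
    description: str

class EnricherOutput(BaseModel):
    enriched_draft: str
    gap_list: list[GapItem]

def parse_enricher_output(raw: str) -> EnricherOutput:
    """Validate the Enricher's raw output before it crosses the handoff."""
    try:
        return EnricherOutput.model_validate_json(raw)
    except ValidationError as exc:
        # Stop the run: malformed output goes to the error queue,
        # never to the Reviewer.
        raise RuntimeError(f"Handoff contract violation: {exc}") from exc
```

The key move is that parse_enricher_output sits between the agents: the Reviewer only ever sees a validated EnricherOutput object, never a raw LLM string.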
Failure Mode #2: Context loss — Agent 2 didn't know what Agent 1 knew
The Reviewer was scoring drafts without the context the Enricher had used to enrich them
The Enricher had access to the original client brief, the draft, and a context document with firm-specific positioning data. It used all three to produce its enriched output. The handoff payload we passed to the Reviewer, however, was just the enriched draft and the gap list.
The Reviewer had the enriched draft — but not the original brief or the positioning context. It was scoring alignment with "client criteria" that it had to infer from the draft text, because we hadn't passed the actual criteria document. On proposals where the Enricher had done a lot of work — reframing language, adding specifics — the Reviewer would score the criteria alignment as "excellent" based on the enriched text, even when the original brief had criteria that the draft still didn't address.
The Reviewer was being asked to evaluate alignment with a target it couldn't see.
Explicit context forwarding in the handoff payload
We expanded the handoff schema to include a context object: the original client brief (not just the draft), the positioning context document reference, and an enricher_notes field — a structured summary of what the Enricher had changed and why.
The enricher_notes field turned out to be more valuable than we expected. It gave the Reviewer explicit visibility into the Enricher's reasoning — "I strengthened the competitive differentiation section based on the client's stated preference for outcome-oriented language" — which let the Reviewer evaluate the quality of the enrichment, not just the quality of the resulting text.
This is a pattern we now call the annotated handoff: the first agent passes not just its output, but a summary of its reasoning and the key decisions it made. The second agent can use that reasoning as context for its own evaluation, rather than reverse-engineering it from the output text alone.
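A sketch of what an annotated handoff payload might look like. The field names inside EnricherNote and HandoffContext are illustrative assumptions, not the client's actual schema:

```python
from pydantic import BaseModel

class EnricherNote(BaseModel):
    section: str     # which part of the draft was changed
    change: str      # what the Enricher did
    rationale: str   # why -- the reasoning the Reviewer would otherwise
                     # have to reverse-engineer from the text

class HandoffContext(BaseModel):
    client_brief: str     # the original brief, forwarded verbatim
    positioning_ref: str  # pointer to the positioning context document

class HandoffPayload(BaseModel):
    enriched_draft: str
    enricher_notes: list[EnricherNote]
    context: HandoffContext
```

Because enricher_notes is structured rather than free text, the Reviewer's prompt can reference individual decisions ("evaluate whether the change to {section} actually serves {rationale}") instead of a blob of prose.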
Failure Mode #3: Conflicting ground truth — two agents, two sets of facts
The Enricher and Reviewer were using different versions of the firm's capability statements
The firm's positioning context — the document describing their capabilities, differentiators, and client references — was stored in a knowledge base. Both agents retrieved from it via a RAG step. The problem: they weren't necessarily retrieving the same chunks.
The Enricher might retrieve and use a client reference from Q4 2025. The Reviewer, doing its own retrieval pass, might retrieve a different (or contradictory) set of references. On three proposals, the Reviewer flagged the Enricher's client references as "unverified" — because the Reviewer's own retrieval hadn't surfaced the same reference. The Enricher had good data; the Reviewer just hadn't seen it.
The two agents were grounded in different subsets of the same knowledge base. Their outputs were internally consistent but externally inconsistent with each other.
Shared retrieval: the Enricher fetches, the Reviewer reads from the same payload
We changed the architecture so that the Enricher performs all knowledge base retrieval for a given proposal run. The retrieved chunks are included in the handoff payload under a retrieved_context key. The Reviewer reads from that same set of chunks — it does not perform its own retrieval pass.
This has two benefits: consistency (both agents operate from the same retrieved facts) and traceability (every run's behavior is reproducible from the same inputs). The retrieved context is logged as part of the run record, which means you can replay any past run exactly. This is the same principle that frameworks like LangGraph bake into their state model — a single shared state that flows through the graph, rather than each node fetching its own.
The Reviewer can still flag low-confidence references — but it's now flagging them because the retrieved context itself is weak, not because it pulled different context than the Enricher did.
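The single-retrieval pattern is simple to express in code. This sketch uses hypothetical retrieve/enrich/review callables standing in for the real retriever and agent steps — the point it demonstrates is that both agents read the identical chunk set:

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedChunk:
    chunk_id: str
    text: str

@dataclass
class PipelineState:
    draft: str
    brief: str
    retrieved_context: list = field(default_factory=list)

def run_pipeline(state, retrieve, enrich, review):
    # One retrieval pass, before either agent runs
    state.retrieved_context = retrieve(state.brief)
    enriched = enrich(state.draft, state.brief, state.retrieved_context)
    # The Reviewer reads the SAME chunks -- no second retrieval pass
    return review(enriched, state.brief, state.retrieved_context)
```

Logging state.retrieved_context with the run record is what makes replay possible: the same inputs and the same chunks reproduce the same run.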
Failure Mode #4: Silent partial success — the run looked fine but wasn't
The pipeline reported success on runs where the Reviewer had quietly skipped scoring dimensions it couldn't evaluate
The Reviewer was asked to score across five dimensions. When it couldn't confidently score a dimension — because the relevant section was missing, or the context was insufficient — it would assign a midpoint score (3 out of 5) and move on. It didn't flag this. It didn't raise an error. The run completed. The scorecard looked complete. The human reviewer would act on a 3/5 score in "competitive differentiation" not knowing that the agent had essentially abstained rather than evaluated.
This is a variant of the hallucination problem from our earlier case study — but more subtle. The agent wasn't making things up; it was quietly defaulting to a neutral value when it lacked the information to score honestly. From the outside, the output looked complete and confident.
Explicit abstention scores + confidence metadata on every dimension
We updated the Reviewer's output schema to include, for each scoring dimension: a score (1–5), a confidence field ("high", "medium", "low", or "abstain"), and a rationale string. The agent was re-prompted: "If you cannot evaluate a dimension with reasonable confidence, set confidence to 'abstain' and score to null. Do not guess. A visible abstention is more useful than a false midpoint score."
- Any dimension with confidence: "abstain" is highlighted in the human review queue with a visual indicator
- A run with more than two abstentions on a single proposal is flagged as "incomplete review" and routed to a different queue for fuller human evaluation
- The overall review report shows a "confidence distribution" — how many dimensions were scored at high, medium, low, or abstain — so reviewers can calibrate how much to trust the automated scorecard
The pattern underneath all four failures
Looking back at all four, there's a single underlying cause: we designed the agents but we didn't design the interface between them.
Every software engineer knows that APIs have contracts — explicit definitions of what goes in, what comes out, what errors look like, and what the caller can rely on. We designed two agents without writing a contract between them. We assumed the prompts would keep both sides in sync. They didn't.
The handoff is the product. In a single-agent system, you design the agent's behavior. In a multi-agent system, you design the agent's behavior and the contract between agents. If you only design the former, you'll spend your production debugging time on the latter.
The LangChain team's 2026 State of Agent Engineering report surveyed 1,300+ practitioners and found that quality — not cost, not latency — is now the top barrier to production agent deployment, cited by 32% of respondents. Our experience maps directly to this. The quality problems that block production confidence aren't in the agents' individual performance. They're in the seams between components — the parts that don't show up in a single-agent benchmark.
What the fixed pipeline actually looks like
After retrofitting all four fixes, the pipeline has now been running in production for eight weeks. Here's how it's structured:
- Retrieval step (pre-Enricher): A dedicated retrieval step pulls all relevant chunks from the knowledge base for this proposal. Output: a retrieved_context object with chunk IDs and text. This step runs once and its output flows through the entire pipeline.
- Enricher step: Agent 1 receives the draft, the client brief, and the retrieved_context. It produces an output conforming to the EnricherOutput Pydantic schema. Output is parsed and validated before proceeding. Parse failures go to an error queue.
- Handoff payload assembly: A deterministic step (no LLM) assembles the full handoff payload: enriched draft, gap list, enricher notes, original brief, retrieved context. This step is pure data transformation — it just structures what the Enricher produced plus the original inputs.
- Reviewer step: Agent 2 receives the assembled handoff payload. It produces a scorecard conforming to the ReviewOutput Pydantic schema, with per-dimension confidence levels. Abstentions are explicit fields, not nulls.
- Output routing: A deterministic routing step inspects the Reviewer's output and routes to the appropriate human queue: standard review, incomplete review (too many abstentions), or error review (schema validation failures).
The observability layer is instrumented at every step, with particular attention to the handoff boundaries — input size, output size, schema parse result, and wall-clock time are logged at each transition. When something goes wrong, the trace immediately shows which step and which field caused the issue.
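One way to capture those seam metrics in a single place is to wrap the parse step itself. A sketch, assuming Pydantic v2 and a hypothetical step name — real pipelines would ship this to a tracing backend rather than stdlib logging:

```python
import json
import logging
import time

from pydantic import BaseModel, ValidationError

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("handoff")

def instrumented_parse(step: str, raw: str, schema: type[BaseModel]):
    """Parse an agent's raw output and log the seam metrics in one record:
    output size, parse result, which fields failed, and wall-clock time."""
    start = time.monotonic()
    parsed, failed_fields = None, []
    try:
        parsed = schema.model_validate_json(raw)
    except ValidationError as exc:
        # Record WHICH fields failed, not just that the parse failed
        failed_fields = [".".join(map(str, e["loc"])) for e in exc.errors()]
    log.info(json.dumps({
        "step": step,
        "output_bytes": len(raw.encode("utf-8")),
        "schema_parse_ok": parsed is not None,
        "failed_fields": failed_fields,
        "latency_ms": round((time.monotonic() - start) * 1000, 2),
    }))
    return parsed
```

Because every transition emits the same record shape, a failing run points straight at the step and the field that broke the handoff.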
The results
The human reviewers now describe their job as "reviewing the agent's review" rather than doing the review themselves. On a typical proposal, they spend about 20–30 minutes checking the Reviewer's scorecard, reading the flagged gaps, and making a final call. The two-day cycle is gone. Complex proposals still need deeper human attention — but the routing step identifies those automatically.
The more important number: before the fixes, the team had lost confidence in the pipeline after the first two weeks of confusing outputs. After the fixes, confidence recovered and the team actually uses it now. A multi-agent system that works in theory but loses user trust in practice is not a working system.
The handoff design checklist
Multi-Agent Handoff Design Checklist
- Define a typed state schema (Pydantic, TypeScript interface, or equivalent) before writing either agent's prompt — the schema is the contract
- Parse and validate every agent output against the schema before passing it to the next agent — never pass raw LLM strings across agent boundaries
- Schema validation failures route to an error queue, not the next step — a partial output is worse than a visible failure
- The handoff payload includes not just the agent's output but its reasoning notes ("annotated handoff") — downstream agents shouldn't reverse-engineer upstream logic from output text
- Shared retrieval: if multiple agents read from the same knowledge base, do the retrieval once and pass the chunks through the state — don't let each agent retrieve independently
- Every scoring or evaluation output includes a confidence level per dimension — explicit abstention is always better than a false neutral score
- Observability is instrumented at the seam, not just inside each agent — log input size, output size, schema parse result, and latency at every handoff boundary
- A deterministic routing step (no LLM) handles the output dispatch — don't ask an agent to decide where its own output should go
- The pipeline is tested with adversarial inputs: missing fields, wrong types, truncated outputs, and API timeouts at each step
- Schema changes require a re-test pass against a regression set — you can't just update one agent's prompt without checking the downstream agent still reads the new structure correctly
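The adversarial-input item on the checklist is cheap to automate. A minimal sketch — the inline EnricherOutput here is a deliberately simplified stand-in schema, and the three payloads are hypothetical examples of the failure classes named above (missing field, wrong type, truncated output):

```python
from pydantic import BaseModel, ValidationError

class EnricherOutput(BaseModel):
    enriched_draft: str
    gap_list: list[dict]

# One payload per failure class the checklist calls out
ADVERSARIAL_CASES = [
    '{"enriched_draft": "text"}',                       # missing required field
    '{"enriched_draft": "text", "gap_list": "none"}',   # wrong type
    '{"enriched_draft": "text", "gap_li',               # truncated output
]

def leaked_payloads():
    """Return any adversarial payload the schema wrongly accepted."""
    leaked = []
    for raw in ADVERSARIAL_CASES:
        try:
            EnricherOutput.model_validate_json(raw)
            leaked.append(raw)  # should have been rejected
        except ValidationError:
            pass                # correctly bounced to the error queue
    return leaked
```

Run this as part of the regression pass after every schema change: an empty result means every malformed payload was stopped at the boundary.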
A note on frameworks: the patterns above work regardless of which agent framework you use. LangGraph makes some of this easier through its explicit shared state model and durable execution — failures can be replayed from exact stopping points rather than restarted from scratch, which is enormously useful for debugging handoff failures. For the broader pattern of how to handle retries, circuit breakers, and dead letter queues when those failures do occur, see our post on agent failure recovery. CrewAI takes a different approach with role-based agents passing structured task outputs, which also enforces some of this discipline implicitly. But the underlying principle — define the contract, validate the boundary, instrument the seam — is framework-agnostic.
What this means for your next multi-agent build
If you're building a two-agent or multi-agent pipeline, here's the one thing to internalize before you write a single line of prompt: the hardest part will not be making each agent work — it will be making them work together.
Individual agents are hard. But they're hard in ways that have well-understood fixes: better prompts, retrieval tuning, output constraints, evals. Multi-agent handoff failures are harder because they don't announce themselves. The run completes. The output looks plausible. The problems surface downstream — when a reviewer acts on a score the agent actually abstained on, when a gap list that should have been an array arrives as a string, when two agents operating on different retrieved context reach contradictory conclusions neither of them flags.
The investment is worth it. A well-designed two-agent pipeline can do things that no single agent can — parallel reasoning, specialized expertise, built-in self-review. But the investment is in the seam, not just the agents. Design the contract first. Build the schema before the prompts. Instrument the boundary as carefully as you instrument the steps.
Your human reviewers — the ones who have to trust the output — will thank you.
Frequently Asked Questions
What is an agent handoff in a multi-agent AI system?
An agent handoff is the point where one AI agent passes its output to another agent as input. Unlike a simple function call, the handoff in a multi-agent system involves transferring not just data but context, reasoning state, and any retrieved information the downstream agent needs to continue the task. Poorly designed handoffs — where the output format or content doesn't match what the downstream agent expects — are one of the leading causes of quality failures in production multi-agent pipelines.
Why do multi-agent AI systems fail at the handoff between agents?
LLM outputs are non-deterministic, meaning the same prompt can produce slightly different output structures across runs — different field names, different nesting, different data types. If Agent A's output is Agent B's input and there's no enforced schema between them, format drift accumulates over time. Downstream agents adapt silently, filling gaps with defaults or assumptions rather than raising errors. The result is a pipeline that looks like it's working while producing quietly incorrect outputs.
What is a typed state schema in the context of AI agent pipelines?
A typed state schema (commonly implemented with Pydantic in Python or TypeScript interfaces in JS/TS) is a formal definition of the data structure that flows between agents in a pipeline. It specifies every field, its type, whether it's required or optional, and validation constraints. Agent outputs are parsed against the schema before being passed to the next agent — if the parse fails, the run stops and an error is logged rather than passing malformed data downstream. This is the most effective single fix for multi-agent format drift failures.
What is the "annotated handoff" pattern for multi-agent AI systems?
The annotated handoff is a pattern where the first agent passes not just its output but a structured summary of its reasoning and key decisions alongside the output data. For example, an Enricher agent might include a field like enricher_notes explaining which sections it strengthened and why. The downstream Reviewer agent can then evaluate the quality of the enrichment process, not just the final text — which produces more accurate and actionable reviews. It also significantly helps with debugging, since the upstream agent's reasoning is explicit rather than having to be reverse-engineered from the output.
How should I instrument observability for a multi-agent AI pipeline?
Standard agent observability (tracing LLM calls, logging token counts, timing steps) is necessary but not sufficient for multi-agent systems. You also need to instrument specifically at the handoff boundaries: log input size and output size at each agent boundary, record schema parse results (success/failure and which fields failed), and track per-dimension confidence levels if your agents produce scored outputs. Tools like LangSmith, Arize Phoenix, and Braintrust all support multi-step trace visualization; the goal is to make the seam between agents as visible as the agents themselves.
Should multiple agents in a pipeline each retrieve from the knowledge base independently?
Generally no — if multiple agents in a pipeline need to reason about the same set of facts, it's safer to retrieve once (typically before the first agent step) and pass the retrieved chunks as part of the shared state through the entire pipeline. Independent retrieval by each agent introduces divergent grounding: two agents may retrieve different subsets of the knowledge base and reach contradictory conclusions neither of them flags as inconsistent. Single-retrieval pipelines are also more reproducible, since the same inputs will produce the same retrieved context and therefore more consistent behavior across runs. See the LangGraph documentation on shared state for a framework-level implementation of this pattern.
Sources
- LangChain — State of Agent Engineering 2026 (survey of 1,300+ practitioners)
- AWS Blog — "Evaluating AI Agents: Real-world lessons from building agentic systems at Amazon," February 2026
- Zircon.tech — "Agentic Frameworks in 2026: What Actually Works in Production," February 2026
- LangGraph Documentation — Graph API, shared state, and durable execution
- Galileo — "How to Debug AI Agents: 10 Failure Modes + Fixes," October 2025
- ByteBridge — "From Human-in-the-Loop to Human-on-the-Loop: Evolving AI Agent Autonomy," January 2026
Designing a multi-agent pipeline and want to stress-test the seams before you hit production? That's exactly the kind of architecture review we do at Supergood Solutions. Drop us a line and let's map out your handoff contracts before the agents find the gaps for you.