We Rolled Back Our Agent to a Workflow — And It Was the Right Call
TL;DR: A team replaced a finicky LLM agent with a boring deterministic workflow and a *narrow* LLM call at one step. Latency dropped ~70%, monthly spend fell by an order of magnitude, and the 3am pages stopped. The contrarian lesson: "agentic" is a design choice, not a default, and the teams winning with AI in 2026 are the ones comfortable ripping out an agent when a workflow does the job.
Key Insight
The industry pushed "agents" as the next default architecture the moment tool-use matured. That framing collapsed a useful distinction: an agent decides its own control flow; a workflow has control flow you wrote, with LLM calls at specific steps. Anthropic's own 2024 essay on "Building effective agents" makes this explicit — most production value lives in workflows, not agents, and teams should reach for an agent only when the task genuinely needs dynamic planning.
The contrarian take: rolling back from an agent to a workflow is a promotion, not a regression. It means you now understand the task well enough to encode its structure. That's a win. Most teams resist the rollback because it feels like admitting the agent "failed" — but an agent that taught you the shape of your problem did its job.
Why Teams Miss This
Three patterns we keep seeing:
1. "Agent" became a status symbol. Leadership funds "agent" projects, not "workflow with two LLM calls" projects. So teams paper over a deterministic pipeline with a ReAct loop, pay a 4–8x token tax for the reasoning trace, and call it an agent. Then it flakes in production because the task never needed dynamic planning.
2. Eval sets hide the decision problem. The team's evals measure output quality on a fixed task distribution. They don't measure: does the LLM actually need to decide the order of operations here? On 90%+ of enterprise tasks, the order is stable. You're paying an LLM to re-derive it every run.
3. Rollback feels like retreat. Engineers who shipped the agent are the ones being asked to delete it. That's culturally hard. Leadership doesn't help when the quarterly OKR says "ship agentic workflows." The fix is naming the pattern out loud: "We ran the agent in shadow mode, learned the task shape, now we're hardening it as a workflow."
How to Actually Do It
Here's the decision framework we've seen work in three separate rollbacks:
Step 1 — Log the agent's decision trace in production for 2 weeks. Every tool call, the order it ran in, every branch taken. Don't change behavior; just capture the DAG the agent executed on each run.
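Step 1 can be as small as a wrapper around each tool. A minimal sketch — the tool names and the logger shape are assumptions, not a real agent harness:

```python
import uuid

class TraceLogger:
    """Captures the ordered tool-call path of each agent run."""
    def __init__(self):
        self.runs = []
        self.current = None

    def start_run(self):
        self.current = {"run_id": str(uuid.uuid4()), "path": []}

    def wrap(self, tool):
        """Wrap a tool so every invocation appends its name to the current path."""
        def logged(*args, **kwargs):
            self.current["path"].append(tool.__name__)
            return tool(*args, **kwargs)
        logged.__name__ = tool.__name__
        return logged

    def end_run(self):
        self.runs.append(self.current)
        self.current = None

# Hypothetical tools standing in for the agent's real toolset:
def search(query): return ["doc-1"]
def summarize(docs): return "summary"

logger = TraceLogger()
search, summarize = logger.wrap(search), logger.wrap(summarize)

logger.start_run()
summarize(search("q3 report"))
logger.end_run()
# logger.runs[0]["path"] is now ["search", "summarize"]
```

The point is that the wrapper changes nothing about behavior — the agent still calls the same tools — it only records the path, which is exactly what Step 2 needs.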
Step 2 — Cluster the traces. If 85%+ of runs follow one of ≤5 distinct paths, you have a workflow. If the long tail of "unique paths" is a real 30%+ of traffic, you genuinely have an agent problem — keep it.
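The clustering in Step 2 often needs nothing fancier than counting exact paths. A sketch with made-up trace data:

```python
from collections import Counter

def path_coverage(traces, top_n=5):
    """Fraction of runs covered by the top_n most frequent tool-call paths."""
    counts = Counter(tuple(t) for t in traces)
    top = counts.most_common(top_n)
    return sum(n for _, n in top) / len(traces)

# Illustrative traces: 100 runs, mostly following three stable paths.
traces = (
    [["classify", "search", "summarize"]] * 70
    + [["classify", "search", "summarize", "email"]] * 20
    + [["classify", "fetch", "summarize"]] * 7
    + [["classify", "search", "fetch", "summarize"], ["classify"], ["search"]]
)
coverage = path_coverage(traces, top_n=5)
print(f"top-5 coverage: {coverage:.0%}")  # 99/100 runs -> strong workflow candidate
```

If exact-match counting fragments into too many near-duplicate paths, normalize first (drop retries, collapse repeated calls) before counting.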
Step 3 — Encode the top paths as deterministic workflow steps. Python, a state machine, a DAG framework (Temporal, Prefect, LangGraph in workflow mode), whatever fits. Keep LLM calls at the ambiguous steps only — classification, extraction, generation. Everything else is code.
```python
# Before — the agent re-decides control flow on every request:
result = agent.run(user_request, tools=[search, fetch, summarize, email])

# After — control flow is code; the LLM runs only at the ambiguous steps:
intent = classify(user_request)                # 1 LLM call, structured output
docs = search(intent.query)                    # deterministic
filtered = [d for d in docs if d.score > 0.7]  # deterministic
summary = summarize(filtered)                  # 1 LLM call
if intent.requires_email:
    send_email(summary, intent.recipient)      # deterministic, audited
```
Step 4 — Keep the agent alive in shadow mode on the long tail. The 15% of weird requests that don't match the top paths still route to the agent. You now have two systems: a cheap reliable workflow doing most of the traffic, and an agent as a fallback. Over time, new clusters in the long tail can be promoted into the workflow.
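The routing in Step 4 reduces to one branch. A sketch with hypothetical intent names and stub handlers standing in for the real systems:

```python
def route(request, classify, workflow, agent):
    """Known intents go to the deterministic workflow; the long tail
    falls back to the shadow-mode agent."""
    known_intents = {"summarize", "summarize_and_email"}  # promoted paths (assumed names)
    intent = classify(request)
    if intent in known_intents:
        return workflow(request, intent)  # cheap, reliable, most of the traffic
    return agent(request)                 # flexible fallback for weird requests

# Stub classifier/handlers for illustration:
fake_classify = lambda r: "summarize" if "summarize" in r else "unknown"
fake_workflow = lambda r, i: f"workflow:{i}"
fake_agent    = lambda r: "agent"

common = route("summarize the Q3 report", fake_classify, fake_workflow, fake_agent)
weird  = route("negotiate a refund with the vendor", fake_classify, fake_workflow, fake_agent)
# common -> "workflow:summarize", weird -> "agent"
```

Promoting a long-tail cluster later means adding its intent to the known set and writing its workflow steps — the router itself doesn't change.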
Step 5 — Measure the rollback honestly. Track p50/p95 latency, cost per 1K requests, and the on-call page rate. In the three cases we've seen: latency dropped 50–80%, cost dropped 6–12x, and incidents dropped by half or more. If your numbers move less than that, the rollback was still worth it for the observability alone — workflow steps are easier to debug than an agent's reasoning chain.
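For Step 5, a nearest-rank percentile is enough precision for a before/after comparison. The latency numbers below are illustrative, not from a real rollback:

```python
def percentile(samples, p):
    """Nearest-rank percentile — coarse, but fine for comparing two systems."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Made-up request latencies in ms:
agent_ms    = [900, 1100, 1300, 2400, 4800, 950, 1200, 1800, 3100, 1050]
workflow_ms = [180, 220, 250, 400, 610, 200, 230, 320, 540, 210]

for name, xs in (("agent", agent_ms), ("workflow", workflow_ms)):
    print(f"{name:9s} p50={percentile(xs, 50)}ms  p95={percentile(xs, 95)}ms")
```

Track the same two percentiles plus cost per 1K requests and page rate before the cutover, so the comparison is against a real baseline rather than memory.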
What We've Learned
The "agent vs workflow" choice isn't a one-time decision; it's a dial you adjust as you learn the task. Ship an agent when you're genuinely uncertain about control flow. Roll it back to a workflow once the traces stabilize. Be willing to re-promote to an agent if requirements shift. Most teams miss the middle step.
The uncomfortable observation: a lot of what's marketed as "agentic AI" in 2026 is a workflow wearing a ReAct costume. That's fine as a prototype. It's expensive and flaky as a production system.
Next experiment: Pick one "agent" you've shipped in the last 6 months. Pull two weeks of its tool-call traces. Plot path frequency. If the top 3 paths cover more than 80% of runs, you have a candidate for rollback — and a ~week of engineering work to reclaim most of its cost and latency.
Sources
- Anthropic — Building Effective Agents (Dec 2024)
- LangGraph — Workflows vs Agents Documentation
- Temporal — Durable Workflows for AI
- Google — Agent Development Kit: Workflow Agents
- DeepLearning.AI — Agentic Design Patterns (Andrew Ng)
FAQ
Q: Isn't rolling back to a workflow just "not using AI"?
A: No — the LLM is still doing the work it's best at (classification, extraction, generation). You're removing the LLM from control flow, which is the expensive, unreliable part. The AI still runs; it just stops re-deciding what to run next on every request.
Q: How do I know if my task actually needs agentic control flow?
A: Log 2 weeks of traces and cluster them. If the top 5 paths cover 85%+ of runs, you have a workflow. If runs look genuinely different each time (novel tool combinations, unpredictable branching), keep the agent. "Feels complex" is not the same as "is dynamically structured."
Q: Won't we lose the agent's ability to handle edge cases?
A: Keep the agent as a fallback. Route the common paths (80–90% of traffic) to the workflow and the long tail to the agent. You get the reliability of the workflow and the flexibility of the agent without paying agent prices on every request.
Q: What frameworks support the workflow-with-LLM-steps pattern well?
A: LangGraph (workflow mode), Temporal, Prefect, Airflow with LLM operators, and AWS Step Functions all work. If you're early, plain Python with a state machine is fine — the framework matters less than having clear, named steps with typed inputs and outputs.
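What "plain Python with a state machine" can look like — a sketch with illustrative step names, where each step has a typed input and output and the LLM calls are stubbed:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Intent:
    query: str
    requires_email: bool = False

# Named steps with typed inputs/outputs; the "framework" is just an ordered
# list the runner walks. All names here are illustrative, not a real API.
def classify_step(request: str) -> Intent:
    return Intent(query=request)            # would be the structured LLM call

def search_step(intent: Intent) -> list:
    return [f"doc-for:{intent.query}"]      # deterministic retrieval

def summarize_step(docs: list) -> str:
    return f"summary of {len(docs)} docs"   # would be the second LLM call

PIPELINE: list[Callable] = [classify_step, search_step, summarize_step]

def run(request: str):
    state = request
    for step in PIPELINE:
        state = step(state)  # each step's output is the next step's input
    return state
```

The typed boundaries are what buy you the debuggability: when a run fails, you know exactly which named step produced a bad value.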
Q: How do I sell "rolling back the agent" to leadership that funded "agentic AI"?
A: Reframe it. You didn't roll back — you promoted the agent's learned behavior into a hardened workflow. The agent is still running on the ambiguous long tail. Numbers help: lead with the cost, latency, and incident reductions. "We made the agent 10x cheaper to run" lands differently than "we replaced the agent."
Q: Does this mean multi-agent architectures are also overhyped?
A: For most enterprise tasks, yes. If a single agent didn't need dynamic planning, a committee of them won't either — you've just multiplied the token bill. Multi-agent starts to pay when you have genuinely different specializations that need to negotiate, which is rarer than conference talks suggest.