Manual Work Monday · Workflows

Shipping AI Agents Without Evals Is Just Shipping Bugs (Here’s the Practical Fix)

“It worked in testing” doesn’t mean anything when your system is probabilistic, tool-connected, and one Slack message away from doing something creative in production.

Published February 23, 2026 — 8 min read

Here’s the uncomfortable truth: most teams don’t ship AI agents. They ship unobserved behavior and hope the logs will be “good enough” when something breaks.

If that sounds dramatic, consider what an agent actually is: a probabilistic model that reads untrusted input, decides what to do next, and calls tools that write to your real systems.

That is not a workflow. That is production software with a personality.

The goal isn’t “perfect answers.” It’s predictable failure.

Classic automations fail like clocks: the condition didn’t match, the API 500’d, a field changed. Annoying, but understandable.

Agents fail like interns: they misunderstand, overreach, get tricked by untrusted inputs, and occasionally take the longest possible path to the right result.

So the question isn’t “how do we stop failures?” The question is: how do we make failures predictable, visible, and cheap to recover from?

This is exactly the framing in modern AI risk guidance: risk management is continuous and spans the lifecycle — not something you do once at launch (NIST AI RMF).

Step 1: define “good” with a tiny eval set (not a thesis)

You don’t need 10,000 labeled examples to start. You need 25–100 cases that represent your real world: the requests you actually get, the edge cases that have already burned you, and the inputs the agent must refuse or escalate.

For marketing ops, a starter set might include the requests you already field every week across CRM, email, ads, and the data warehouse, plus the handful that have gone sideways before.

Then you choose the metric that actually matters. Examples: exact-match accuracy where there is a single right answer, correct tool selection with the right arguments, zero writes outside the allowed scope, cost per resolved request.

Open-source frameworks exist for this (for example, OpenAI Evals), but the tooling choice is not the point. The point is: write down what “pass” means.
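
To make “pass” concrete, here is a minimal harness sketch in plain Python, not tied to any framework. The case fields, the grader logic, and names like stub_agent are illustrative assumptions, not prescriptions:

```python
# A minimal eval harness: a handful of cases, an explicit grader,
# and a single pass rate. All names here are illustrative.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    input: str                 # what the agent receives
    expected_tool: str | None  # tool the agent should call, or None for "refuse/escalate"
    must_not_write: bool       # True if any write at all would be a safety failure

def grade(case: EvalCase, tool_called: str | None, wrote_to_prod: bool) -> bool:
    """A case passes only if the right tool was chosen and no forbidden write happened."""
    if case.must_not_write and wrote_to_prod:
        return False
    return tool_called == case.expected_tool

def run_eval(cases: list[EvalCase], agent: Callable[[str], tuple[str | None, bool]]) -> float:
    """Run every case through the agent and return the pass rate."""
    passed = sum(grade(c, *agent(c.input)) for c in cases)
    return passed / len(cases)

if __name__ == "__main__":
    cases = [
        EvalCase("routine lead update", "Update the owner on lead 4412 to Dana", "crm_update", False),
        EvalCase("ambiguous delete", "Clean up the old contacts", None, True),  # should escalate, never write
    ]

    def stub_agent(prompt: str) -> tuple[str | None, bool]:
        # Placeholder for your real agent call; returns (tool_called, wrote_to_prod).
        return ("crm_update", False)

    print(f"pass rate: {run_eval(cases, stub_agent):.0%}")
```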

Step 2: turn evals into a release gate (yes, even for “just prompts”)

Most teams treat prompts like content. That’s how you end up with prompt edits shipping on vibes.

A better pattern:

  1. Every prompt / policy / tool schema change runs the eval set
  2. Results are stored with a version (date + hash + owner)
  3. Production only updates if the run meets thresholds

This is boring DevOps. Good. You want boring.

Minimum viable thresholds

Keep them few and explicit: a floor on task accuracy, zero forbidden tool calls, and a ceiling on cost per run.
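
As a sketch, the gate can be a short CI script: run the evals, write a result record tagged with date, hash, and owner, and exit non-zero if any threshold is missed. The threshold numbers, file names, and owner below are assumptions to adapt, not recommendations:

```python
# Hypothetical release gate: run the eval set, record the result with a
# version (date + hash + owner), and refuse to promote below thresholds.

import datetime, hashlib, json, subprocess, sys

THRESHOLDS = {
    "min_pass_rate": 0.95,        # task accuracy floor (illustrative)
    "max_forbidden_calls": 0,     # safety: zero tolerance
    "max_cost_per_run_usd": 0.25, # cost ceiling (illustrative)
}

def gate(results: dict) -> bool:
    return (
        results["pass_rate"] >= THRESHOLDS["min_pass_rate"]
        and results["forbidden_calls"] <= THRESHOLDS["max_forbidden_calls"]
        and results["cost_per_run_usd"] <= THRESHOLDS["max_cost_per_run_usd"]
    )

if __name__ == "__main__":
    # In CI you would call your real eval runner here; this dict stands in for its output.
    results = {"pass_rate": 0.97, "forbidden_calls": 0, "cost_per_run_usd": 0.12}

    record = {
        "date": datetime.date.today().isoformat(),
        "git_hash": subprocess.run(["git", "rev-parse", "--short", "HEAD"],
                                   capture_output=True, text=True).stdout.strip(),
        # "prompt.txt" is a stand-in for wherever your prompt/policy actually lives.
        "prompt_hash": hashlib.sha256(open("prompt.txt", "rb").read()).hexdigest()[:12],
        "owner": "ops@example.com",
        "results": results,
    }
    print(json.dumps(record, indent=2))  # store this alongside the artifact you deploy

    sys.exit(0 if gate(results) else 1)  # a non-zero exit blocks the deploy in CI
```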

Step 3: add “shadow mode” before you give write access

If your agent currently writes to production systems on day one, you’re doing a trust fall with your CRM.

Shadow mode means the agent sees real inputs and drafts real outputs, but nothing gets written anywhere. Humans keep doing the job, and you log what the agent would have done next to what the human actually did.

After a week, you’ll have the most valuable artifact in agent ops: a dataset of real inputs + real outputs + what humans would have done. That becomes your eval set v2.
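
A minimal sketch of that wrapper, assuming your agent proposes actions as structured objects (the field names and the JSONL log are illustrative):

```python
# Hypothetical shadow-mode wrapper: the agent proposes, nothing executes,
# and every proposal is logged next to what the human eventually did.

import json, time, uuid

SHADOW_LOG = "shadow_runs.jsonl"

def shadow_run(agent, task: dict) -> dict:
    """Run the agent on a real task but record the proposal instead of executing it."""
    proposal = agent(task)  # e.g. {"tool": "crm_update", "args": {...}}
    record = {
        "run_id": str(uuid.uuid4()),
        "ts": time.time(),
        "task": task,
        "agent_proposal": proposal,
        "human_action": None,  # filled in later, once a person handles the same task
        "executed": False,     # shadow mode: never True
    }
    with open(SHADOW_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

def record_human_action(run_id: str, action: dict) -> None:
    """Attach what the human actually did; the diff against agent_proposal becomes eval set v2."""
    # In practice you would update the matching record in a real store;
    # the JSONL file above exists only to keep this sketch self-contained.
    pass
```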

Step 4: instrument the agent like production software (because it is)

If you can’t answer “what happened?” in under 60 seconds, you don’t have an agent. You have a mystery.

What to log (every run)

At minimum: a run ID, the input, the prompt and model version, every tool call with its arguments and result, token counts and cost, latency, and the final output.

Standardization helps here: OpenTelemetry defines semantic conventions so telemetry uses consistent names across libraries and platforms (OpenTelemetry).
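
At the simplest, that can be one structured log record per run. The field names below are assumptions; if you adopt OpenTelemetry you would map them onto its gen_ai semantic-convention attributes instead:

```python
# One structured log record per agent run. Field names are illustrative;
# the point is that "what happened?" is answerable from a single record.

import json, logging, time, uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("agent.runs")

def log_run(*, prompt_version: str, model: str, user_input: str,
            tool_calls: list[dict], output: str,
            input_tokens: int, output_tokens: int,
            cost_usd: float, latency_ms: int) -> str:
    run_id = str(uuid.uuid4())
    log.info(json.dumps({
        "run_id": run_id,
        "ts": time.time(),
        "prompt_version": prompt_version,   # ties the run back to the eval-gated version
        "model": model,
        "input": user_input,
        "tool_calls": tool_calls,           # name, args, and result status for each call
        "output": output,
        "usage": {"input_tokens": input_tokens, "output_tokens": output_tokens},
        "cost_usd": cost_usd,
        "latency_ms": latency_ms,
    }))
    return run_id
```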

Step 5: treat prompt injection like a real incident class

Prompt injection isn’t a theoretical “AI risk.” It’s just input manipulation — and agents are input-powered.

OWASP maintains a dedicated Top 10 list for LLM applications because the failure modes aren’t the same as classic web apps (OWASP Top 10 for LLM Applications (2025)).

Practically, for ops teams, this means treating everything the agent reads (emails, form fills, web pages, CRM notes) as untrusted input, keeping tool permissions as narrow as the task allows, and alerting the moment the agent attempts a call outside its allowlist.
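
One cheap control that covers a lot of this: an allowlist in front of every tool call, with an alert when anything outside it is attempted. A sketch with hypothetical tool names:

```python
# Hypothetical tool gate: the agent can only call tools on the allowlist,
# and any attempt outside it raises an alert instead of executing.

import logging

log = logging.getLogger("agent.security")

ALLOWED_TOOLS = {"crm_read", "email_draft"}  # note: no deletes, no sends, no bulk writes

class ForbiddenToolCall(Exception):
    pass

def call_tool(name: str, args: dict, registry: dict):
    if name not in ALLOWED_TOOLS:
        # This is the "forbidden tool call attempted" alert from the checklist below.
        log.error("FORBIDDEN_TOOL_CALL tool=%s args=%s", name, args)
        raise ForbiddenToolCall(name)
    return registry[name](**args)
```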

Step 6: put cost + rate limits on the agent (or it will explore)

Agents don’t just answer. They try things. That’s what makes them useful — and expensive.

Minimum guardrails: a hard budget per run, a cap on tool calls and retries, rate limits on outbound writes, and a way to pause the agent without a deploy.
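
A sketch of what a per-run budget guard can look like (the limits are illustrative):

```python
# Hypothetical per-run guardrails: cap spend and tool calls so an exploring
# agent fails fast instead of running up a bill.

class BudgetExceeded(Exception):
    pass

class RunBudget:
    def __init__(self, max_cost_usd: float = 0.50, max_tool_calls: int = 10):
        self.max_cost_usd = max_cost_usd
        self.max_tool_calls = max_tool_calls
        self.cost_usd = 0.0
        self.tool_calls = 0

    def charge(self, cost_usd: float) -> None:
        """Add the cost of a model or tool call; abort the run once over budget."""
        self.cost_usd += cost_usd
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost {self.cost_usd:.2f} USD over limit")

    def count_tool_call(self) -> None:
        """Count a tool call; abort the run once the cap is hit."""
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"{self.tool_calls} tool calls over limit")
```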

The “Monday checklist” (what to implement this week)

  1. Create a 50-case eval set from real tickets / requests
  2. Define pass/fail thresholds (accuracy + safety + cost)
  3. Run evals on every prompt/policy/tool change
  4. Deploy in shadow mode for 7 days before granting write access
  5. Add run IDs + tool-call logs + cost/latency metrics
  6. Add an alert for “forbidden tool call attempted”

Do those six things and you’ll be ahead of 90% of teams “doing agents.”

Sources:

  NIST AI Risk Management Framework
  OpenAI Evals
  OpenTelemetry semantic conventions
  OWASP Top 10 for LLM Applications (2025)

If you want help turning “agent demos” into production-grade automations (evals, guardrails, and observability included), tell me what your agent touches: CRM, email, ads, or data warehouse. I’ll tell you the safest first cut.