Strategy Saturday · AI Tooling

Stop Using One Model for Everything: A Practical Guide to AI Model Routing

Your AI stack doesn't need a single "best" model — it needs a router. Here's how to intelligently dispatch requests across multiple LLMs based on cost, complexity, and compliance, without rebuilding your stack every time a better model ships.

Published March 7, 2026 — 10 min read
TL;DR

Most teams send every AI request to one premium model, then complain about the bill. Model routing — automatically dispatching each request to the most appropriate LLM based on complexity, cost, and latency requirements — can cut token costs by 30–70% while maintaining quality. IDC forecasts that by 2028, 70% of top AI-driven enterprises will use multi-model routing architectures. The tooling to do this today is mature: LiteLLM, Portkey, RouteLLM, and Not-Diamond all ship production-ready routing. The catch is that routing adds a new layer of complexity — and if you don't instrument it properly, you'll create new failure modes faster than you fix old ones.

The One-Model Trap

When teams first ship an AI feature, they pick a model, wire it in, and ship. It works. Then the bill comes. Then the model gets deprecated. Then a new capability they need only exists in a different provider's API. Then a compliance requirement shows up saying certain data can't leave a specific cloud region.

By the time most teams realize they need flexibility, they've hardcoded model names across dozens of call sites. Switching anything requires a project.

The root problem is that no single model is optimal for every request. A simple "summarize this paragraph" task doesn't need the same model as a multi-step reasoning pipeline that writes and validates a financial analysis. Treating them the same way burns money and creates unnecessary latency.

The answer isn't to find a better single model. It's to build a routing layer — a traffic controller that looks at each incoming request and dispatches it to the right model for that specific job.

Why Model Routing Is Now a First-Class Architecture Decision

Two years ago, routing was a nice-to-have. Now it's structural. A few forces are converging: model release cycles keep shortening (and deprecations with them), capability is spreading across providers rather than concentrating in one, token costs scale with usage in ways that punish one-size-fits-all dispatch, and compliance requirements increasingly dictate where specific data can be processed.

IDC's 2026 AI and Automation FutureScape puts it plainly: by 2028, 70% of top AI-driven enterprises will use advanced multi-tool architectures to dynamically and autonomously manage model routing across diverse models. The teams building this now are getting ahead of a structural shift that's coming regardless.

Three Routing Strategies (and When to Use Each)

1. Rule-Based Routing

The simplest approach: write explicit conditions that map request characteristics to models. Short prompt + no tool use → lightweight model. Long context + structured output requirement → capable model. Contains financial data → on-premise or compliant-region model only.

Rule-based routing is fast to implement and easy to audit. The downside is brittleness — conditions need to be maintained as your use cases evolve, and they don't generalize well to novel request types.
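In Python, a rule set like the one above can be sketched as a plain function. The model names, length thresholds, and tag values here are illustrative assumptions, not recommendations:

```python
# Hypothetical rule-based router. Model names, thresholds, and tags
# are illustrative stand-ins for your own tiers and policies.
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    needs_tools: bool = False
    needs_structured_output: bool = False
    tags: set = field(default_factory=set)

def route(req: Request) -> str:
    # Compliance rules run first: they must override cost optimization.
    if "contains-financial-data" in req.tags:
        return "onprem-compliant-model"
    # Long context or structured output -> capable tier.
    if len(req.prompt) > 4000 or req.needs_structured_output:
        return "capable-model"
    # Short prompt, no tool use -> lightweight tier.
    if len(req.prompt) < 500 and not req.needs_tools:
        return "lightweight-model"
    return "standard-model"
```

Note the ordering: compliance conditions sit above every cost-based rule, so adding a new cheap tier later can't accidentally route regulated data to it.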

Best for: Teams just starting with routing. High-compliance environments where behavior must be deterministic and auditable. Stable workflows with well-understood request types.

2. Semantic Routing

Instead of matching on surface features, semantic routing embeds the incoming request and compares it against a set of pre-defined route definitions using vector similarity. A request semantically close to "generate code" gets routed differently than one close to "summarize legal document," even if neither matches a keyword rule exactly.

LiteLLM's auto-routing uses an embedding model to semantically match input against utterances you define in a YAML config. This is more flexible than keyword rules and handles ambiguous requests better — but it adds an extra embedding call per request, and the route definitions still need human curation.
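A minimal version of the idea, with a toy bag-of-words vector standing in for a real embedding model so the sketch runs without dependencies. The route names, utterances, and similarity threshold are all made up for illustration:

```python
# Toy semantic router. A real setup (e.g. LiteLLM auto-routing) calls
# an actual embedding model; here a bag-of-words Counter stands in.
import math
from collections import Counter

ROUTES = {
    "code-model": ["generate code", "write a python function", "fix this bug"],
    "legal-model": ["summarize legal document", "review this contract"],
}

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(prompt: str, default: str = "standard-model") -> str:
    q = embed(prompt)
    best_model, best_score = default, 0.3  # minimum similarity to leave default
    for model, utterances in ROUTES.items():
        for u in utterances:
            score = cosine(q, embed(u))
            if score > best_score:
                best_model, best_score = model, score
    return best_model
```

The threshold is doing real work: requests that aren't close to any defined utterance fall through to the default model rather than being forced into the nearest category.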

Best for: Varied request types within a defined domain. Teams that can invest time in defining route categories. Workflows where "close enough" routing is acceptable and performance can be measured against ground truth.

3. ML-Based Routing

The most sophisticated approach: train a small classifier (or use a pre-trained routing model) that predicts which model will produce the best outcome for a given request, based on historical performance data. The router learns from your actual usage — which model got better evals, which finished faster, which cost less for equivalent quality.

RouteLLM (from LMSYS, the group behind Chatbot Arena) is an open-source framework for this. Their research showed that trained routing models can reduce calls to expensive frontier models by 40–75% with minimal quality degradation on standard benchmarks. Not-Diamond takes a similar approach with a managed service layer — their router is trained on real model performance data and claims up to 30% cost reduction without writing a single routing rule yourself.
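To make the idea concrete, here is a deliberately naive sketch — not RouteLLM's actual method — that learns per-token "win rates" for a cheap model from labeled history and escalates to a frontier model when confidence is low. The model names, threshold, and label format are assumptions for illustration:

```python
# Naive learned router (illustrative only). History entries label
# whether the cheap model's output matched frontier quality; routing
# averages the win rate of a prompt's known tokens and escalates when
# the estimate falls below a threshold.
from collections import defaultdict

class LearnedRouter:
    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold
        self.wins = defaultdict(int)   # token -> times cheap model succeeded
        self.total = defaultdict(int)  # token -> times token was seen

    def train(self, history):
        # history: iterable of (prompt, cheap_model_was_good: bool)
        for prompt, ok in history:
            for tok in set(prompt.lower().split()):
                self.total[tok] += 1
                self.wins[tok] += int(ok)

    def win_probability(self, prompt: str) -> float:
        toks = [t for t in set(prompt.lower().split()) if self.total[t]]
        if not toks:
            return 0.0  # no signal: be conservative and escalate
        return sum(self.wins[t] / self.total[t] for t in toks) / len(toks)

    def route(self, prompt: str) -> str:
        return ("cheap-model" if self.win_probability(prompt) >= self.threshold
                else "frontier-model")
```

Real routers use proper classifiers over embeddings, but the shape is the same: a trained predictor, a confidence threshold, and a conservative default for unseen inputs.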

Best for: Teams with sufficient request volume to generate training data. Workflows where quality is measurable and you can close the feedback loop. Production systems where you're willing to invest in routing infrastructure as a first-class component.

Practical note: Most teams start with rule-based routing and evolve toward semantic or ML-based as complexity grows. Don't over-engineer the routing layer before you've shipped — a simple YAML config with three tiers (lightweight / standard / capable) beats a fancy ML router you haven't tuned yet.

The Fallback Chain: Your Safety Net

Routing doesn't eliminate provider failures — it makes handling them systematic. A fallback chain defines what happens when a model is unavailable, rate-limited, or returns an error: automatically retry with the next model in the chain.

LiteLLM's routing-load-balancing documentation covers this well: you define a priority order of models, and the router tries them in sequence on failure. LiteLLM uses Redis to track cooldown state across deployments, so a model that just hit a rate limit doesn't keep getting hammered.

A production fallback chain for a typical ops workflow might look like this:
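A minimal sketch of the mechanism, with hypothetical model names and an in-memory cooldown map standing in for the Redis-backed state LiteLLM manages for you:

```python
# Fallback chain sketch. Models are tried in priority order; a model
# that just failed is skipped for a cooldown window instead of being
# hammered with retries.
import time

COOLDOWN_SECONDS = 30
_cooldowns: dict[str, float] = {}  # model -> earliest retry time

def with_fallbacks(chain, call_model, request):
    last_error = None
    for model in chain:
        if _cooldowns.get(model, 0) > time.monotonic():
            continue  # recently failed; skip without retrying yet
        try:
            return model, call_model(model, request)
        except Exception as err:  # rate limit, outage, timeout, ...
            _cooldowns[model] = time.monotonic() + COOLDOWN_SECONDS
            last_error = err
    raise RuntimeError(f"all models in chain failed: {last_error}")
```

A chain like `["primary-model", "secondary-provider-model", "lightweight-model"]` gives you cross-provider redundancy plus a degraded-but-alive final tier.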

The escalation threshold is the key design decision. Too aggressive, and you'll route complex requests to cheap models that hallucinate. Too conservative, and you've rebuilt the one-model-for-everything problem.

The Tooling Landscape

| Tool | Routing Type | Best Fit | Trade-offs |
| --- | --- | --- | --- |
| LiteLLM | Rule-based + semantic + load balancing | Python-native teams who want full control | Most flexible; highest setup overhead. Self-hosted. |
| Portkey | Rule-based + semantic + gateway-level routing | Teams wanting routing + observability + guardrails in one layer | Managed service. Adds a dependency; excellent dashboard. |
| RouteLLM | ML-based (trained classifier) | Teams with volume who want to minimize frontier model usage | Open-source, research-grade. Needs training data to tune. |
| Not-Diamond | ML-based (managed, pre-trained) | Teams that want ML routing without training infrastructure | Managed API. Less customizable than RouteLLM. |
| Helicone | Rule-based + load balancing | Observability-first teams; Rust-based performance | 11 µs overhead. Better as observability gateway than pure router. |
| Cloudflare AI Gateway | Rule-based + caching + rate limiting | Edge-deployed apps; low-latency global routing | Best for latency optimization; less sophisticated on routing logic. |

Compliance and Data Sovereignty Routing

This is the routing use case that non-technical stakeholders care most about — and it's frequently underbuilt.

Enterprises handling regulated data (healthcare, financial, legal) increasingly need guarantees that specific request types don't transit certain networks or get processed by models hosted in non-compliant regions. A routing layer is the enforcement point for these policies.

Practically, this means adding metadata to requests — a "data classification" field on each request that the router reads and uses to enforce model selection rules. A request tagged contains-PII routes to an on-premise model or a provider with a signed DPA. An untagged request gets the normal cost-optimized path.

Portkey supports this natively through its gateway config. LiteLLM supports it through custom routing callbacks. If you're building compliance routing from scratch, treat the classification tag as a first-class input, not an afterthought.
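A sketch of the pattern, with made-up classification tags and model names — the point is that the tag is read before any cost-based routing runs, and unknown tags fail closed:

```python
# Compliance-aware routing sketch. Tags and allowlists are
# illustrative; your policy table comes from your compliance team.
ALLOWED_MODELS = {
    "contains-PII": ["onprem-model", "provider-with-dpa"],
    "regulated-financial-data": ["onprem-model"],
    "public": ["lightweight-model", "standard-model", "capable-model"],
}

def compliant_route(request: dict) -> str:
    classification = request.get("data_classification", "public")
    candidates = ALLOWED_MODELS.get(classification)
    if not candidates:
        # Unknown classification: fail closed, never fall through
        # to the cost-optimized path.
        raise ValueError(f"no routing policy for {classification!r}")
    return candidates[0]  # a real router would cost-optimize within the list
```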

What Good Routing Instrumentation Looks Like

A routing layer you can't observe is just a black box you added to your stack. At minimum, instrument five signals: route distribution (what percentage of requests hits each model), fallback rate (how often the primary model fails or rate-limits), quality delta by route (whether cheaper routes produce measurably worse outputs), cost per request by route (which surfaces routes that are routing wrong), and latency per model (since fastest and cheapest don't always align).
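The first two of those signals need nothing more than counters; a minimal in-memory sketch (a production setup would emit these to your tracing stack rather than hold them in process):

```python
# Minimal routing-metrics sketch: route distribution and fallback rate.
from collections import Counter

class RoutingMetrics:
    def __init__(self):
        self.route_counts = Counter()
        self.fallbacks = 0
        self.requests = 0

    def record(self, model: str, fell_back: bool):
        self.requests += 1
        self.route_counts[model] += 1
        self.fallbacks += int(fell_back)

    def route_distribution(self) -> dict:
        # fraction of total requests served by each model
        return {m: c / self.requests for m, c in self.route_counts.items()}

    def fallback_rate(self) -> float:
        return self.fallbacks / self.requests if self.requests else 0.0
```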

This connects directly to the AgentOps observability stack covered in last week's Tech Tuesday post — routing metrics should flow into the same tracing infrastructure as the rest of your agent pipeline.

The Decision Tree: Do You Need a Router Yet?

Not every team needs a routing layer on day one. Here's a simple filter: if you're calling more than one model (or wish you could), if the monthly token bill is a line item anyone complains about, if a single provider outage would take down your feature, or if any of your data carries compliance constraints on where it can be processed — build the router. If none of those apply yet, a single hardcoded model is fine; just wrap the call behind one interface so switching later doesn't require a project.

The Bottom Line

The "best model" debate is the wrong frame. Different requests need different models, and the teams that recognize this early build AI infrastructure that stays cost-effective and resilient as the model landscape keeps shifting.

Start simple: a three-tier rule-based setup with LiteLLM or Portkey, a fallback chain across two providers, and instrumentation on route distribution and fallback rate. That alone will expose 80% of the optimization opportunities in most stacks.

Then, once you have volume and measurement, look at semantic or ML-based routing for the next level of precision. The infrastructure is there. The question is just whether you've built the observability foundation to use it well.

Frequently Asked Questions

What is AI model routing and why does it matter in 2026?

AI model routing is the practice of automatically dispatching each AI request to the most appropriate language model based on factors like task complexity, cost, latency requirements, and compliance constraints — rather than sending everything to a single model. It matters because the modern AI landscape offers dozens of viable models with very different cost and capability profiles, and using the same premium model for every request wastes money without improving quality. IDC forecasts that 70% of top AI-driven enterprises will use multi-model routing architectures by 2028.

How much can AI model routing actually reduce LLM costs?

Organizations using model routers commonly report 30–70% cost reductions while maintaining output quality, according to practical benchmarks from routing tool vendors like MindStudio and Portkey. For specific workloads where lightweight models handle the majority of requests, some teams achieve up to 98% cost reduction on that request category. A customer support chatbot handling 100,000 daily requests that routes 80% of queries to cheaper models can drop from ~$4,500 to ~$1,500 per month — even with the routing overhead.

What is the difference between LiteLLM and Portkey for model routing?

LiteLLM is an open-source, self-hosted Python proxy that provides load balancing, semantic routing, and fallback chains across 100+ LLM providers — it gives maximum flexibility but requires more setup and ops overhead. Portkey is a managed service that combines routing with observability, guardrails, and prompt management in one platform — it's faster to get started and has a strong dashboard, but it introduces a vendor dependency. Both are production-ready; the choice depends on whether your team prefers full control (LiteLLM) or an integrated managed layer (Portkey).

How do I handle compliance requirements in an AI model routing setup?

Compliance routing works by attaching a data classification tag to each AI request — for example, contains-PII or regulated-financial-data — and writing routing rules that enforce which models are allowed to process each classification. Requests tagged as sensitive are forced to on-premise deployments or providers with signed DPAs, while untagged requests follow the normal cost-optimized path. Both Portkey and LiteLLM support this pattern natively; the critical design principle is to treat the classification tag as a first-class routing input, not metadata you add later.

When should I use ML-based routing vs. rule-based routing?

Rule-based routing is the right starting point for most teams — it's fast to implement, easy to audit, and deterministic. ML-based routing (tools like RouteLLM or Not-Diamond) makes sense once you have sufficient request volume to generate useful training data and a quality measurement system to close the feedback loop. LMSYS's RouteLLM research demonstrates that trained routing classifiers can reduce frontier model calls by 40–75% with minimal quality degradation — but only if you have the infrastructure to train, evaluate, and continuously update the routing model.

What metrics should I track for a production AI model routing layer?

The five core routing metrics are: route distribution (what percentage of requests hit each model), fallback rate (how often the primary model fails or rate-limits), quality delta by route (are cheaper routes producing measurably worse outputs), cost per request by route (finding routes that are routing wrong), and latency per model (since fastest and cheapest don't always align). These metrics should feed into the same observability stack as the rest of your agent pipeline — this post's companion piece on AgentOps observability covers the tracing infrastructure in more detail.


Building a multi-model AI stack and not sure where to start with routing? Drop me a note — happy to talk through your use case and help you design a routing layer that doesn't become its own maintenance burden.