Strategy Saturday · AI Tooling

Stop Using One Model for Everything: A Practical Guide to AI Model Routing

Your AI stack doesn't need a single "best" model — it needs a router. Here's how to intelligently dispatch requests across multiple LLMs based on cost, complexity, and compliance, without rebuilding your stack every time a better model ships.

Published March 7, 2026 — 10 min read
TL;DR

Most teams send every AI request to one premium model, then complain about the bill. Model routing — automatically dispatching each request to the most appropriate LLM based on complexity, cost, and latency requirements — can cut token costs by 30–70% while maintaining quality. IDC forecasts that by 2028, 70% of top AI-driven enterprises will use multi-model routing architectures. The tooling to do this today is mature: LiteLLM, Portkey, RouteLLM, and Not-Diamond all ship production-ready routing. The catch is that routing adds a new layer of complexity — and if you don't instrument it properly, you'll create new failure modes faster than you fix old ones.

The One-Model Trap

When teams first ship an AI feature, they pick a model, wire it in, and ship. It works. Then the bill comes. Then the model gets deprecated. Then a new capability they need only exists in a different provider's API. Then a compliance requirement shows up saying certain data can't leave a specific cloud region.

By the time most teams realize they need flexibility, they've hardcoded model names across dozens of call sites. Switching anything requires a project.

The root problem is that no single model is optimal for every request. A simple "summarize this paragraph" task doesn't need the same model as a multi-step reasoning pipeline that writes and validates a financial analysis. Treating them the same way burns money and creates unnecessary latency.

The answer isn't to find a better single model. It's to build a routing layer — a traffic controller that looks at each incoming request and dispatches it to the right model for that specific job.

Why Model Routing Is Now a First-Class Architecture Decision

Two years ago, routing was a nice-to-have. Now it's structural. A few forces are converging: model release cycles keep shortening (and deprecations with them), capability is spreading across providers rather than concentrating in one, token costs scale with usage in ways that punish one-size-fits-all dispatch, and compliance requirements increasingly dictate where specific data can be processed.

IDC's 2026 AI and Automation FutureScape puts it plainly: by 2028, 70% of top AI-driven enterprises will use advanced multi-tool architectures to dynamically and autonomously manage model routing across diverse models. The teams building this now are getting ahead of a structural shift that's coming regardless.

Three Routing Strategies (and When to Use Each)

1. Rule-Based Routing

The simplest approach: write explicit conditions that map request characteristics to models. Short prompt + no tool use → lightweight model. Long context + structured output requirement → capable model. Contains financial data → on-premise or compliant-region model only.

Rule-based routing is fast to implement and easy to audit. The downside is brittleness — conditions need to be maintained as your use cases evolve, and they don't generalize well to novel request types.
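In Python, a rule set like the one above can be sketched as a plain function. The model names, length thresholds, and tag values here are illustrative assumptions, not recommendations:

```python
# Hypothetical rule-based router. Model names, thresholds, and tags
# are illustrative stand-ins for your own tiers and policies.
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    needs_tools: bool = False
    needs_structured_output: bool = False
    tags: set = field(default_factory=set)

def route(req: Request) -> str:
    # Compliance rules run first: they must override cost optimization.
    if "contains-financial-data" in req.tags:
        return "onprem-compliant-model"
    # Long context or structured output -> capable tier.
    if len(req.prompt) > 4000 or req.needs_structured_output:
        return "capable-model"
    # Short prompt, no tool use -> lightweight tier.
    if len(req.prompt) < 500 and not req.needs_tools:
        return "lightweight-model"
    return "standard-model"
```

Note the ordering: compliance conditions sit above every cost-based rule, so adding a new cheap tier later can't accidentally route regulated data to it.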

Best for: Teams just starting with routing. High-compliance environments where behavior must be deterministic and auditable. Stable workflows with well-understood request types.

2. Semantic Routing

Instead of matching on surface features, semantic routing embeds the incoming request and compares it against a set of pre-defined route definitions using vector similarity. A request semantically close to "generate code" gets routed differently than one close to "summarize legal document," even if neither matches a keyword rule exactly.

LiteLLM's auto-routing uses an embedding model to semantically match input against utterances you define in a YAML config. This is more flexible than keyword rules and handles ambiguous requests better — but it adds an extra embedding call per request, and the route definitions still need human curation.
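A minimal version of the idea, with a toy bag-of-words vector standing in for a real embedding model so the sketch runs without dependencies. The route names, utterances, and similarity threshold are all made up for illustration:

```python
# Toy semantic router. A real setup (e.g. LiteLLM auto-routing) calls
# an actual embedding model; here a bag-of-words Counter stands in.
import math
from collections import Counter

ROUTES = {
    "code-model": ["generate code", "write a python function", "fix this bug"],
    "legal-model": ["summarize legal document", "review this contract"],
}

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def route(prompt: str, default: str = "standard-model") -> str:
    q = embed(prompt)
    best_model, best_score = default, 0.3  # minimum similarity to leave default
    for model, utterances in ROUTES.items():
        for u in utterances:
            score = cosine(q, embed(u))
            if score > best_score:
                best_model, best_score = model, score
    return best_model
```

The threshold is doing real work: requests that aren't close to any defined utterance fall through to the default model rather than being forced into the nearest category.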

Best for: Varied request types within a defined domain. Teams that can invest time in defining route categories. Workflows where "close enough" routing is acceptable and performance can be measured against ground truth.

3. ML-Based Routing

The most sophisticated approach: train a small classifier (or use a pre-trained routing model) that predicts which model will produce the best outcome for a given request, based on historical performance data. The router learns from your actual usage — which model got better evals, which finished faster, which cost less for equivalent quality.

RouteLLM (from LMSYS, the group behind Chatbot Arena) is an open-source framework for this. Their research showed that trained routing models can reduce calls to expensive frontier models by 40–75% with minimal quality degradation on standard benchmarks. Not-Diamond takes a similar approach with a managed service layer — their router is trained on real model performance data and claims up to 30% cost reduction without writing a single routing rule yourself.
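To make the idea concrete, here is a deliberately naive sketch — not RouteLLM's actual method — that learns per-token "win rates" for a cheap model from labeled history and escalates to a frontier model when confidence is low. The model names, threshold, and label format are assumptions for illustration:

```python
# Naive learned router (illustrative only). History entries label
# whether the cheap model's output matched frontier quality; routing
# averages the win rate of a prompt's known tokens and escalates when
# the estimate falls below a threshold.
from collections import defaultdict

class LearnedRouter:
    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold
        self.wins = defaultdict(int)   # token -> times cheap model succeeded
        self.total = defaultdict(int)  # token -> times token was seen

    def train(self, history):
        # history: iterable of (prompt, cheap_model_was_good: bool)
        for prompt, ok in history:
            for tok in set(prompt.lower().split()):
                self.total[tok] += 1
                self.wins[tok] += int(ok)

    def win_probability(self, prompt: str) -> float:
        toks = [t for t in set(prompt.lower().split()) if self.total[t]]
        if not toks:
            return 0.0  # no signal: be conservative and escalate
        return sum(self.wins[t] / self.total[t] for t in toks) / len(toks)

    def route(self, prompt: str) -> str:
        return ("cheap-model" if self.win_probability(prompt) >= self.threshold
                else "frontier-model")
```

Real routers use proper classifiers over embeddings, but the shape is the same: a trained predictor, a confidence threshold, and a conservative default for unseen inputs.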

Best for: Teams with sufficient request volume to generate training data. Workflows where quality is measurable and you can close the feedback loop. Production systems where you're willing to invest in routing infrastructure as a first-class component.

Practical note: Most teams start with rule-based routing and evolve toward semantic or ML-based as complexity grows. Don't over-engineer the routing layer before you've shipped — a simple YAML config with three tiers (lightweight / standard / capable) beats a fancy ML router you haven't tuned yet.

The Fallback Chain: Your Safety Net

Routing doesn't eliminate provider failures — it makes handling them systematic. A fallback chain defines what happens when a model is unavailable, rate-limited, or returns an error: automatically retry with the next model in the chain.

LiteLLM's routing-load-balancing documentation covers this well: you define a priority order of models, and the router tries them in sequence on failure. LiteLLM uses Redis to track cooldown state across deployments, so a model that just hit a rate limit doesn't keep getting hammered.

A production fallback chain for a typical ops workflow might look like this:
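A minimal sketch of the mechanism, with hypothetical model names and an in-memory cooldown map standing in for the Redis-backed state LiteLLM manages for you:

```python
# Fallback chain sketch. Models are tried in priority order; a model
# that just failed is skipped for a cooldown window instead of being
# hammered with retries.
import time

COOLDOWN_SECONDS = 30
_cooldowns: dict[str, float] = {}  # model -> earliest retry time

def with_fallbacks(chain, call_model, request):
    last_error = None
    for model in chain:
        if _cooldowns.get(model, 0) > time.monotonic():
            continue  # recently failed; skip without retrying yet
        try:
            return model, call_model(model, request)
        except Exception as err:  # rate limit, outage, timeout, ...
            _cooldowns[model] = time.monotonic() + COOLDOWN_SECONDS
            last_error = err
    raise RuntimeError(f"all models in chain failed: {last_error}")
```

A chain like `["primary-model", "secondary-provider-model", "lightweight-model"]` gives you cross-provider redundancy plus a degraded-but-alive final tier.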

The escalation threshold is the key design decision. Too aggressive, and you'll route complex requests to cheap models that hallucinate. Too conservative, and you've rebuilt the one-model-for-everything problem.

The Tooling Landscape

| Tool | Routing Type | Best Fit | Trade-offs |
| --- | --- | --- | --- |
| LiteLLM | Rule-based + semantic + load balancing | Python-native teams who want full control | Most flexible; highest setup overhead. Self-hosted. |
| Portkey | Rule-based + semantic + gateway-level routing | Teams wanting routing + observability + guardrails in one layer | Managed service. Adds a dependency; excellent dashboard. |
| RouteLLM | ML-based (trained classifier) | Teams with volume who want to minimize frontier model usage | Open-source, research-grade. Needs training data to tune. |
| Not-Diamond | ML-based (managed, pre-trained) | Teams that want ML routing without training infrastructure | Managed API. Less customizable than RouteLLM. |
| Helicone | Rule-based + load balancing | Observability-first teams; Rust-based performance | 11 µs overhead. Better as observability gateway than pure router. |
| Cloudflare AI Gateway | Rule-based + caching + rate limiting | Edge-deployed apps; low-latency global routing | Best for latency optimization; less sophisticated on routing logic. |

Compliance and Data Sovereignty Routing

This is the routing use case that non-technical stakeholders care most about — and it's frequently underbuilt.

Enterprises handling regulated data (healthcare, financial, legal) increasingly need guarantees that specific request types don't transit certain networks or get processed by models hosted in non-compliant regions. A routing layer is the enforcement point for these policies.

Practically, this means adding metadata to requests — a "data classification" field on each request that the router reads and uses to enforce model selection rules. A request tagged contains-PII routes to an on-premise model or a provider with a signed DPA. An untagged request gets the normal cost-optimized path.

Portkey supports this natively through its gateway config. LiteLLM supports it through custom routing callbacks. If you're building compliance routing from scratch, treat the classification tag as a first-class input, not an afterthought.
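A sketch of the pattern, with made-up classification tags and model names — the point is that the tag is read before any cost-based routing runs, and unknown tags fail closed:

```python
# Compliance-aware routing sketch. Tags and allowlists are
# illustrative; your policy table comes from your compliance team.
ALLOWED_MODELS = {
    "contains-PII": ["onprem-model", "provider-with-dpa"],
    "regulated-financial-data": ["onprem-model"],
    "public": ["lightweight-model", "standard-model", "capable-model"],
}

def compliant_route(request: dict) -> str:
    classification = request.get("data_classification", "public")
    candidates = ALLOWED_MODELS.get(classification)
    if not candidates:
        # Unknown classification: fail closed, never fall through
        # to the cost-optimized path.
        raise ValueError(f"no routing policy for {classification!r}")
    return candidates[0]  # a real router would cost-optimize within the list
```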

What Good Routing Instrumentation Looks Like

A routing layer you can't observe is just a black box you added to your stack. At minimum, instrument five signals: route distribution (what percentage of requests hits each model), fallback rate (how often the primary model fails or rate-limits), quality delta by route (whether cheaper routes produce measurably worse outputs), cost per request by route (which surfaces routes that are routing wrong), and latency per model (since fastest and cheapest don't always align).
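The first two of those signals need nothing more than counters; a minimal in-memory sketch (a production setup would emit these to your tracing stack rather than hold them in process):

```python
# Minimal routing-metrics sketch: route distribution and fallback rate.
from collections import Counter

class RoutingMetrics:
    def __init__(self):
        self.route_counts = Counter()
        self.fallbacks = 0
        self.requests = 0

    def record(self, model: str, fell_back: bool):
        self.requests += 1
        self.route_counts[model] += 1
        self.fallbacks += int(fell_back)

    def route_distribution(self) -> dict:
        # fraction of total requests served by each model
        return {m: c / self.requests for m, c in self.route_counts.items()}

    def fallback_rate(self) -> float:
        return self.fallbacks / self.requests if self.requests else 0.0
```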

This connects directly to the AgentOps observability stack covered in last week's Tech Tuesday post — routing metrics should flow into the same tracing infrastructure as the rest of your agent pipeline.

The Decision Tree: Do You Need a Router Yet?

Not every team needs a routing layer on day one. Here's a simple filter: if you're calling more than one model (or wish you could), if the monthly token bill is a line item anyone complains about, if a single provider outage would take down your feature, or if any of your data carries compliance constraints on where it can be processed — build the router. If none of those apply yet, a single hardcoded model is fine; just wrap the call behind one interface so switching later doesn't require a project.

The Bottom Line

The "best model" debate is the wrong frame. Different requests need different models, and the teams that recognize this early build AI infrastructure that stays cost-effective and resilient as the model landscape keeps shifting.

Start simple: a three-tier rule-based setup with LiteLLM or Portkey, a fallback chain across two providers, and instrumentation on route distribution and fallback rate. That alone will expose 80% of the optimization opportunities in most stacks.

Then, once you have volume and measurement, look at semantic or ML-based routing for the next level of precision. The infrastructure is there. The question is just whether you've built the observability foundation to use it well.

Frequently Asked Questions

What is AI model routing and why does it matter in 2026?

AI model routing is the practice of automatically dispatching each AI request to the most appropriate language model based on factors like task complexity, cost, latency requirements, and compliance constraints — rather than sending everything to a single model. It matters because the modern AI landscape offers dozens of viable models with very different cost and capability profiles, and using the same premium model for every request wastes money without improving quality. IDC forecasts that 70% of top AI-driven enterprises will use multi-model routing architectures by 2028.

How much can AI model routing actually reduce LLM costs?

Organizations using model routers commonly report 30–70% cost reductions while maintaining output quality, according to practical benchmarks from routing tool vendors like MindStudio and Portkey. For specific workloads where lightweight models handle the majority of requests, some teams achieve up to 98% cost reduction on that request category. A customer support chatbot handling 100,000 daily requests that routes 80% of queries to cheaper models can drop from ~$4,500 to ~$1,500 per month — even with the routing overhead.

What is the difference between LiteLLM and Portkey for model routing?

LiteLLM is an open-source, self-hosted Python proxy that provides load balancing, semantic routing, and fallback chains across 100+ LLM providers — it gives maximum flexibility but requires more setup and ops overhead. Portkey is a managed service that combines routing with observability, guardrails, and prompt management in one platform — it's faster to get started and has a strong dashboard, but it introduces a vendor dependency. Both are production-ready; the choice depends on whether your team prefers full control (LiteLLM) or an integrated managed layer (Portkey).

How do I handle compliance requirements in an AI model routing setup?

Compliance routing works by attaching a data classification tag to each AI request — for example, contains-PII or regulated-financial-data — and writing routing rules that enforce which models are allowed to process each classification. Requests tagged as sensitive are forced to on-premise deployments or providers with signed DPAs, while untagged requests follow the normal cost-optimized path. Both Portkey and LiteLLM support this pattern natively; the critical design principle is to treat the classification tag as a first-class routing input, not metadata you add later.

When should I use ML-based routing vs. rule-based routing?

Rule-based routing is the right starting point for most teams — it's fast to implement, easy to audit, and deterministic. ML-based routing (tools like RouteLLM or Not-Diamond) makes sense once you have sufficient request volume to generate useful training data and a quality measurement system to close the feedback loop. LMSYS's RouteLLM research demonstrates that trained routing classifiers can reduce frontier model calls by 40–75% with minimal quality degradation — but only if you have the infrastructure to train, evaluate, and continuously update the routing model.

What metrics should I track for a production AI model routing layer?

The five core routing metrics are: route distribution (what percentage of requests hit each model), fallback rate (how often the primary model fails or rate-limits), quality delta by route (are cheaper routes producing measurably worse outputs), cost per request by route (finding routes that are routing wrong), and latency per model (since fastest and cheapest don't always align). These metrics should feed into the same observability stack as the rest of your agent pipeline — this post's companion piece on AgentOps observability covers the tracing infrastructure in more detail.


Building a multi-model AI stack and not sure where to start with routing? Drop me a note — happy to talk through your use case and help you design a routing layer that doesn't become its own maintenance burden.