Cutting LLM Cost Without Cutting Quality: Model Routing + Caching
Photo: Unsplash
The default architecture for an LLM feature is "send everything to the best model and hope the invoice is survivable." It's a fine way to ship a prototype and a terrible way to run a business. The dirty secret of LLM cost is that most of your requests don't need your most expensive model — they're simple classifications, short rewrites, and FAQ-shaped questions that a model a fraction of the price answers just as well.
The two highest-leverage cost techniques don't touch quality at all, because they don't change what you answer — only how you route and reuse. Model routing sends each request to the cheapest model that can handle it. Caching avoids paying for work you've already done. Stack them and a realistic production workload sheds the majority of its spend with no measurable quality regression. The catch is doing it without quietly degrading the hard requests — so let's build it carefully.
Why one model for everything is wasteful
Frontier models are priced for frontier tasks: multi-step reasoning, nuanced writing, hard tool use. But the request mix of a typical app is bimodal. A large chunk is genuinely easy — "is this message spam?", "summarize this in one line", "what are your hours?" — and a smaller chunk is genuinely hard. Pricing across the major providers spans more than an order of magnitude between their small and large models, so routing the easy bucket to a cheap model is the single biggest lever you have.
The published price ladders — the OpenAI pricing page and Anthropic's model overview — make the spread concrete. The mistake is paying top-tier rates for bottom-tier difficulty.
Routing: send each request to the right tier
A router is a cheap classifier that runs before your main call and decides which model handles the request. The trick is making the routing decision itself cheap — if you spend a big-model call deciding which model to use, you've lost the game. Three approaches, in increasing sophistication:
Heuristic routing. Cheapest and surprisingly effective. Route on signals you already have: input length, the endpoint, presence of code, whether tools are needed.
def route(request) -> str:
if request.task == "classify" or len(request.text) < 500:
return "small" # cheap model handles short, simple tasks
if request.needs_tools or request.task == "reasoning":
return "large" # reserve the expensive model for hard work
return "medium"
MODELS = {"small": "gpt-4o-mini", "medium": "claude-sonnet-4-5", "large": "o3"}
response = call_model(MODELS[route(request)], request)Classifier routing. Train a small, fast classifier (even a fine-tuned embedding model) to predict difficulty from the prompt. More accurate than heuristics, still near-free per call.
Cascade / escalation routing. Try the cheap model first, then verify its answer and escalate to a bigger model only if confidence is low. This is the most quality-preserving pattern because hard requests still reach the strong model — they just take a cheap detour first.
def cascade(request):
cheap = call_model(MODELS["small"], request)
if is_confident(cheap): # self-rated, schema-validated, or heuristic
return cheap
return call_model(MODELS["large"], request) # escalate only when neededRouting only saves money if the cheap model actually handles its share. Audit routed-down requests against your evals — a router that quietly degrades 5% of answers to save 40% cost is a bad trade you won't see without measurement. Tie the router to your eval suite, not to vibes.
Caching: stop paying twice
Routing picks the cheapest model; caching avoids calling any model at all. There are two distinct layers, and you want both.
Exact-match response cache. If you've answered this exact input before, return the stored response. Trivial to build with a hash key, and shockingly effective for workloads with repeated queries (FAQs, popular documents, retried requests).
import hashlib, json
def cache_key(model: str, messages: list) -> str:
blob = model + json.dumps(messages, sort_keys=True)
return "llm:" + hashlib.sha256(blob.encode()).hexdigest()
def cached_call(redis, model, messages, ttl=3600):
key = cache_key(model, messages)
if hit := redis.get(key):
return json.loads(hit) # zero model cost
resp = call_model(model, messages)
redis.setex(key, ttl, json.dumps(resp))
return respSemantic cache. Exact-match misses on paraphrases — "what are your hours" vs "when are you open". A semantic cache embeds the query and returns a cached answer when a previous query is close enough in vector space. Powerful, but set the similarity threshold conservatively: too loose and you'll serve a confidently wrong cached answer to a subtly different question.
Provider prompt caching. A third, complementary layer: providers discount the repeated prefix of your prompts (system instructions, tool schemas). It stacks on top of your own caching — see OpenAI prompt caching and Anthropic prompt caching.
Stack the layers. Provider prompt caching cuts the cost of the calls you do make; your response cache eliminates calls entirely; routing makes the remaining calls cheaper. They compound — each one multiplies the savings of the others.
Measure quality, not just the invoice
The whole premise is "without cutting quality," so you must prove you didn't. Before rolling out routing or semantic caching, fix a baseline by running your eval suite against the all-frontier-model setup. Then run the same suite against the optimized pipeline. If the score holds within noise, the savings are real and free. If it drops, you've found a request class that was routed or cached too aggressively — tighten that rule and re-measure.
Instrument production with the cost and quality dashboard side by side: cost-per-request, cache hit rate, routing distribution, and your live eval score. Optimize the first three only while the fourth stays flat. The goal isn't the cheapest possible bill — it's the cheapest bill that holds your quality bar.
Takeaways
- Sending every request to your biggest model is the default and the waste; request difficulty is bimodal.
- Route each request to the cheapest tier that handles it — heuristics first, cascades when quality matters most.
- Layer caching: exact-match response cache, semantic cache (with a conservative threshold), and provider prompt caching.
- The layers compound — routing, response caching, and prompt caching each multiply the others' savings.
- Gate every optimization on your eval suite; cost wins only count if the quality score stays flat.

