Prompt Caching: The Optimization Most LLM Teams Skip

Most teams optimize their LLM stack from the wrong end. They swap models, fiddle with temperature, and rewrite prompts to shave a few tokens — while ignoring the single biggest lever for repeated workloads: prompt caching. If you send the same 8,000-token system prompt, tool schema, or document on every request, you are paying full price to re-process bytes the provider already saw seconds ago.

Prompt caching fixes that. It lets the provider store the internal representation of a stable prefix and reuse it across calls, so you pay a steep discount on the cached portion and the model starts generating sooner. For RAG systems, agents, and chat apps with long instructions, the savings are not marginal — they are the difference between a viable unit economics model and one that bleeds money.

What prompt caching actually does

When a model processes your prompt, it builds a key-value attention cache for every token. Normally that work is thrown away after the response. Prompt caching persists the cache for a stable prefix of your prompt and reuses it on subsequent requests that share the exact same prefix bytes.

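The key word is prefix. Caching works left-to-right. If your prompt is [system instructions] + [tool definitions] + [retrieved docs] + [user message], the cache covers everything up to the first byte that changes. Put your volatile content (the user's message) at the end, and your stable content (instructions, schemas) at the front. Reorder them and you defeat the whole mechanism. Some APIs fix the order of top-level request fields for you (Anthropic, for example, builds the prefix from tools, then system, then messages), so check the provider's documented order and keep volatile content out of whichever block comes first.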
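To make the ordering effect concrete, here is a minimal sketch that measures how much of two prompts is shared when the same parts are assembled in different orders. The prompt parts and the helper names (`shared_prefix_len`, `stable_first`, `volatile_first`) are hypothetical, and it compares characters where a real cache compares tokens.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        n += 1
    return n


# Hypothetical prompt parts, for illustration only.
SYSTEM = "You are a support agent. Follow the policy below.\n" * 20
TOOLS = '{"name": "lookup_order", "parameters": {"order_id": "string"}}\n'


def stable_first(user_message: str) -> str:
    """Stable content at the front, volatile user message last."""
    return SYSTEM + TOOLS + user_message


def volatile_first(user_message: str) -> str:
    """Same parts, but the volatile user message leads."""
    return user_message + SYSTEM + TOOLS


q1, q2 = "Where is my order?", "How do I get a refund?"

good = shared_prefix_len(stable_first(q1), stable_first(q2))
bad = shared_prefix_len(volatile_first(q1), volatile_first(q2))

print(f"stable-first shares {good} characters")
print(f"volatile-first shares {bad} characters")
```

With the user message at the end, the whole instruction and tool block is reusable; with it at the front, the two requests diverge at the first character and nothing is.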

Both major providers expose this. Anthropic uses explicit cache_control breakpoints you place in the message structure — see the Anthropic prompt caching docs. OpenAI applies caching automatically for long prompts on supported models — see the OpenAI prompt caching guide. The economics differ but the principle is identical.

The numbers that make it worth it

Cached input tokens are dramatically cheaper than fresh ones — often a 90% discount on the cached portion — and they skip the prefill compute, which is where most of your time-to-first-token goes for long prompts.

Consider an agent with a 12,000-token system prompt and tool schema, handling 50,000 requests a day:

# Rough cost model — illustrative rates, not a price quote
STABLE_TOKENS = 12_000
REQUESTS_PER_DAY = 50_000
 
FRESH_RATE = 3.00 / 1_000_000      # $ per input token (example)
CACHED_RATE = 0.30 / 1_000_000     # 90% cheaper when cached
 
uncached = STABLE_TOKENS * REQUESTS_PER_DAY * FRESH_RATE
cached = STABLE_TOKENS * REQUESTS_PER_DAY * CACHED_RATE
 
print(f"Uncached: ${uncached:,.0f}/day")   # Uncached: $1,800/day
print(f"Cached:   ${cached:,.0f}/day")      # Cached:   $180/day
print(f"Saved:    ${uncached - cached:,.0f}/day")

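That is the saving on the stable prefix alone, assuming every request hits a warm cache; the user message and output tokens are billed as usual. The latency win is just as real: skipping prefill on 12k tokens can cut time-to-first-token noticeably, since prefill is where most of that time goes for long prompts, and users feel the difference directly.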

Order your prompt from most-stable to least-stable: system instructions, then tool definitions, then long shared context, then the user's turn last. Caching is prefix-based — a single early byte change invalidates everything after it.

How to structure prompts for cache hits

The failure mode is subtle: people inject a timestamp, a request ID, or a per-user greeting near the top of the system prompt and silently destroy their hit rate. Anything dynamic must live after everything you want cached.
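One way to catch this before it ships is a small lint check over the text you intend to cache. The sketch below is only an assumption about what such a check could look like: the two patterns and the function name `find_volatile_values` are illustrative, and a real template will need its own patterns.

```python
import re

# Illustrative patterns only; extend them for your own prompt templates.
VOLATILE_PATTERNS = {
    "iso_timestamp": re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
        r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
}


def find_volatile_values(prefix: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_text) pairs found in a supposedly stable prefix."""
    hits = []
    for name, pattern in VOLATILE_PATTERNS.items():
        for match in pattern.finditer(prefix):
            hits.append((name, match.group(0)))
    return hits


bad_prefix = (
    "Current time: 2024-05-01T12:30:00\n"
    "Request 123e4567-e89b-12d3-a456-426614174000\n"
    "You are a helpful assistant."
)
clean_prefix = "You are a helpful assistant. Answer concisely."

for name, text in find_volatile_values(bad_prefix):
    print(f"volatile value in prefix ({name}): {text}")
```

Running a check like this in CI against the rendered system prompt is cheap, and an empty result for `clean_prefix` is what you want to see.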

With Anthropic's explicit model, you mark the cache breakpoint at the end of your stable block:

import anthropic
 
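client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholders so the snippet runs end to end; substitute your own values.
# Prompts shorter than the model's minimum cacheable length are not cached.
LONG_STABLE_INSTRUCTIONS = "..."  # your real, byte-identical instructions
user_message = "Where is my order?"
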
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_INSTRUCTIONS,  # 12k tokens, identical every call
            "cache_control": {"type": "ephemeral"},  # cache breakpoint here
        }
    ],
    messages=[
        {"role": "user", "content": user_message}  # volatile, goes last
    ],
)
 
usage = response.usage
print(usage.cache_creation_input_tokens)  # written to cache (first call)
print(usage.cache_read_input_tokens)      # served from cache (later calls)

Watch those two usage fields. cache_creation_input_tokens is what you paid to write the cache; cache_read_input_tokens is what you served cheaply. If reads stay at zero across requests that should share a prefix, your prefix is not actually stable — go hunt for the byte that changes.
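A cheap aid for that hunt is to log a fingerprint of the exact prefix you send alongside the usage fields. This is a sketch under the assumption that you can get at the rendered prefix string before the call; `prefix_fingerprint` and `count_distinct_prefixes` are hypothetical helper names, not part of any SDK.

```python
import hashlib


def prefix_fingerprint(stable_prefix: str) -> str:
    """Short hash of the exact bytes sent as the cached prefix."""
    digest = hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()
    return digest[:12]


def count_distinct_prefixes(prefixes: list[str]) -> int:
    """How many different prefixes were actually sent; 1 means truly stable."""
    return len({prefix_fingerprint(p) for p in prefixes})


base = "You are a support agent.\nAlways cite the policy section."
sent = [base, base, base + " "]  # the third carries a stray trailing space

for p in sent:
    print(prefix_fingerprint(p))
print("distinct prefixes:", count_distinct_prefixes(sent))
```

If the fingerprint changes between requests that should share a prefix, the problem is in your prompt assembly rather than in the provider's cache.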

Cache lifetime and warm-up

Caches expire. The default time-to-live is short (on the order of a few minutes of inactivity), so caching pays off when requests arrive in bursts or steady streams, not when they trickle in once an hour. Some providers offer extended TTLs at a higher write cost — worth it only if your traffic gap exceeds the default window. For predictable low-traffic endpoints, a cheap trick is a periodic keep-alive request to keep the prefix warm.
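The effect of traffic shape on hit rate can be sketched with a tiny simulation. It assumes a single prefix, a 300-second TTL, and a TTL that restarts on every use; all three are assumptions for illustration, so check your provider's actual expiry rules before relying on the numbers.

```python
def simulate_cache(arrival_times_s: list[float], ttl_s: float = 300.0) -> dict[str, int]:
    """Count cache writes and reads for one prefix.

    Assumption: the TTL restarts on every write or read.
    """
    writes = 0
    reads = 0
    expires_at = None  # no cache entry yet
    for t in sorted(arrival_times_s):
        if expires_at is not None and t < expires_at:
            reads += 1   # entry still warm
        else:
            writes += 1  # cold or expired: pay for a write
        expires_at = t + ttl_s
    return {"writes": writes, "reads": reads}


steady = [i * 60.0 for i in range(10)]    # one request per minute
sparse = [i * 3600.0 for i in range(10)]  # one request per hour

print("steady:", simulate_cache(steady))
print("sparse:", simulate_cache(sparse))
```

The steady stream pays for one write and then reads every time, while the hourly trickle rewrites the cache on every request. Adding keep-alive timestamps to the `sparse` list is a quick way to estimate whether the extra calls pay for themselves.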

When caching does and doesn't help

It is not free and it is not universal. Where the provider charges a premium for cache writes, as Anthropic does, the first request to a cold cache costs more than a normal request. Caching wins when reuse amortizes that write cost across many reads.
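The amortization argument can be worked through with a short calculation. The multipliers below (1.25x for a write, 0.10x for a read) are illustrative assumptions rather than a price quote, and `relative_cost` is a hypothetical helper that models one cold write followed by reads within the TTL.

```python
def relative_cost(n_requests: int, write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Prefix cost with caching divided by prefix cost without it.

    Assumes one cold write followed by n_requests - 1 cache reads,
    all arriving before the entry expires. Values below 1.0 mean caching is cheaper.
    """
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    cached = write_mult + read_mult * (n_requests - 1)
    uncached = float(n_requests)
    return cached / uncached


for n in (1, 2, 10, 100):
    print(f"{n:>3} requests: {relative_cost(n):.3f}x the uncached prefix cost")
```

Under these assumed multipliers a single request is more expensive with caching, and the second request within the TTL already brings the total below the uncached cost.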

It helps a lot when:

You have a large, identical system prompt or tool schema on every call (agents, classifiers).
You run multi-turn conversations where the history grows but the prefix stays fixed.
You do RAG over a shared corpus chunk reused across many user questions.

It helps little or hurts when:

Every request has unique long context with no shared prefix.
Traffic is so sparse the cache always expires before the next hit.
Your "stable" prefix actually carries per-request data you forgot to move.

Caching changes nothing about output quality or correctness — it only reuses computation on identical input bytes. Treat it as an infrastructure optimization, and measure hit rate in production rather than assuming it works.

Measuring it in production

Instrument every call to log cache_read_input_tokens and cache_creation_input_tokens. Your north-star metric is cache hit rate: cached read tokens divided by total input tokens that could have been cached. A healthy agent workload should sit well above 80%. If it doesn't, the usual culprits are prompt reordering, non-deterministic serialization of tool schemas (dict ordering!), or a dynamic value sneaking into the prefix.
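A hit-rate calculation over logged usage might look like the sketch below. It assumes that "tokens that could have been cached" means cache reads plus cache writes, which is one reasonable reading of the definition above; `UsageRecord` and `cache_hit_rate` are hypothetical names for your own logging layer.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """The two cache fields logged from each response's usage object."""
    cache_read_input_tokens: int
    cache_creation_input_tokens: int


def cache_hit_rate(records: list[UsageRecord]) -> float:
    """Cached read tokens divided by all cacheable tokens (reads plus writes)."""
    read = sum(r.cache_read_input_tokens for r in records)
    written = sum(r.cache_creation_input_tokens for r in records)
    cacheable = read + written
    return read / cacheable if cacheable else 0.0


# One cold write of a 12k-token prefix followed by nine warm reads.
log = [UsageRecord(0, 12_000)] + [UsageRecord(12_000, 0)] * 9

print(f"hit rate: {cache_hit_rate(log):.0%}")
```

Tracking this per endpoint or per prompt template, rather than globally, makes it easier to spot the one template whose prefix keeps changing.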

Pin your serialization. If you build tool definitions or JSON context programmatically, sort keys and freeze formatting so the bytes are identical run to run. A prompt that is semantically the same but byte-different gets zero cache benefit. This is the most common reason teams think caching "doesn't work" — it does, their bytes just keep changing.
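As a minimal sketch of pinned serialization, the helper below uses the standard library's `json.dumps` with sorted keys and fixed separators. The tool definitions are made up for illustration, and `canonical_json` is a hypothetical name.

```python
import json


def canonical_json(obj) -> str:
    """Serialize with sorted keys and fixed separators so equal data gives equal bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# The same tool definition built twice with different key insertion order.
tool_a = {"name": "lookup_order", "parameters": {"type": "object", "required": ["order_id"]}}
tool_b = {"parameters": {"required": ["order_id"], "type": "object"}, "name": "lookup_order"}

print(json.dumps(tool_a) == json.dumps(tool_b))          # False: byte-different
print(canonical_json(tool_a) == canonical_json(tool_b))  # True: byte-identical
```

Note that `sort_keys` only reorders object keys. List order is preserved, so if you build lists of tools from a set or an unordered query, sort them yourself before serializing.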

Takeaways

Prompt caching discounts and accelerates the stable prefix of your prompt; structure prompts most-stable-first, user input last.
Expect up to roughly a 90% cost reduction on cached tokens, depending on provider and model, and a large time-to-first-token improvement for long prefixes.
The first (cold) write costs a premium — caching only pays off under reuse within the cache TTL.
Pin your serialization (sorted keys, fixed formatting) so prefixes are byte-identical across requests.
Instrument cache_read vs cache_creation tokens and track hit rate as a first-class production metric.

