Prompt caching economics for coding agents: when it actually saves money

Q: Should I cache the whole context window?

No. The value of caching comes specifically from the stable, high-reuse portion of the input — not the churning tail. Trying to cache the full, ever-growing context typically pushes the cache boundary into territory that changes every turn, which means constant cache misses and write premiums you never recover. The right approach is to identify the fixed prefix (system prompt, project rules, reference files) and cache only that, letting the dynamic part — conversation history, tool results — flow through uncached.

Prompt caching is one of the more credible levers for cutting LLM input costs — but the ROI is conditional in ways that vendor marketing tends to understate. This post explains the economics precisely: when caching genuinely saves money, when it backfires, and what you need to measure before trusting any savings figure.

Deep dive · advanced · BYOK · self-hosted

How prompt caching actually works

Large language models are stateless: every request carries the full context from scratch. For a coding agent, that means every turn resends the system prompt, project rules, any pinned files, and the entire conversation history. As context grows, input costs compound — you pay for the same tokens repeatedly. This cumulative-input mechanism is the root cause of most large AI coding bills.

Prompt caching is a provider-level optimization that lets you mark a prefix of that input as cacheable. The first time it is processed, the provider stores it (cache write). On subsequent requests that include the same prefix, the provider retrieves it from cache instead of re-processing it (cache hit), and charges you a fraction of the normal input rate for those tokens.

The economic structure looks like this:

Cache write: typically higher than the standard input rate — you pay a premium to populate the cache.
Cache hit: a fraction of the standard input rate — the source of savings on every subsequent request.
Cache miss / uncached tokens: billed at the standard input rate, same as always.
TTL: cached prefixes expire after a provider-defined window (often a few minutes to a few hours). After expiry, the next request pays the cache-write premium again.

The key insight

Caching is not free — it is a bet. You pay a write premium upfront and collect savings on every cache hit. Whether the bet pays off depends entirely on how many hits you get before the prefix changes or the cache expires.

When caching genuinely saves money

The conditions for a positive ROI are specific. All three must hold:

1. A large, stable prefix

Caching only pays if the cached portion is large enough that the per-token savings on a hit materially exceed the overhead. A 200-token system prompt saves almost nothing even with perfect reuse. A 20,000-token prefix of project rules, CLAUDE.md, key reference files, and a detailed system prompt can recover a substantial share of that input cost per hit.

Crucially, the prefix must be stable. If it changes every turn — or even every few turns — the cache boundary keeps shifting, and new writes keep accumulating without the hits to pay for them.

2. High reuse within the TTL

A cached prefix that is hit once saves you a small fraction of one request. One that is hit twenty times in a session saves a meaningful share of the input cost for those twenty turns. The math requires a minimum number of hits to break even against the write premium — and more turns to show genuine net savings.

This is workload-dependent. A long coding session working in one codebase with a fixed system prompt is close to the ideal case. A short script-run or a session that constantly expands into new files is not.

3. Sessions long enough to amortize the write

If a session ends before the cache is hit enough times, the write premium goes unrecovered. Very short sessions — a single-turn lookup, a quick file edit — rarely benefit from caching. The win is structural in longer agentic tasks where many turns share the same stable context.

When caching backfires

Caching the whole context window

The most common mistake is naively caching the entire input, including the conversation history and recent tool results. That portion changes on every turn. Setting the cache boundary in the churning tail means the cache prefix is different on every request — constant cache writes, near-zero cache hits, and a net-positive write premium with nothing to show for it. You can end up paying more than the baseline uncached cost.

Low-reuse or short-lived prefixes

If the system prompt is small, the session is short, or the workload constantly pulls in new large files, caching provides little benefit and the write premium is a pure loss. The economics require reuse that simply is not there.

Assuming provider defaults are optimal

Some providers and clients auto-cache the longest possible prefix. That heuristic can accidentally cache deeply into the dynamic portion of context, creating the worst-case write-without-hits pattern described above. Auto-caching that you do not control is not necessarily optimal caching.

The compression/caching illusion

Prompt caching is often marketed alongside compression as though both reliably cut costs. They can — but neither is unconditional. The honest version: caching recovers input cost on the stable prefix under high reuse. Compression removes tokens before they enter context at all. Both require measurement, not assumption, to know whether they are helping in your workload.

The right mental model: cache the fixed part, not the full input

Think of your input as having two zones:

Fixed prefix: system prompt, project rules, static reference files, CLAUDE.md. This changes rarely or never within a session.
Dynamic tail: conversation history, tool results, new file reads, in-progress context. This grows and changes every turn.

Caching is economically sound only for the fixed prefix. The dynamic tail should flow through uncached. The boundary between them is not a setting most clients expose — it is a decision that requires understanding which part of the input is actually stable.

A gateway that understands your request structure can set that boundary correctly: cache control headers on the fixed prefix, nothing on the tail. That is not magic — it is just applying the economic logic correctly instead of leaving it to a naive heuristic.

What honest measurement looks like

To know whether prompt caching is saving you money, you need a baseline. That means measuring what you would have paid without caching (full input rate on all tokens) and comparing it against what you actually paid (write premium on cache writes, hit rate on hits, standard rate on misses). The difference — if positive — is the real saving.

A number without a baseline and stated conditions is not a savings figure: it is a marketing claim. The cost calculator can help you estimate the break-even point for your workload before you rely on caching to carry the bill.

Two things to watch in particular:

Cache hit rate: if this is low (say, under 50% in a long session), the math may not be working. Investigate whether the prefix is actually stable.
Write cost accumulation: if cache writes appear frequently without corresponding hits, you are paying the premium without collecting the dividend.

The $0 principle

A savings ledger that can show you a positive number should also be able to show you $0 — and should, whenever the conditions are not met. If a tool only ever shows savings and never shows zero, it is not measuring: it is estimating optimistically.

How merido approaches prompt caching

merido applies the economics above rather than assuming caching always helps. When it sets cache control, it targets the fixed prefix — system prompt, project rules, static content — and leaves the dynamic tail uncached, based on what it can observe about the request structure.

The savings ledger records what was actually saved relative to the uncached baseline for those tokens, turn by turn. When the conditions are not met — short session, no stable prefix, low hit rate — the ledger shows $0. merido uses your own API keys, runs self-hosted on your machine, and never pools, resells, or shares credentials.

See actual caching economics in your sessions

Open source, single self-hosted binary, on your own keys. Cache control set at the right boundary, savings measured against a real baseline.

Get started →Read the docs

Related guides

8 ways to reduce Claude Code costs — broader cost-reduction tactics including caching.
AI coding cost calculator — estimate what prompt caching is worth for your workload.
Why AI coding bills explode — the cumulative-input mechanism that caching is trying to solve.
Self-hosted LLM gateway — own your keys, routing, and cache boundary decisions.

Frequently asked questions

Does prompt caching actually save money?

Yes — under the right conditions. Caching is economically sound when a large, stable prefix (system prompt, project rules, pinned files) is reused across many turns within the cache TTL. The savings on each cache hit must outweigh the one-time cache-write premium. If reuse is low, the prefix changes constantly, or sessions are short, the numbers may not work in your favor. Measure against a baseline before treating any savings figure as reliable.

Why can prompt caching backfire?

Three reasons. First, most providers charge a cache-write premium (higher than the normal input rate) to store the prefix. If the cache is rarely hit, you pay that premium without recovering it. Second, if you try to cache a context window that changes every turn — tool outputs, new file reads, ongoing conversation — the cache boundary keeps shifting and hits never materialize. Third, short sessions expire before the cache TTL pays off. The math only works when the stable portion is genuinely large and the session is long enough to amortize the write cost.

Should I cache the whole context window?

No. The value of caching comes specifically from the stable, high-reuse portion of the input — not the churning tail. Trying to cache the full, ever-growing context typically pushes the cache boundary into territory that changes every turn, which means constant cache misses and write premiums you never recover. The right approach is to identify the fixed prefix (system prompt, project rules, reference files) and cache only that, letting the dynamic part — conversation history, tool results — flow through uncached.

How does merido decide what to cache?

merido sets the cache boundary at the stable prefix — the fixed portion that does not change across turns — and leaves the dynamic tail uncached. It then measures actual savings against a pre-caching baseline and records them in a signed ledger. If the measured saving is zero or the conditions are not met, the ledger shows $0. It uses your own API keys, runs self-hosted, and never pools or resells credentials.