Why your AI coding bill explodes: the cumulative-input tax explained

Your AI coding bill is not growing because model prices are rising — they keep falling. It is growing because of a structural property of how every agentic coding tool works: the model is stateless, so it must re-read the entire conversation on every single turn. That one fact turns a linear-looking workflow into a super-linear cost curve. This post explains the mechanism precisely so you know which levers actually move the needle.

The model does not remember — it re-reads

When Claude Code, Cursor, Cline or any other agentic coding tool sends a request to the LLM, it does not send just the new message. It sends the entire conversation from the beginning: your original instructions, every file read, every tool call, every tool result, every assistant response — everything, every time. The model has no persistent memory between API calls. Coherence is achieved by replaying the full context window on each turn.

This is not a bug or a shortcut; it is a fundamental property of how current large language models work. The consequence for your bill is significant: you are charged for reading that context on every turn, and the context grows with every turn.

The cumulative-input tax

Think about what happens over a realistic feature task:

Turn 1 — you describe the feature, the agent reads a couple of files. Input: roughly 5K tokens.
Turn 5 — the agent has read more files, run some searches, produced a plan and a draft. Input: 10–15K tokens.
Turn 15 — failed test runs, a refactor attempt, more file reads, compiler output. Input: 20–25K tokens.
Turn 30 — still working. Each individual request now carries 25–35K input tokens, even if you only typed “yes, continue”.

Every one of those 30 turns is charged separately. You are paying for turn 1’s context on turn 1, turn 1’s context again on turn 2, and again on turn 3. By turn 30, you have paid for turn 1’s tokens roughly 30 times. That is the cumulative-input tax.
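To make the arithmetic concrete, here is a minimal Python sketch. It assumes, purely for illustration, a context that starts at 5K tokens and grows by a flat 1K per turn; real sessions grow unevenly, and the function names are invented for this example.

```python
def linear_session(turns, start_tokens, added_per_turn):
    """Context size sent on each turn, assuming a fixed amount is added per turn."""
    return [start_tokens + added_per_turn * i for i in range(turns)]


def cumulative_input_tokens(context_per_turn):
    """Total input tokens billed: every turn pays for its whole context."""
    return sum(context_per_turn)


# Assumed shape: 5K tokens on turn 1, 1K added on each later turn.
session = linear_session(turns=30, start_tokens=5_000, added_per_turn=1_000)
total = cumulative_input_tokens(session)

print(f"context on the final turn: {session[-1]:,} tokens")   # 34,000
print(f"total input billed:        {total:,} tokens")         # 585,000
print(f"turn 1's tokens alone:     {session[0] * len(session):,} tokens")  # 150,000
```

The final request carries 34K tokens, but the session as a whole is billed for 585K input tokens, of which 150K is turn 1's original 5K being re-read 30 times.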

Rule of thumb

On a typical feature-length agentic session, input tokens account for the large majority of the total bill. Output is a small fraction: the model writes much less than it reads. Optimizing output tokens is not where the leverage is.

Why cost grows super-linearly

If each turn adds a fixed amount of context and you pay for the whole context on every turn, the cost of a session grows roughly with the square of its length — not linearly. A session twice as long does not cost twice as much; it costs closer to four times as much.

In practice the growth is somewhere between linear and quadratic, because context does not grow perfectly uniformly and every session starts with some fixed context. As long as the context keeps growing, though, the direction is super-linear: a session with twice as many turns costs more than twice as much.
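A short derivation shows where "between linear and quadratic" comes from. It rests on an idealized assumption that is not in any provider's documentation: turn t carries c_0 + a(t-1) input tokens, where c_0 is the starting context and a is the amount added per turn. Both symbols are introduced here for illustration.

```latex
\begin{align*}
C(T) &= \sum_{t=1}^{T} \bigl(c_0 + a\,(t-1)\bigr)
      = c_0\,T + \frac{a\,T\,(T-1)}{2}, \\[4pt]
\frac{C(2T)}{C(T)} &= \frac{4c_0 + 2a\,(2T-1)}{2c_0 + a\,(T-1)} .
\end{align*}
```

With a = 0 (context never grows) the ratio is exactly 2, which is the linear case. With a > 0 it is always above 2 and approaches 4 as T grows, which is the quadratic case. For example, c_0 = 5,000, a = 1,000 and T = 15 give a ratio of 3.25: the 30-turn session costs 3.25 times the 15-turn one.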

You can put your own session parameters into the AI coding cost calculator to see the curve for the model and turn count you actually use.

The noise that never leaves

Not all re-sent context is useful. By mid-session a large fraction of the context window is occupied by things the model technically has access to but that no longer affect the direction of the task:

Full file dumps that were read early but have since been edited, so the version in context is stale.
Long tool output — test logs, git history, directory listings — that provided signal once but is now just bulk.
Failed attempts — code the agent wrote and then abandoned — still sitting in context, being paid for every turn.
Verbose scaffolding from system prompts and CLAUDE.md files that are valid instructions but much longer than necessary.

None of this is the model’s fault. It cannot selectively forget. If it is in the context window, you pay for it on every subsequent turn, until the session ends or is cleared.

The useful mental model

Every token you add to the context in turn N gets charged again on turns N+1, N+2, … until the session ends. A 10K-token tool result added at turn 5 of a 30-turn session is re-sent on each of the 25 remaining turns, so you pay for it 26 times in total: once on turn 5 and 25 more times afterwards, or 260K input tokens for a single 10K result. That is the compounding.
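The same mental model fits in a few lines of Python. This is a sketch with a hypothetical helper name, and it assumes the block stays in context, unmodified, until the session ends.

```python
def billed_tokens_for_block(block_tokens, added_at_turn, total_turns):
    """Input tokens billed for one block that stays in context until the end.

    The block is sent on the turn it is added and on every later turn.
    """
    if not 1 <= added_at_turn <= total_turns:
        raise ValueError("added_at_turn must fall inside the session")
    times_sent = total_turns - added_at_turn + 1
    return block_tokens * times_sent


# The 10K tool result from the example: added at turn 5 of a 30-turn session.
print(f"{billed_tokens_for_block(10_000, added_at_turn=5, total_turns=30):,}")  # 260,000

# The same result arriving at turn 28 is far cheaper to carry.
print(f"{billed_tokens_for_block(10_000, added_at_turn=28, total_turns=30):,}")  # 30,000
```

The earlier a bulky result enters the context, the more turns it rides along for, which is why early exploration dumps are the most expensive kind of noise.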

The four levers that attack it

Understanding the mechanism makes it obvious which interventions matter and which are mostly cosmetic.

1. Context hygiene — the highest-leverage free action

The cheapest token is the one you never resend. Clearing the conversation between sub-tasks (/clear in Claude Code; starting a new chat in Cursor) resets the cumulative tax to zero. Turn 1 of the next sub-task pays only for what it actually needs. This is free, immediate, and has the highest return of any single action. The habit is: finish a logical chunk, clear, start fresh.

Scoping what the agent reads in the first place is the companion habit. A targeted file reference adds far fewer tokens than asking the agent to explore a directory and read everything relevant — and the latter fills context with files that may turn out to be irrelevant.
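A rough way to see what clearing buys you is to compare one long session with the same number of turns split into shorter ones. The sketch below assumes each fresh session starts again at 5K tokens and grows by 1K per turn. Both numbers are illustrative assumptions, and in practice a fresh session may need to re-read a file or two.

```python
def session_input_tokens(turns, start_tokens=5_000, added_per_turn=1_000):
    """Total input tokens billed for one session under linear context growth."""
    return sum(start_tokens + added_per_turn * i for i in range(turns))


one_long = session_input_tokens(30)          # 30 turns, never cleared
three_short = 3 * session_input_tokens(10)   # cleared after every 10 turns

saved = one_long - three_short
print(f"one 30-turn session:    {one_long:,} tokens")     # 585,000
print(f"three 10-turn sessions: {three_short:,} tokens")  # 285,000
print(f"input tokens avoided:   {saved:,} ({saved / one_long:.0%})")  # 300,000 (51%)
```

Under these assumptions the same 30 turns cost about half as many input tokens, even though each fresh session pays its 5K startup context again.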

2. Tool-output compression

Test output, git diffs, build logs and directory listings are the noisiest tool results. They are often large and highly repetitive, and most of their signal sits in a handful of lines. Compressing these results before they enter the context (losslessly, preserving the information the model needs) can reduce specific payloads by 60–90%, depending on the command. More importantly, it reduces the compounding: a smaller result means a smaller context on every following turn.

This is something a proxy layer can do transparently, before the result ever enters the context window.
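As a toy illustration of the idea, the sketch below collapses runs of identical consecutive lines into one line plus a repeat count, which keeps every distinct line and can be reversed. It is not merido's algorithm or any tool's actual behavior. The log content is invented, and character count stands in as a crude proxy for token count.

```python
from itertools import groupby


def collapse_repeats(text):
    """Collapse runs of identical consecutive lines into one line plus a count."""
    out = []
    for line, run in groupby(text.splitlines()):
        count = sum(1 for _ in run)
        out.append(line if count == 1 else f"{line}  [repeated x{count}]")
    return "\n".join(out)


# Invented test log: lots of identical noise around one line that matters.
log = "\n".join(
    ["collecting tests"]
    + ["PASSED test_ok"] * 40
    + ["FAILED test_auth::test_expired_token - AssertionError"]
    + ["PASSED test_ok"] * 20
)

compressed = collapse_repeats(log)
print(compressed)
print(f"{len(log)} chars -> {len(compressed)} chars")
```

The failing line survives untouched while the 60 identical PASSED lines shrink to two. Real logs are messier, so a practical compressor would need command-specific rules.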

3. Strategic prompt caching

Anthropic, OpenAI and Google all offer prompt caching: if the leading portion of a request was sent recently and matches exactly, you pay a reduced rate for those cached tokens (commonly 50–90% less, depending on provider and model) instead of full price. The catch is that the cached prefix must be stable and reused often. The whole ever-changing conversation is the wrong thing to cache; a fixed system prompt or a stable project-rules file is the right thing.

The gain is real when the conditions hold (stable prefix, high reuse) and negligible or negative when they do not: if the prefix changes or the cache entry expires, the request misses, you pay full price, and where the provider charges a premium for cache writes, that write was wasted. A gateway that manages the cache boundary, rather than leaving it to the coding tool’s heuristics, is where this becomes reliable.
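The "negligible or negative" case is easy to see with a small cost model. Everything numeric here is a placeholder assumption, not a quote from any provider's price list: cache writes are modeled at 1.25× the normal input price, cache reads at 0.1×, and the price per token is set to 1 so the results read as "full-price token equivalents".

```python
def prefix_cost(prefix_tokens, turns, hit_rate,
                price_per_token=1.0, write_multiplier=1.25, read_multiplier=0.1):
    """Return (uncached_cost, cached_cost) for a stable prefix over one session.

    The first turn always writes the cache. Each later turn is either a hit
    (billed at the read rate) or a miss (billed as a fresh write).
    """
    uncached = prefix_tokens * turns * price_per_token
    later_turns = turns - 1
    hits = later_turns * hit_rate
    misses = later_turns - hits
    cached = prefix_tokens * price_per_token * (
        write_multiplier * (1 + misses) + read_multiplier * hits
    )
    return uncached, cached


for rate in (1.0, 0.5, 0.0):
    uncached, cached = prefix_cost(prefix_tokens=4_000, turns=30, hit_rate=rate)
    print(f"hit rate {rate:.0%}: uncached {uncached:,.0f}, cached {cached:,.0f}")
```

With every later turn hitting the cache, the 4K prefix costs a small fraction of the uncached price. With no hits at all, caching costs more than not caching, because every turn pays the write premium and never collects the discount.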

4. Routing and budget caps

Not every turn needs the most expensive model. Mechanical tasks — reformatting, boilerplate, simple edits — carry the same cumulative-input tax but need less reasoning power. Routing those turns to a cheaper or faster model caps their cost without sacrificing the result. Combined with a hard per-session budget that either alerts or switches model tiers when approached, this turns a surprise at invoice time into a known variable.
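A budget cap with a downgrade threshold can be sketched in a few lines. This is a hypothetical illustration of the policy described above, not merido's implementation: the class name, the tier labels and the 80% threshold are all assumptions, and deciding whether a turn is "mechanical" is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class SessionBudget:
    """Tracks spend for one session against a hard cap (hypothetical helper)."""
    cap: float                  # hard limit, in whatever currency unit you bill in
    downgrade_at: float = 0.8   # fraction of the cap that triggers the cheaper tier
    spent: float = 0.0

    def record(self, cost: float) -> None:
        """Add the cost of a completed request to the running total."""
        self.spent += cost

    def choose_tier(self, is_mechanical: bool) -> str:
        """Return 'blocked', 'cheap' or 'premium' for the next request."""
        if self.spent >= self.cap:
            return "blocked"
        if is_mechanical or self.spent >= self.downgrade_at * self.cap:
            return "cheap"
        return "premium"


budget = SessionBudget(cap=10.0)
print(budget.choose_tier(is_mechanical=False))  # premium
print(budget.choose_tier(is_mechanical=True))   # cheap
budget.record(8.5)
print(budget.choose_tier(is_mechanical=False))  # cheap
budget.record(2.0)
print(budget.choose_tier(is_mechanical=False))  # blocked
```

The point of the design is that the decision happens before each request is sent, so the cap is enforced during the session rather than discovered on the invoice.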

What the bill actually looks like

The exact numbers depend on the model, the session length, and how much tool output the task generates — and token prices change over time, so any specific dollar figure here would be out of date quickly. The mechanism, however, is stable: longer sessions, more tool calls, and more file reads multiply the input bill through the compounding effect described above.

The important implication is that two tasks of the same apparent “complexity” can have very different costs depending on how the agent approaches them. A task solved in 10 focused turns costs much less than the same task solved in 40 wandering turns, even if the final output is identical. Session hygiene and tool-output discipline are the difference.

Honest baseline

Any claim about savings should come with a measured baseline and stated conditions. Compression ratios depend on the command; caching gains depend on prefix stability and reuse. An honest savings meter shows $0 when it cannot prove a saving — not an invented number.
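The "show $0 when you cannot prove it" rule is simple to state in code. The function below is an illustrative sketch with invented names, not a description of any real ledger: a saving is only reported when a measured baseline exists, and it is never allowed to go negative or be estimated.

```python
from typing import Optional


def provable_saving(baseline_cost: Optional[float], actual_cost: float) -> float:
    """Saving against a measured baseline, or 0.0 when none can be shown."""
    if baseline_cost is None:
        # No measured baseline for this request: claim nothing.
        return 0.0
    return max(0.0, baseline_cost - actual_cost)


print(provable_saving(baseline_cost=0.5, actual_cost=0.125))   # 0.375
print(provable_saving(baseline_cost=None, actual_cost=0.125))  # 0.0
print(provable_saving(baseline_cost=0.25, actual_cost=0.5))    # 0.0
```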

How merido makes the mechanism visible and controllable

The four levers above range from free habits (context hygiene) to things you cannot do by hand on every request (compressing every tool result, managing cache boundaries, routing by cost per turn). merido is an open-source, self-hosted Rust gateway that sits between your coding CLI and your LLM providers and handles the mechanical parts:

Tool-output compression intercepts bulky tool results before they enter the context window and compresses them losslessly, by 60–90% on specific command output depending on the command. That shrinks both the per-turn payload and the compounding across future turns.
Cost-, latency- and quota-aware routing spreads requests across every provider and account you own, matching model tier to task complexity with automatic failover.
Strategic cache-boundary management sets the cache boundary on the stable portion of the request so caching helps rather than backfires.
A live burn-rate meter shows what the session is spending in real time, not as a surprise at the end of the month.
Per-session budget caps stop runaway sessions where you set the line, with optional model downgrade as you approach it.
A measured savings ledger records compression and routing savings against a verified baseline and shows $0 when it cannot prove one.

ToS-clean by design

merido uses your own API keys, runs entirely self-hosted, and never pools, shares or resells credentials. It is bring-your-own-key — your billing relationship stays directly with Anthropic, OpenAI, or whichever provider you use. No key escrow, no shared accounts, no ToS risk.

It speaks the OpenAI-compatible API, so Claude Code, Codex, Cursor, Cline and Continue all work as first-class clients. You point your tool at one local endpoint and keep working exactly as before — just with the cumulative-input tax visible and the mechanical levers working on every request.

See the curve. Control the spend.

Open source, single self-hosted binary, on your own keys. Understand exactly what your sessions cost and where the compounding happens.

Related guides

AI coding cost calculator — plug in your session length and see the super-linear curve for your own numbers.
How to reduce Claude Code costs — eight concrete tactics that attack the cumulative-input tax.
How to reduce Cursor AI costs — the same mechanism, specific to Cursor.
How to reduce Cline costs — the same mechanism, specific to Cline.

Frequently asked questions

Why does the bill grow faster than my usage?

Because cost is dominated by cumulative input tokens, not output. Every new turn re-sends the entire conversation so the model can stay coherent — by turn 30 each individual request may carry 25–35K input tokens even if you only typed two sentences. Cost compounds with session length, not just with the number of tasks you complete.

Is switching to a cheaper model the fix?

It helps at the margin, but it does not fix the underlying mechanism. A 50% cheaper model still experiences the same super-linear growth in input tokens. The durable fix is reducing how many tokens get resent each turn, through context hygiene, tool-output compression and strategic caching, and then using model routing on top of that.

What is the single biggest lever?

Shrinking cumulative input: clear the context between sub-tasks, compress bulky tool output before it enters and compounds across turns, and cache any stable prefix so the fixed part is billed at the reduced cached rate instead of full price on every turn. These attack the mechanism directly. Budget caps then stop runaway sessions before they become bill shock.

Does merido really show $0 savings sometimes?

Yes, and that is by design. The savings ledger shows $0 when it cannot prove a saving against a measured baseline — not when savings are small, but when they genuinely are not attributable. An honest meter that sometimes reads zero is worth more than an optimistic one that always shows a big number.