How to reduce Claude Code costs: 8 ways that actually work

If your Claude Code bill keeps climbing, it is not because the model got pricier — token prices keep falling. It is because of how agentic coding tools spend tokens. Here is where the money goes and eight concrete ways to cut it, from things you can do today to things a gateway like merido does for you automatically.


Why Claude Code gets expensive

Claude Code, Cursor, Cline and Copilot agents all share one cost mechanism: every turn resends the entire conversation. The model is stateless, so to keep going it has to re-read everything that came before. That makes cost dominated by cumulative input tokens, not output.

The first turn might send ~5K input tokens. By turn 30, each request can carry 25–35K input tokens — and you pay for it on every request. A feature task can accumulate millions of input tokens over a session, and input typically accounts for the large majority of the bill. Worse, the context fills with things you no longer need: full file dumps, long tool output, and failed attempts that are still being paid for, turn after turn.
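The arithmetic behind those figures can be sketched in a few lines. This is a minimal illustration, assuming per-request input grows linearly from 5K to 30K tokens over 30 turns; real sessions grow unevenly, and the function name is made up for this example.

```python
def cumulative_input(first_turn: int, last_turn: int, turns: int) -> int:
    """Total input tokens sent over a session, assuming the per-request
    input grows linearly from first_turn to last_turn (an assumption)."""
    # Sum of an arithmetic series: turns * (first + last) / 2
    return turns * (first_turn + last_turn) // 2


session_total = cumulative_input(5_000, 30_000, 30)
final_request = 30_000

print(session_total)                  # 525000
print(session_total / final_request)  # 17.5
```

Under these assumptions the session pays for 17.5 times the size of its final context, which is why cumulative input, not the size of any single request, drives the bill.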

The lever

Because the bill is driven by re-sent input, the durable way to cut it is to shrink and cache the input that gets resent — not just to switch to a cheaper model. Keep that in mind as you read the tactics below: the high-leverage ones all target cumulative input.

8 ways to reduce your Claude Code bill

01 Keep context clean: `/clear` and compact often

The cheapest token is the one you never resend. When you finish a sub-task, clear the conversation with `/clear` or compact it with `/compact`. Starting the next task in a fresh context puts each request back at roughly 5K input tokens instead of dragging 30K of stale history into every turn. This is the single most impactful habit, and it costs nothing.
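To see what a reset is worth, here is a rough comparison of one 30-turn session against two 15-turn sessions separated by a clear. The 5K starting context and the 1K tokens added per turn are illustrative assumptions, not measurements.

```python
def session_input(turns: int, base: int = 5_000, growth: int = 1_000) -> int:
    """Total input tokens for one session, assuming every turn adds
    `growth` tokens of history on top of a `base` starting context."""
    return sum(base + growth * i for i in range(turns))


one_long = session_input(30)        # 30 turns without clearing
two_short = 2 * session_input(15)   # same 30 turns, cleared halfway
saved = one_long - two_short

print(one_long, two_short, saved)   # 585000 360000 225000
print(f"{saved / one_long:.0%}")    # 38%
```

The work is the same 30 turns; the only difference is that the second half no longer resends the first half's history.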

02 Scope what the agent reads

Pointing the agent at a whole directory or pasting large files inflates every subsequent turn. Reference specific files and functions, and prefer targeted searches over dumping entire trees into context.

03 Use a cheaper model for simple work

Not every step needs the flagship model. Renaming, boilerplate, simple edits and routine Q&A can run on a smaller, cheaper model; escalate only the hard reasoning to the expensive one. This is task-level routing: match the model to the difficulty of the step.
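Task-level routing can be as simple as a lookup. The sketch below is an illustration only: the task labels and model names are hypothetical placeholders, and deciding which label a step deserves is the hard part that this example leaves out.

```python
# Placeholder tier names, not real model identifiers.
CHEAP_MODEL = "small-model"
FLAGSHIP_MODEL = "flagship-model"

# Hypothetical labels for steps that rarely need deep reasoning.
SIMPLE_TASKS = {"rename", "boilerplate", "simple_edit", "routine_qa"}


def pick_model(task_kind: str) -> str:
    """Route simple steps to the cheap tier and everything else,
    including unknown task kinds, to the flagship tier."""
    return CHEAP_MODEL if task_kind in SIMPLE_TASKS else FLAGSHIP_MODEL


print(pick_model("rename"))           # small-model
print(pick_model("debug_race_cond"))  # flagship-model
```

Defaulting unknown work to the expensive tier trades some cost for safety; a stricter setup could do the opposite and escalate only on failure.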

04 Turn on prompt caching, strategically

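If a stable prefix (system prompt, project rules, key files) is reused across many turns, caching it lets you pay roughly full price once (some providers add a small surcharge for the cache write) and a small fraction on each later read. The caveat: caching the whole, constantly changing context can backfire, because when the cached span changes between turns, each request misses the cache and pays to write it again instead of reading it at the discount. The win comes from caching the fixed part with high reuse, which is exactly where a gateway can decide the boundary for you.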
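A back-of-the-envelope model shows both the win and the backfire. All numbers here are assumptions for illustration: a 10K-token prefix, a $3 per million input tokens price, a 1.25x multiplier for cache writes and a 0.10x multiplier for cache reads. Check your provider's current pricing before relying on any of them.

```python
def uncached_cost(prefix_tokens: int, turns: int, price_per_mtok: float) -> float:
    """Cost of resending the prefix at full price on every turn."""
    return prefix_tokens * turns * price_per_mtok / 1_000_000


def cached_cost(prefix_tokens: int, turns: int, price_per_mtok: float,
                cache_hits: int, write_mult: float = 1.25,
                read_mult: float = 0.10) -> float:
    """Cost with caching enabled. Turns that hit the cache pay read_mult;
    every other turn pays write_mult (both multipliers are assumed)."""
    per_turn = prefix_tokens * price_per_mtok / 1_000_000
    writes = turns - cache_hits
    return per_turn * (writes * write_mult + cache_hits * read_mult)


no_cache = uncached_cost(10_000, 30, 3.0)
stable = cached_cost(10_000, 30, 3.0, cache_hits=29)  # one write, 29 reads
churning = cached_cost(10_000, 30, 3.0, cache_hits=0)  # prefix changes every turn

print(f"{no_cache:.4f} {stable:.4f} {churning:.4f}")  # 0.9000 0.1245 1.1250
```

With a stable prefix the cached cost is a fraction of the uncached one; when the cached span changes every turn, the write surcharge makes caching more expensive than not caching at all.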

05 Compress bulky tool output

Tool results — test logs, git output, file reads, build errors — are often huge and then get resent on every following turn. Compressing that output before it enters the context removes a major source of cumulative input. Depending on the command, these specific payloads can shrink dramatically without losing the information the model needs.
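As a rough idea of what such compression can look like, the sketch below collapses runs of identical lines and keeps only the head and tail of long output. It is a generic, lossy illustration written for this guide, not a description of how any particular tool, merido included, compresses output.

```python
def compress_tool_output(text: str, head: int = 20, tail: int = 20) -> str:
    """Shrink tool output by collapsing repeated lines, then trimming
    the middle of anything still longer than head + tail lines."""
    # Step 1: collapse runs of identical consecutive lines.
    runs: list[list] = []
    for line in text.splitlines():
        if runs and runs[-1][0] == line:
            runs[-1][1] += 1
        else:
            runs.append([line, 1])
    out = [line if count == 1 else f"{line} [repeated {count}x]"
           for line, count in runs]

    # Step 2: keep the first `head` and last `tail` lines.
    if len(out) > head + tail:
        omitted = len(out) - head - tail
        out = (out[:head]
               + [f"... [{omitted} lines omitted] ..."]
               + out[len(out) - tail:])
    return "\n".join(out)


log = "\n".join(["ok"] * 500 + ["FAILED test_login"])
print(compress_tool_output(log))
# ok [repeated 500x]
# FAILED test_login
```

Keeping the tail matters for test and build logs, where the failure summary usually sits at the end.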

06 Cap a budget per session

Visibility without a brake still leads to bill shock. A hard per-session spend cap — optionally auto-downgrading the model as you approach it — turns “how much did that cost?” into a number you set in advance.
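A per-session cap with auto-downgrade needs very little state. The class below is a minimal sketch under assumed names and thresholds (a $5 cap and a downgrade at 80% of it); it is not taken from any specific tool.

```python
from typing import Optional


class SessionBudget:
    """Tracks spend for one session against a hard cap (hypothetical API)."""

    def __init__(self, cap_usd: float, downgrade_at: float = 0.8) -> None:
        self.cap_usd = cap_usd
        self.downgrade_at = downgrade_at  # fraction of the cap that triggers the cheaper model
        self.spent_usd = 0.0

    def record(self, cost_usd: float) -> None:
        """Add the measured cost of a completed request."""
        self.spent_usd += cost_usd

    def choose_model(self, preferred: str, cheaper: str) -> Optional[str]:
        """Return the model to use next, or None once the cap is reached."""
        if self.spent_usd >= self.cap_usd:
            return None
        if self.spent_usd >= self.cap_usd * self.downgrade_at:
            return cheaper
        return preferred


budget = SessionBudget(cap_usd=5.0)
first_choice = budget.choose_model("flagship-model", "small-model")
budget.record(4.5)   # past 80% of the cap
second_choice = budget.choose_model("flagship-model", "small-model")
budget.record(1.0)   # past the cap
third_choice = budget.choose_model("flagship-model", "small-model")
```

Here `first_choice` is the preferred model, `second_choice` is the cheaper one, and `third_choice` is `None`, which the caller treats as "stop sending requests".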

07 Spread load across the keys you already own

Many developers pay for more than one provider or sit on unused free-tier quota. Routing requests across every account you own, by cost and latency, with failover when one is rate-limited or down, uses capacity you are already paying for instead of burning the most expensive option every time.
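Cost- and latency-aware routing with failover can be sketched as "rank the accounts, try them in order". Everything below is hypothetical: the account names, prices, the latency weight, and the `send` callback standing in for a real provider call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Account:
    name: str
    cost_per_mtok: float  # assumed input price in USD per million tokens
    latency_ms: float     # assumed typical latency


def rank_accounts(accounts: list[Account], latency_weight: float = 0.001) -> list[Account]:
    """Cheapest first, with latency as a weighted tie-breaker."""
    return sorted(accounts, key=lambda a: a.cost_per_mtok + latency_weight * a.latency_ms)


def send_with_failover(accounts: list[Account],
                       send: Callable[[Account], str]) -> tuple[str, str]:
    """Try accounts in ranked order; move on when one raises RuntimeError."""
    last_error = None
    for account in rank_accounts(accounts):
        try:
            return account.name, send(account)
        except RuntimeError as err:  # e.g. rate-limited or down
            last_error = err
    raise RuntimeError("all accounts failed") from last_error


def fake_send(account: Account) -> str:
    """Stand-in for a provider call; the free tier is rate-limited here."""
    if account.name == "free-tier":
        raise RuntimeError("rate limited")
    return "ok"


accounts = [
    Account("provider-a", cost_per_mtok=3.0, latency_ms=400),
    Account("free-tier", cost_per_mtok=0.0, latency_ms=900),
    Account("provider-b", cost_per_mtok=1.0, latency_ms=300),
]
used, reply = send_with_failover(accounts, fake_send)
print(used, reply)  # provider-b ok
```

The free tier is tried first because it ranks cheapest; when it fails, the request falls through to the next-cheapest account rather than the most expensive one.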

08 Measure before you trust a number

Plenty of tools advertise eye-catching savings percentages. Treat any number without a baseline and conditions with suspicion — including your own. The honest way to know you are saving money is a ledger that compares against measured spend, shows the conditions, and reports $0 when it cannot prove a saving. Optimize what you can measure.
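The "report $0 when it cannot prove a saving" rule is easy to encode. This is a minimal sketch with hypothetical field names, assuming the baseline is measured spend for the same work without the optimization.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LedgerEntry:
    label: str
    actual_usd: float
    baseline_usd: Optional[float] = None  # measured, never estimated; None if unknown
    conditions: str = ""                  # e.g. model, task, date range

    def saving_usd(self) -> float:
        """Claim a saving only when a measured baseline exists
        and the actual spend came in below it."""
        if self.baseline_usd is None:
            return 0.0
        return max(0.0, self.baseline_usd - self.actual_usd)


proven = LedgerEntry("refactor task", actual_usd=1.5, baseline_usd=2.0,
                     conditions="same task, same model, caching off vs on")
unproven = LedgerEntry("bug hunt", actual_usd=0.8)

print(proven.saving_usd())    # 0.5
print(unproven.saving_usd())  # 0.0
```

Clamping at zero keeps the ledger from reporting a negative "saving" as a win; a fuller version would record the overspend separately.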

Let a gateway do the tedious parts

Tactics 1–3 are habits you can adopt today. Tactics 4–8, along with the routing side of tactic 3, are the kind of work you do not want to do by hand on every request, and that is what merido is for. merido is an open-source, local-first AI gateway written in Rust that sits between your coding CLI and your LLM providers:

Tool-output compression shrinks bulky results before they enter context, without losing the information the model needs (tactic 5).
Cost-, quota- and latency-aware routing spreads requests across every provider and account you own, with automatic failover (tactics 3 & 7).
Strategic prompt-cache control manages the cache boundary so caching helps instead of backfiring (tactic 4).
A live burn-rate meter and per-session budget caps show what you are spending and stop it where you set the line (tactics 6 & 8).
A savings ledger records measured savings against a baseline — and shows nothing when it cannot prove one (tactic 8).

ToS-clean by design

merido uses your own API keys, runs self-hosted, and never pools, shares or resells credentials. It is bring-your-own-key, your billing, your machine — the compliant way to put a gateway in front of Claude Code.

It speaks an OpenAI-compatible API and supports Claude Code, Codex, Cursor, Cline and Continue as first-class clients, so you point your CLI at one endpoint and keep working exactly as before — just cheaper and with the bill in plain sight.

See, cap, and prove your AI coding spend

Open source, single self-hosted binary, on your own keys. Get started in a couple of minutes.


Related guides

AI coding cost calculator — estimate your bill and see the cumulative-input tax.
How to reduce Cursor AI costs — the same mechanism for Cursor.
Claude Code vs Cursor pricing — a price-free comparison.
Self-hosted LLM gateway — own your keys, data, and routing.
Why your AI coding bill explodes — the cumulative-input tax, explained.
Prompt caching economics — when caching actually saves money.

Frequently asked questions

Why is Claude Code so expensive?

Because every turn resends the whole conversation, so cost is driven by cumulative input tokens. Long sessions pay for the same context repeatedly, and file dumps, long tool output and failed attempts pile up. By turn 30 each request can carry 25–35K input tokens.

What is the single biggest lever to cut the bill?

Shrinking and caching the input that gets resent each turn: context hygiene, tool-output compression, and strategic prompt caching of stable prefixes. Switching to a cheaper model helps for simple steps, but the durable win is reducing cumulative input.

Does prompt caching actually reduce cost?

Yes, when there is a stable prefix reused across many turns. Caching the whole, ever-changing context can backfire, so the gain depends on a fixed prefix and high reuse — which is where a gateway can set the boundary for you.

Can merido lower my bill automatically?

merido applies these tactics for you: compressing tool output, routing across the accounts you own, showing a live burn-rate meter, and recording measured savings. It uses your own keys and runs self-hosted, never pooling or reselling them.
