How to monitor and control AI coding costs: a FinOps playbook for engineering teams

AI coding tools have moved from experiment to infrastructure. With that shift comes a bill that is growing faster than anyone forecast — and, for most engineering teams, a near-total lack of visibility into where it is going. This is a FinOps playbook for the engineering lead who needs to see the spend, cap it, prove any savings honestly, and eventually attribute costs to teams and projects.


The real cost driver (in one paragraph)

AI coding tools like Claude Code and Cursor are agentic: they send the full conversation context, including every tool result, every file read, and every test log, back to the model on every turn. The bill is therefore dominated by cumulative input tokens, not output. A long session with a chatty tool pipeline can cost 10× more than a naive per-query estimate suggests, because the context window grows and is billed again on every round-trip. Understanding that mechanism is the prerequisite for everything else in this playbook; see “Why your AI coding bill explodes” for the full breakdown. The tactics below are most effective when you know which lever attacks which part of that cost curve.
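To make the mechanism concrete, here is a minimal sketch of the arithmetic. It assumes no prompt caching and a constant amount of new context per turn; the function name and the token counts are illustrative, not measurements.

```python
def cumulative_input_tokens(base_context: int, added_per_turn: int, turns: int) -> int:
    """Total input tokens billed when the full context is resent on every turn.

    Assumption: no prompt caching, and a constant number of new tokens
    (tool results, file reads, replies) appended after each turn.
    """
    total = 0
    context = base_context
    for _ in range(turns):
        total += context            # the whole context is billed as input this turn
        context += added_per_turn   # the context grows before the next round-trip
    return total


turns = 20
naive = 2_000 * turns  # per-query estimate: every turn costs only the starting context
actual = cumulative_input_tokens(base_context=2_000, added_per_turn=2_000, turns=turns)
print(naive, actual, actual / naive)  # 40000 420000 10.5
```

With these made-up numbers, a 20-turn session bills roughly ten times the naive estimate, and the gap widens as sessions get longer because the total grows quadratically with the number of turns.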

Phase 1: SEE IT — spend visibility and showback

You cannot manage what you cannot see. The first goal is not to cut anything — it is to get clean, per-session, per-developer, per-model numbers in front of the people who make decisions.

01 Instrument at the gateway, not the client

The only place you get full fidelity — tokens-in, tokens-out, model, latency, cost estimate — without modifying every developer’s tooling is a proxy that sits between the coding CLI and the upstream provider. If developers call providers directly, you have no aggregate view and no ability to enforce anything later. A self-hosted LLM gateway is the natural instrumentation point: every request passes through it, so every request is recorded.

What to record per request: timestamp, developer/session identifier, model requested, model actually used (they may differ after routing), input tokens, output tokens, cost at the provider’s published rate, and which routing or optimization levers fired. That last column is what makes a savings ledger possible later.
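One possible shape for that per-request record is sketched below. The class, the field names, and the helper are hypothetical, and the rates passed to the helper are placeholders you would replace with your provider's published prices.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RequestRecord:
    """One row per proxied request. All field names are illustrative."""
    timestamp: datetime
    developer_id: str
    session_id: str
    model_requested: str
    model_used: str                      # may differ from model_requested after routing
    input_tokens: int
    output_tokens: int
    cost_usd: float                      # computed at the provider's published rate
    levers_fired: tuple[str, ...] = ()   # e.g. ("compression",); feeds the savings ledger


def cost_at_published_rate(input_tokens: int, output_tokens: int,
                           usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Cost in USD from per-million-token rates supplied by the caller."""
    return (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000


# Placeholder rates and model names, for illustration only.
record = RequestRecord(
    timestamp=datetime.now(timezone.utc),
    developer_id="dev-42",
    session_id="sess-001",
    model_requested="frontier-model",
    model_used="mid-tier-model",
    input_tokens=120_000,
    output_tokens=2_000,
    cost_usd=cost_at_published_rate(120_000, 2_000, usd_per_mtok_in=3.0, usd_per_mtok_out=15.0),
    levers_fired=("downgrade",),
)
```

Keeping `model_requested` and `model_used` as separate fields is what lets you later tell a routing decision apart from a developer's own model choice.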

02 Build a live burn-rate view

A monthly report is not enough — by the time it arrives, the expensive session is two weeks old. What you need is a near-real-time burn-rate view: cost-per-hour, rolling 24-hour spend by developer, and an alert when a single session crosses a configurable threshold (say, more than 10× the median session cost). Spikes are almost always agentic pipelines running without a human in the loop — a long autonomous task, a test-runner loop, or a scaffolding job that hit an error and retried. Seeing the spike within minutes means you can intervene before it becomes a line item.
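The two checks described above can be sketched in a few lines. This is a simplified, in-memory illustration; the function names and the event tuple layout are assumptions, and a real deployment would compute these over the gateway's request log.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def rolling_spend_by_developer(events, now, window=timedelta(hours=24)):
    """Sum cost per developer over the trailing window.

    `events` is an iterable of (timestamp, developer_id, cost_usd) tuples.
    """
    totals = defaultdict(float)
    for timestamp, developer_id, cost_usd in events:
        if now - window <= timestamp <= now:
            totals[developer_id] += cost_usd
    return dict(totals)


def sessions_over_threshold(session_costs, multiplier=10.0):
    """Return ids of sessions costing more than `multiplier` times the median session."""
    if not session_costs:
        return []
    threshold = multiplier * median(session_costs.values())
    return sorted(sid for sid, cost in session_costs.items() if cost > threshold)


now = datetime(2025, 1, 2, 12, 0)
events = [
    (datetime(2025, 1, 2, 11, 0), "ana", 2.0),
    (datetime(2025, 1, 1, 11, 0), "ana", 5.0),   # 25 hours old, outside the window
    (datetime(2025, 1, 2, 1, 0), "bo", 1.5),
]
print(rolling_spend_by_developer(events, now))
print(sessions_over_threshold({"a": 1.0, "b": 1.2, "c": 0.8, "d": 15.0}))
```

Using the median rather than the mean keeps the threshold stable when one runaway session is already in the data.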

03 Showback before chargeback

Showback means every team can see their own spend. The bill is still paid from one place, but the data is visible. This is the right first step — it changes behavior on its own, because developers who can see their session cost in real time make different decisions than developers flying blind.

Chargeback — where each team’s budget is actually debited — is the harder governance step. It requires trusted attribution, internal cost-allocation structures, and usually an executive decision to change how AI spend is treated in the budget. Start with showback. Once attribution data is clean and trusted for a quarter or two, chargeback becomes a reporting change, not a systems change.

Showback first

Publish per-developer and per-team spend weekly, even before you have caps or optimization in place. The visibility alone reliably changes behavior — not because anyone is being surveilled, but because developers who can see the meter running make different session choices.

Phase 2: CAP IT — budgets, guardrails, and enforcement

Visibility tells you where the money is going. Caps are how you keep it from going there indefinitely. The goal is not to block developers — it is to have guardrails that fire before spend becomes a problem, and to make those guardrails transparent enough that developers can work with them rather than around them.

04 Set per-session and per-day token budgets

A daily budget per developer is the right granularity to start. Use the AI coding cost calculator to estimate reasonable baselines from your team’s actual usage before you set the cap — a budget that is too tight will generate friction immediately, and you will spend the first month managing exceptions instead of managing costs.

Per-session caps are a complement, not a substitute. A developer can stay under a daily budget while still running one runaway agentic session that consumes most of it. A per-session cap (say, 500K input tokens) catches those spikes without constraining normal usage.
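A minimal sketch of checking both limits before a request is forwarded follows. The `Budget` class and `check_budget` function are hypothetical; the 500K session figure is the example from the text, and the daily figure is a made-up placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    daily_tokens: int     # per-developer, per-day input-token budget
    session_tokens: int   # per-session input-token cap


def check_budget(budget: Budget, used_today: int, used_in_session: int,
                 request_tokens: int) -> str:
    """Return 'ok', 'session_cap', or 'daily_cap' for the incoming request."""
    # Check the session first: it is the tighter limit and catches runaway sessions.
    if used_in_session + request_tokens > budget.session_tokens:
        return "session_cap"
    if used_today + request_tokens > budget.daily_tokens:
        return "daily_cap"
    return "ok"


budget = Budget(daily_tokens=2_000_000, session_tokens=500_000)
print(check_budget(budget, used_today=600_000, used_in_session=480_000, request_tokens=30_000))
```

The example request is well inside the daily budget but still trips the session cap, which is the case a daily budget alone would miss.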

05 Downgrade before you block

A hard block when a budget is hit is a last resort. The better pattern is a soft cap that routes to a cheaper model tier when spend approaches the limit. If a developer is near their daily budget, route subsequent requests from a frontier model (expensive, high-context) to a mid-tier model (cheaper, still capable). Many routine coding tasks are not very sensitive to model tier, so the developer can keep working.

Make the downgrade visible, not silent. A silent degradation that the developer only notices when quality drops is worse than an explicit notification: “You are on the economical model tier today (budget threshold reached). Request a budget increase or wait until tomorrow.”
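The soft-cap pattern might look like the sketch below. The 80% soft threshold, the model names, and the function itself are assumptions chosen for illustration; the point is that the routing decision and the developer-facing notice are produced together.

```python
from typing import Optional

DOWNGRADE_NOTICE = (
    "You are on the economical model tier today (budget threshold reached). "
    "Request a budget increase or wait until tomorrow."
)


def route_request(requested_model: str, spent_today: float, daily_budget: float,
                  soft_ratio: float = 0.8,
                  fallback_model: str = "mid-tier-model") -> tuple[Optional[str], Optional[str]]:
    """Return (model_to_use, notice). A model of None means the request is blocked."""
    if spent_today >= daily_budget:
        # Last resort: the budget is fully spent.
        return None, "Daily budget exhausted. Request a budget increase or wait until tomorrow."
    if spent_today >= soft_ratio * daily_budget:
        # Soft cap: keep the developer working on a cheaper tier, and say so.
        return fallback_model, DOWNGRADE_NOTICE
    return requested_model, None


print(route_request("frontier-model", spent_today=17.0, daily_budget=20.0))
```

Returning the notice alongside the model makes it hard to ship a silent downgrade by accident: whatever applies the routing decision also has the message to show.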

06 Special rules for automated agents

Agentic pipelines — scaffolding jobs, CI-integrated code review, test generation, automated refactoring — are the highest-cost workloads and the least supervised. They need tighter caps than interactive sessions, not looser ones. A hard stop at a lower threshold (say, 200K tokens) is appropriate here because an automated agent that is going to consume 2M tokens is almost certainly stuck in a loop or hitting an error condition it cannot resolve — stopping it early saves money and surfaces the bug.
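As a rough illustration of the asymmetry, the sketch below applies a hard stop to automated agents and a downgrade to interactive sessions. The thresholds are the example values from the text; the workload labels and function name are hypothetical.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    DOWNGRADE = "downgrade"
    STOP = "stop"


INTERACTIVE_SESSION_CAP = 500_000   # example value from the text
AGENT_SESSION_CAP = 200_000         # example value from the text; tighter on purpose


def enforce_session_cap(workload: str, session_tokens: int) -> Action:
    """Pick an enforcement action based on who (or what) is driving the session."""
    if workload == "agent":
        # No human in the loop: stop early and surface the likely bug.
        return Action.STOP if session_tokens >= AGENT_SESSION_CAP else Action.ALLOW
    # A human is present: degrade gracefully instead of blocking.
    return Action.DOWNGRADE if session_tokens >= INTERACTIVE_SESSION_CAP else Action.ALLOW


print(enforce_session_cap("agent", 250_000), enforce_session_cap("interactive", 250_000))
```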

Phase 3: PROVE IT — a measured savings ledger

Once you have visibility and caps in place, pressure to add optimization levers will come quickly. That pressure brings a risk: optimization tools that report impressive savings that are impossible to verify. An honest savings ledger is harder to build than a vanity dashboard, but it is the only thing that will survive a CFO’s scrutiny.

07 Measure a baseline first

You cannot prove a saving without a counterfactual. Before turning on any optimization lever, record: median input tokens per session, median output tokens, model mix, and total cost per developer-day. This is your baseline. When you turn on a lever — tool-output compression, model downgrade, prompt caching — the saving is the measured delta against that baseline for equivalent sessions, attributed to exactly that lever.

“Equivalent sessions” is the hard part. If developers happen to be running shorter tasks the week you turn on compression, the cost drop is not the compression. A robust baseline measures enough sessions (several weeks of representative work) and controls for session length before declaring a saving.
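One simple way to control for session length is to compare medians only within buckets of similar length. The sketch below uses turn count as the length measure and a bucket size of 10 turns; both are assumptions, and a real analysis would also require a minimum number of sessions per bucket.

```python
from collections import defaultdict
from statistics import median


def median_cost_by_bucket(sessions, bucket_size=10):
    """Group (turns, cost_usd) pairs by session length and take the median cost per bucket."""
    buckets = defaultdict(list)
    for turns, cost_usd in sessions:
        buckets[(turns // bucket_size) * bucket_size].append(cost_usd)
    return {start: median(costs) for start, costs in sorted(buckets.items())}


def saving_by_bucket(baseline_sessions, treated_sessions, bucket_size=10):
    """Median cost delta per length bucket, only where both periods have data."""
    before = median_cost_by_bucket(baseline_sessions, bucket_size)
    after = median_cost_by_bucket(treated_sessions, bucket_size)
    return {start: before[start] - after[start] for start in before if start in after}


baseline = [(5, 1.0), (8, 3.0), (25, 10.0)]
treated = [(6, 1.5), (22, 7.0), (40, 2.0)]
print(saving_by_bucket(baseline, treated))  # {0: 0.5, 20: 3.0}
```

The 40-turn treated session is left out because the baseline has nothing comparable, which is the behavior you want: no counterfactual, no claimed saving.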

08 One saving per lever — no double-counting

If tool-output compression fires on a request that also gets a prompt cache hit, you must attribute the saving to one lever or split it carefully — you cannot credit the full counterfactual cost to both. Double-counting is the most common way savings ledgers inflate their numbers, and it is usually invisible unless you look at the per-request attribution data. A gateway that records which levers fired per request makes this auditable.
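One way to split a saving carefully is to start from the measured total for the request and divide it in proportion to each lever's standalone estimate, so the parts can never add up to more than the whole. This is a sketch of one possible policy, not the only defensible one; the function and the estimate inputs are hypothetical.

```python
def attribute_saving(baseline_cost: float, actual_cost: float,
                     lever_estimates: dict[str, float]) -> dict[str, float]:
    """Split the measured saving across the levers that fired on one request.

    `lever_estimates` maps each lever to the saving it would claim on its own.
    The returned shares always sum to baseline_cost - actual_cost.
    """
    measured_total = baseline_cost - actual_cost
    weight = sum(lever_estimates.values())
    if weight <= 0:
        return {}   # no lever has a defensible claim on this request
    return {lever: measured_total * estimate / weight
            for lever, estimate in lever_estimates.items()}


# Both levers would each claim $4.00 alone, but the request only got $6.00 cheaper.
shares = attribute_saving(10.0, 4.0, {"compression": 4.0, "cache": 4.0})
print(shares, sum(shares.values()))  # {'compression': 3.0, 'cache': 3.0} 6.0
```

A naive ledger would have credited $8.00 here, more than the request actually saved.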

09 Trust a ledger that can show $0

The most important feature of an honest savings ledger is not the green rows — it is the rows that show $0 or even a negative number. A session with no compressible tool output gets $0 in the compression column. A cache miss that added latency and tokens gets a negative row. Those honest entries are what make the positive entries credible.

If every row in your savings report is positive and every lever shows a saving on every session, the system is not measuring — it is marketing. Push the vendor or the tool for the $0 rows. If they cannot produce them, the numbers are not trustworthy.
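In code, the principle comes down to never clamping the delta. The sketch below is a hypothetical ledger row plus a quick sanity check you could run over any savings report; the names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerRow:
    session_id: str
    lever: str
    saving_usd: float   # may be 0.0 or negative; never clamped


def ledger_row(session_id: str, lever: str, baseline_usd: float,
               actual_usd: float, lever_fired: bool) -> LedgerRow:
    """Record the measured delta for one lever on one session."""
    if not lever_fired:
        return LedgerRow(session_id, lever, 0.0)   # nothing fired, nothing to claim
    return LedgerRow(session_id, lever, round(baseline_usd - actual_usd, 4))


def looks_like_marketing(rows: list[LedgerRow]) -> bool:
    """True when every row claims a positive saving, which is a warning sign."""
    return bool(rows) and all(row.saving_usd > 0 for row in rows)


rows = [
    ledger_row("s1", "compression", 1.0, 0.7, lever_fired=True),    # real saving
    ledger_row("s2", "compression", 1.0, 1.0, lever_fired=False),   # nothing to compress
    ledger_row("s3", "cache", 1.0, 1.2, lever_fired=True),          # the lever cost money
]
print([row.saving_usd for row in rows], looks_like_marketing(rows))
```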

Honest measurement

merido’s savings ledger is built around this principle: each saving is measured against a recorded baseline, attributed to one lever, and reported as $0 when it cannot prove a delta. A ledger that can’t show $0 isn’t a ledger — it’s a story.

Phase 4: CHARGEBACK — attribution to teams and projects

Showback tells teams what they spent. Chargeback makes that spend real in their budget. Getting there requires clean attribution, organizational buy-in, and tooling that can route cost data into your internal accounting — which is why it is the last phase, not the first.

10 Build the attribution model before you need it

Attribution at the gateway level is straightforward: tag each request with developer, team, and optionally project (from an API key or a request header). The cost rolls up cleanly. The hard part is agreeing on what the unit of attribution is — per developer? per project? per product area? — and making sure those tags are applied consistently from day one. Retrofitting attribution to historical data is painful. Start tagging at the gateway from the moment you deploy it.
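A minimal sketch of tagging and roll-up is shown below. The header names are hypothetical, and the precedence rule (tags bound to the API key win, headers only fill gaps) is one possible design choice, made here so a request header cannot reassign spend to another team.

```python
from collections import defaultdict

# Hypothetical header names; use whatever convention your gateway supports.
TEAM_HEADER = "x-team"
PROJECT_HEADER = "x-project"


def tags_for_request(headers: dict[str, str], key_tags: dict[str, str]) -> dict[str, str]:
    """Combine tags bound to the API key with tags supplied in request headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    tags = dict(key_tags)   # e.g. {"team": "payments"}, looked up from the API key
    if TEAM_HEADER in lowered:
        tags.setdefault("team", lowered[TEAM_HEADER])
    if PROJECT_HEADER in lowered:
        tags.setdefault("project", lowered[PROJECT_HEADER])
    return tags


def rollup(records: list[dict], dimension: str) -> dict[str, float]:
    """Sum cost_usd per tag value; untagged requests stay visible as 'untagged'."""
    totals = defaultdict(float)
    for record in records:
        totals[record.get(dimension, "untagged")] += record["cost_usd"]
    return dict(totals)


records = [
    {"team": "payments", "cost_usd": 2.0},
    {"team": "payments", "cost_usd": 1.0},
    {"cost_usd": 0.5},   # missing tag: reported, not silently dropped
]
print(rollup(records, "team"))  # {'payments': 3.0, 'untagged': 0.5}
```

Reporting an explicit "untagged" bucket gives you a running measure of how consistently the tags are being applied.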

11 Distinguish shared infrastructure from team spend

Some AI spend is legitimately shared: a centrally run CI agent, a shared code-review pipeline, an internal assistant used by the whole org. That spend should not be allocated to any one team — it lives in a shared cost pool, like your cloud networking costs. Define that boundary early, because it affects how you present the numbers to leadership and how teams perceive the fairness of the allocation.

12 Chargeback is a governance decision, not a tooling decision

Once your attribution data is clean and trusted, chargeback is primarily an organizational decision: does the company want to hold teams accountable for their AI spend in their budgets? That requires buy-in from Finance and team leads, a decision about how to handle budget overruns, and a process for teams to request budget increases. The tooling — the gateway, the cost reports, the API — is table stakes. The governance process is the actual work.

Where merido fits in this playbook

merido is a self-hosted, BYOK LLM gateway designed as the control plane for exactly this kind of team FinOps program. It runs on your own infrastructure — your keys never leave your network — and provides the instrumentation layer the whole playbook depends on:

Per-request observability. Every request is recorded with tokens, cost, model, latency, and which optimization levers fired — the raw material for showback, budgets, and an honest savings ledger.
Session and daily caps. Configurable token budgets per developer or API key, with model-downgrade enforcement before a hard block — so developers keep working at a lower tier rather than hitting a wall.
Tool-output compression. Bulky tool results (test logs, file reads, git output) are compressed before they re-enter context, removing a major source of repeatedly-billed input tokens. The saving is measured per-request against the uncompressed size and reported in the ledger — not estimated.
A savings ledger that reports $0 when it can’t prove a saving. Cache attribution is excluded to avoid the most common over-claim. Each row is assigned to the lever that produced it.
Attribution tagging. Tag requests by team or project via API key or header; showback reports roll up cleanly. Team-tier chargeback and governance features are on the roadmap.

merido measures and enforces — it does not promise specific dollar savings. The actual saving depends on your team’s workload, model mix, and session patterns. Use the cost calculator to estimate your current spend before deploying, so you have a baseline to measure against — not just a vendor’s headline number to trust.

Get AI coding spend under control

Open source, self-hosted, BYOK. One gateway gives your team visibility, caps, and a measured savings ledger — starting today.


Related guides

Why your AI coding bill explodes — the cost mechanism behind every line item.
AI coding cost calculator — estimate per-developer spend before committing.
Self-hosted LLM gateway — the control plane that makes this playbook possible.
How to reduce Claude Code costs — 8 concrete tactics for the coding CLI.

Frequently asked questions

What is the difference between showback and chargeback for AI coding spend?

Showback means you can tell every team exactly what they spent — you report the numbers, but the bill stays in one place. Chargeback means each team's budget is actually debited: the cost attribution creates a real financial consequence. Showback is the right starting point and is much easier to implement (no changes to billing structure). Chargeback is the harder governance step that requires internal cost allocation, team budgets, and executive buy-in. Start with showback — it changes behavior on its own — and move to chargeback once attribution data is trusted.

How do I prove AI coding savings to a CFO or VP of Engineering?

The only defensible method is to measure a baseline (tokens-in and tokens-out per session, per developer, per model, before any optimization) and then compare production numbers against that baseline after each lever is turned on. Each saving must be attributed to exactly one lever (compression, model downgrade, cache hit) to avoid double-counting. If a saving cannot be traced to a measured delta, the ledger should show $0 — that is the sign of an honest measurement system. A savings report that never shows $0 or a negative row is almost certainly over-counting.

Can I cap per-developer AI coding spend without blocking their work?

Yes — the practical approach is a soft cap that triggers a model downgrade before it triggers a hard stop. Set a daily or weekly token budget per developer; when they approach it, the gateway routes requests to a cheaper model tier automatically instead of rejecting them. A hard block is a last resort for runaway sessions or automated agents without human oversight. The key is to make the cap visible to the developer (not a silent degradation) so they can adjust their workflow or request a budget increase.

Why does an honest savings ledger sometimes show $0 or a negative number?

Because the alternative — a ledger that always shows a positive saving — means the system is inventing numbers. A measured savings ledger should show $0 when it cannot attribute a delta to a specific lever (e.g. the session had no compressible tool output, or the model was already the cheapest available). It should show a negative row when a tactic increased cost in a specific session (e.g. a cache miss that added latency and tokens). These honest rows are what make the positive rows trustworthy. If every row is green, the system is not measuring — it is marketing.