AI gateway
A server that sits between your coding tools and LLM providers, handling routing, auth, and observability so clients only need one endpoint. Think of it as a reverse proxy for language-model API calls.
Plain-English definitions of the terms behind AI coding costs and gateways. Use these anchors to link to a specific term.
A server that sits between your coding tools and LLM providers, handling routing, auth, and observability so clients only need one endpoint. Think of it as a reverse proxy for language-model API calls.
Another name for an AI gateway, emphasizing the HTTP-forwarding role. A proxy relays requests to one or more upstream providers and returns the response, often normalizing the wire format along the way.
A deployment model where you supply your own provider API keys rather than using pooled credentials managed by a third party. Your keys are never shared and you remain in direct billing relationship with the provider.
The running total of all tokens sent to a model across a session or task, not just the latest message. Because agentic tools replay the full conversation history each turn, cumulative input tokens grow faster than most developers expect.
The maximum number of tokens a model can process in a single request, covering both the prompt and the generated reply. Everything outside this window is invisible to the model; hitting the limit forces truncation or re-summarization.
A workflow where an AI assistant autonomously plans, edits, runs tools, and iterates to complete a programming task with minimal human steps per turn. Each tool call and observation is appended to the context, which is why costs compound quickly.
A provider feature that reuses a previously processed prefix of a prompt, charging a lower cache-read rate for the cached portion instead of reprocessing it. Savings depend on how much of your system prompt stays constant across turns.
Filtering or truncating the output returned by CLI tools before it is sent to the model, reducing input-token volume. Compression ratios of 60–90% on specific command output are achievable depending on the command, though results vary.
A REST interface that follows the same request and response shapes as OpenAI's Chat Completions endpoint, so any client written for OpenAI also works against it. Most AI gateways and many non-OpenAI providers expose this format for drop-in compatibility.
A gateway feature where a single named model identifier maps to a policy that selects among multiple real models or accounts at runtime. The policy might prefer lowest latency, lowest cost, or highest availability, transparent to the client.
Failover automatically retries a request with an alternate provider or account when the primary fails. A circuit breaker adds state: after a threshold of failures it temporarily stops sending traffic to the degraded target, preventing cascading delays.
Showback reports AI spending by team or project for visibility without moving money. Chargeback goes further: the usage cost is actually allocated or billed back to the consuming team's budget.
The basic unit of text that language models process, roughly corresponding to three to four characters of English text. Providers bill by the number of input and output tokens in each request.