Self-hosted LLM gateway: own your keys, your data, your routing

A gateway in front of your LLM providers is the right architecture — but you do not have to rent it from a cloud SaaS. Run your own and the keys, the traffic and the routing logic stay on infrastructure you control. Here is what a self-hosted LLM gateway is, why you would want one, what to look for, and how merido delivers it as a single open-source binary.

Guide · open source · self-hosted · BYOK · Rust

What a self-hosted LLM gateway is

An LLM gateway is a proxy that sits between your applications — or coding CLIs like Claude Code, Codex and Cursor — and the model providers behind them. Instead of every client talking directly to a dozen different APIs, they all hit one stable endpoint. The gateway then resolves the model, picks a provider, handles failover, and adds the cross-cutting concerns you would otherwise reimplement in every app: authentication, rate limiting, observability, cost tracking and token optimization.

A self-hosted gateway is simply that proxy running on infrastructure you own — your laptop, a VM, a container in your cluster — rather than as a hosted service someone else operates. The difference is not cosmetic: with a self-hosted gateway, your API keys and your request/response data never leave your boundary.

Why run your own instead of a cloud router

Hosted LLM routers are convenient, but they put a third party on the critical path of every request. Running the gateway yourself buys back four things that matter once you are past a prototype:

Data and keys stay on your infra. Prompts, completions and credentials never transit a vendor’s servers. For regulated, proprietary or just sensitive codebases, that is often the deciding factor.
No per-request markup. Hosted routers typically add a margin on top of provider pricing. Self-hosting means you pay the provider directly — the gateway’s only cost is the box it runs on.
No vendor lock-in. The gateway is open source and the API is standard, so you are never trapped behind a proprietary control plane or a pricing change you did not choose.
ToS-clean BYOK. You call providers with your own keys, under your own billing — never pooling, sharing or reselling credentials through someone else’s account. That is the compliant way to do bring-your-own-key.

What to look for in a self-hosted gateway

01An OpenAI-compatible API

The whole point of a gateway is that clients do not have to change. If it speaks the OpenAI-compatible API, your existing SDKs and coding CLIs just repoint their base URL and keep working. Bonus points if it also translates Anthropic, Gemini and Responses formats internally, so one endpoint can front providers that speak different wire formats.

02Multi-provider routing and failover

A single hard-coded provider is a single point of failure. Look for routing that can spread requests across the providers and accounts you own, by cost and latency, and fail over automatically when one is rate-limited, erroring or down — ideally with circuit breakers so a flaky upstream is parked instead of retried into the ground.

03Observability you can act on

You cannot optimize what you cannot see. The gateway should record tokens, cost and latency per request, expose metrics, and ideally give you a dashboard rather than just a log file. Spend visibility is the prerequisite for every cost decision you will make later.

04Token and cost optimization

Agentic coding tools resend the whole conversation every turn, so bulky tool output — test logs, git output, file reads — gets paid for again and again. A gateway that compresses those payloads before they re-enter context removes a real source of spend. Caps and budgets turn “how much did that cost?” into a number you set in advance. (For the full playbook, see how to reduce your Claude Code costs.)

05Performance under sustained load

The gateway is on the hot path of every call, so its own overhead and stability matter. A compiled, low-overhead runtime keeps the proxy from becoming the bottleneck when many streaming requests are in flight at once.

06Single-binary simplicity

The fewer moving parts, the less there is to operate and secure. A gateway you can run as one process — no sidecars, no required external database to get started — is dramatically easier to self-host than a stack of services you have to wire together yourself.

How merido fits

merido is an open-source, local-first LLM gateway written in Rust, built to be the gateway you run yourself:

One static binary, one port. The local profile runs on embedded SQLite and serves the API, a built-in dashboard and the docs from a single port (8788) — nothing else to install. The same code scales to a Postgres + Redis cloud profile for multi-tenant, multi-instance deployments, behind the same storage abstractions.
OpenAI-compatible, 40+ providers. Point Claude Code, Codex, Cursor, Cline or any OpenAI SDK at merido and keep working. It translates every request through a provider-agnostic canonical representation, so OpenAI, Anthropic, Gemini and Responses formats all interoperate behind one endpoint.
Cost-, quota- and latency-aware routing with circuit breakers and health checks spreads load across the accounts you own and fails over cleanly. Define virtual models that map one alias to a prioritized list of provider targets with fallback tiers.
Tool-output compression. merido can shrink bulky tool results substantially depending on the command, losslessly — removing a major, repeatedly-billed source of input tokens before it enters context.
A built-in dashboard and Token-Optimization Advisor surface burn rate, usage and concrete suggestions, with a savings ledger that reports measured numbers against a baseline — and shows nothing when it cannot prove a saving.
Encryption-at-rest protects stored credentials with a master key that stays on your machine.

BYOK · self-hosted · open source

merido uses your own API keys, runs on your own infrastructure, and never pools, shares or resells credentials. It is MIT / Apache-2.0 licensed, so there is no proprietary control plane between you and your providers — your keys, your billing, your machine.

Written in Rust for low overhead and stable performance under sustained, streaming load, merido is designed to be the boring, dependable piece of infrastructure a gateway should be. Read the documentation to see how routing, virtual models and the dashboard fit together.

Run your own LLM gateway in minutes

Open source, single self-hosted binary, on your own keys. Point your existing clients at one endpoint and keep working.

Get started →Read the docs

Related guides

How to reduce Claude Code costs — 8 concrete tactics for the coding CLI.
How to reduce Cursor AI costs — same playbook for Cursor.
Claude Code vs Cursor — a fair comparison.
AI coding cost calculator — estimate your bill.
Open-source AI gateway alternatives — LiteLLM, Portkey, Helicone and merido compared.
Why your AI coding bill explodes — the cost mechanism a gateway attacks.

Frequently asked questions

What is a self-hosted LLM gateway?

A proxy you run on your own infrastructure that sits between your apps or coding CLIs and the LLM providers behind them. It exposes one stable, OpenAI-compatible API, routes each request, handles failover, and adds observability, cost controls and token optimization — without your keys or traffic passing through a third-party cloud.

Why run my own gateway instead of a cloud SaaS?

To keep keys and request data on infrastructure you control, avoid the per-request markup hosted routers charge, and sidestep vendor lock-in. It is also the ToS-clean way to do BYOK: you call providers directly with your own credentials instead of pooling or sharing keys through someone else's account.

Is merido a LiteLLM alternative?

It does the same core job — one OpenAI-compatible endpoint in front of many providers — and adds cost/latency-aware routing with circuit breakers, tool-output compression, a dashboard and Token-Optimization Advisor, and encryption-at-rest. It ships as a single Rust binary that runs from embedded SQLite locally and scales to Postgres + Redis.

Do I have to change my existing clients?

No. merido speaks an OpenAI-compatible API, so most SDKs and coding CLIs just repoint their base URL and keep working. It translates between OpenAI, Anthropic, Gemini and Responses formats internally. Get started here.