Guardrails

Guardrails are merido's request/response safety layer. Every call that flows through the gateway can be scanned — on the way in (untrusted user / tool text) and on the way out (model output) — and blocked, redacted, or flagged before it reaches the model or the client. All verdicts emit privacy-safe audit events, so you get a compliance-evidence trail without ever logging the sensitive content itself.

Guardrails are off by default — a request pays zero overhead until you turn a rail on.

The rails

Rail	Scans	Modes	Notes
PII	request	`off` · `redact` · `block`	Reversible redaction: the model sees placeholders, the client gets the real values back. Detects email, phone, SSN, ITIN, credit card (Luhn), IBAN (mod-97), IP, MAC.
Prompt injection	request	`off` · `warn` · `block`	Rule-based scoring (OWASP LLM01) with anti-obfuscation normalization. See the limitation note below.
Secret leak	response	`off` · `redact` · `terminate`	Detects 11 vendor key shapes plus generic high-entropy strings; works on streaming deltas.
Custom rules	request + response	per-rule `warn` · `redact` · `block`	Your own keyword / regex denylists (banned words, competitor names, internal codenames, custom secret shapes).
Moderation	request	`off` · `warn` · `block`	Toxicity / hate / sexual / self-harm via an OpenAI-`/v1/moderations`-compatible endpoint.
External providers	request	`off` · `warn` · `redact` · `block`	Plug in Microsoft Presidio (PII), Lakera Guard (prompt safety), or a generic webhook.

The built-in PII, injection, and secret rails are pure logic (no model call, no network), so they are deterministic and add negligible latency. The moderation and external-provider rails make an HTTP call, and they fail open: if the classifier is unreachable or returns an error, the request passes through rather than failing.
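To make the "pure logic" point concrete, here is a minimal sketch of reversible PII redaction with a Luhn check for card numbers. It is an illustration only, not merido's implementation: the regexes, the `<KIND_n>` placeholder format, and the function names are assumptions made for this example.

```python
import re

# Hypothetical patterns for illustration; real detectors are broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str):
    """Replace detected values with placeholders; return (text, mapping)."""
    mapping = {}

    def make_sub(kind, validate=None):
        def sub(match):
            value = match.group(0)
            if validate is not None and not validate(value):
                return value  # looks like a card but fails the checksum
            placeholder = f"<{kind}_{len(mapping) + 1}>"  # assumed format
            mapping[placeholder] = value
            return placeholder
        return sub

    text = EMAIL_RE.sub(make_sub("EMAIL"), text)
    text = CARD_RE.sub(make_sub("CARD", luhn_ok), text)
    return text, mapping


def restore(text: str, mapping: dict) -> str:
    """Put the real values back into text that contains placeholders."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text
```

In this sketch the model would only ever see the output of `redact`, and `restore` would run on the model's reply before it is returned to the client.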

Configuring rails

The PII / injection / secret rails and the master switch are global settings, editable from Govern → Guardrails in the dashboard or via PUT /api/settings:

guardrails.enabled   = true | false      # master kill-switch
guardrails.pii       = off | redact | block
guardrails.injection = off | warn | block
guardrails.secrets   = off | redact | terminate

Scoping

Each rail resolves from the most specific override to the least, so you can set a strict default and relax (or tighten) it per route or per gateway key:

key.<key_id>.guardrails.<rail>     # most specific  (per gateway key)
route.<route>.guardrails.<rail>    # per requested model / route
guardrails.<rail>                  # global default
off                                # when unset

The kill-switch follows the same chain: an explicit …guardrails.enabled = "false" at any level disables every rail at and below that scope, and an explicit "true" at a more-specific level overrides a "false" further out. The dashboard Guardrails page has separate Per-route overrides and Per-key overrides tables (the per-key table lets you pick a gateway key and give it its own policy); they map to the route.* / key.* settings keys above.
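The resolution order can be sketched as a lookup over a flat settings map. This is a hedged illustration of the precedence described above, not the gateway's code: the function names are hypothetical, and treating an unset kill-switch as enabled (`enabled_default=True`) is an assumption of this example.

```python
from typing import Mapping, Optional


def _lookup(settings: Mapping[str, str], suffix: str,
            key_id: Optional[str], route: Optional[str]) -> Optional[str]:
    """Return the first explicitly set value, most specific scope first."""
    names = []
    if key_id is not None:
        names.append(f"key.{key_id}.guardrails.{suffix}")
    if route is not None:
        names.append(f"route.{route}.guardrails.{suffix}")
    names.append(f"guardrails.{suffix}")
    for name in names:
        if name in settings:
            return settings[name]
    return None


def effective_mode(settings: Mapping[str, str], rail: str,
                   key_id: Optional[str] = None,
                   route: Optional[str] = None,
                   enabled_default: bool = True) -> str:
    """Resolve the mode of one rail for a request."""
    enabled = _lookup(settings, "enabled", key_id, route)
    # Assumption: an unset kill-switch falls back to `enabled_default`.
    is_enabled = enabled_default if enabled is None else str(enabled).lower() == "true"
    if not is_enabled:
        return "off"
    # An unset rail resolves to "off", matching the last line of the chain.
    return _lookup(settings, rail, key_id, route) or "off"


settings = {
    "guardrails.enabled": "true",
    "guardrails.pii": "block",
    "route.gpt-4o.guardrails.pii": "redact",   # relax PII on one route
    "key.k1.guardrails.enabled": "false",      # switch everything off for one key
}
print(effective_mode(settings, "pii", route="gpt-4o"))               # redact
print(effective_mode(settings, "pii", key_id="k1", route="gpt-4o"))  # off
```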

Custom rules

Custom rules let you enforce your own policies without code. Each rule has a matcher (a case-insensitive keyword list with optional word-boundary matching, or a regex), a scope (input / output / both), and an action (warn / redact / block). When several rules match the same text, the strongest action wins (block > redact > warn). An optional applies_to scope narrows a rule to specific routes and/or gateway keys.
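As a rough sketch of how "strongest action wins" could be evaluated, the example below models a rule as a compiled pattern plus a scope and an action. The `Rule` class and helper names are hypothetical, `applies_to` is left out for brevity, and Python's backtracking `re` module is used purely for illustration.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional

SEVERITY = {"warn": 1, "redact": 2, "block": 3}  # block > redact > warn


@dataclass
class Rule:
    name: str
    pattern: re.Pattern
    scope: str    # "input" | "output" | "both"
    action: str   # "warn" | "redact" | "block"


def keyword_rule(name: str, keywords: Iterable[str], scope: str,
                 action: str, word_boundary: bool = True) -> Rule:
    """Build a case-insensitive keyword matcher as a single regex."""
    alternation = "|".join(re.escape(k) for k in keywords)
    body = rf"\b(?:{alternation})\b" if word_boundary else f"(?:{alternation})"
    return Rule(name, re.compile(body, re.IGNORECASE), scope, action)


def evaluate(rules: Iterable[Rule], text: str, direction: str) -> Optional[str]:
    """Return the strongest action among matching rules, or None."""
    matched = [r for r in rules
               if r.scope in (direction, "both") and r.pattern.search(text)]
    if not matched:
        return None
    return max(matched, key=lambda r: SEVERITY[r.action]).action


rules = [
    keyword_rule("codenames", ["Project Falcon"], "both", "redact"),
    keyword_rule("competitors", ["acme"], "input", "warn"),
    Rule("ticket-ids", re.compile(r"SEC-\d{4}"), "output", "block"),
]
print(evaluate(rules, "Compare ACME with project falcon", "input"))  # redact
```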

Manage them under Govern → Guardrails → Custom rules, or via /api/guardrails/rules (CRUD, admin-gated). Invalid regex is rejected at save time. Regex patterns run on a linear-time, ReDoS-safe engine, so a user-supplied pattern can't hang the gateway.

Custom rules on responses currently apply to non-streaming replies; streaming responses still run PII restore + secret scanning, but not custom output rules.

Moderation

Point the moderation rail at any OpenAI-/v1/moderations-compatible endpoint:

guardrails.moderation            = off | warn | block
guardrails.moderation.endpoint   = https://api.openai.com/v1/moderations
guardrails.moderation.api_key    = <secret>          # stored masked, write-only
guardrails.moderation.model      = omni-moderation-latest
guardrails.moderation.threshold  = 0.5               # score 0–1
guardrails.moderation.categories = violence, self-harm   # empty = all categories

A request is flagged when the score of any enforced category reaches the threshold. The rail only runs when the mode is not off and an API key is set.
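The flagging rule can be sketched against the `results[].category_scores` shape that OpenAI's moderation endpoint returns. The function names below are hypothetical, and the code only illustrates the threshold and category logic described above.

```python
from typing import Iterable, List, Mapping, Optional


def should_run(mode: str, api_key: Optional[str]) -> bool:
    """The rail is skipped unless a mode other than off and an API key are set."""
    return mode != "off" and bool(api_key)


def flagged_categories(response: Mapping, threshold: float = 0.5,
                       categories: Iterable[str] = ()) -> List[str]:
    """Return the enforced categories whose score reaches the threshold."""
    enforced = set(categories)  # empty set means every category is enforced
    hits = set()
    for result in response.get("results", []):
        for category, score in result.get("category_scores", {}).items():
            if enforced and category not in enforced:
                continue
            if score >= threshold:
                hits.add(category)
    return sorted(hits)


response = {"results": [{"category_scores": {
    "violence": 0.72, "self-harm": 0.01, "sexual": 0.90,
}}]}
print(flagged_categories(response, 0.5, ["violence", "self-harm"]))  # ['violence']
```

With the category list left empty, the same response would also be flagged for `sexual`, because every category is then enforced.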

External providers

Register external services under Govern → Guardrails → External providers or via /api/guardrails/providers; registered providers are called during the request scan:

Presidio — Microsoft's PII analyzer (/analyze); supports redact (spans are masked) as well as warn / block.
Lakera Guard — prompt-injection / safety; warn / block.
Webhook — any service that returns the normalized { "flagged": bool, "findings": [{ "label", "start", "end", "score" }] }.

API keys are encrypted at rest and never returned by the API. Each call fails open. (AWS Bedrock Guardrails is not yet supported — it requires SigV4 signing.)
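A minimal sketch of fail-open handling and span masking for a provider that returns the normalized shape above. The provider call is passed in as a plain callable so the example needs no network; the function names, the `error` field, and the `<LABEL>` mask format are assumptions of this illustration.

```python
from typing import Callable, Dict, List


def scan_fail_open(call_provider: Callable[[str], dict], text: str) -> dict:
    """Call a provider; on any failure, let the request pass (fail open)."""
    try:
        body = call_provider(text)
        return {
            "flagged": bool(body["flagged"]),
            "findings": list(body.get("findings", [])),
            "error": False,
        }
    except Exception:
        # Unreachable or malformed provider: treat the text as not flagged.
        return {"flagged": False, "findings": [], "error": True}


def mask_spans(text: str, findings: List[Dict]) -> str:
    """Replace each finding's [start, end) span with a label mask."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for f in sorted(findings, key=lambda f: f["start"], reverse=True):
        text = text[:f["start"]] + f"<{f['label']}>" + text[f["end"]:]
    return text


def fake_provider(text: str) -> dict:
    # Stand-in for a webhook response in the normalized shape.
    return {"flagged": True,
            "findings": [{"label": "PHONE", "start": 5, "end": 13, "score": 0.98}]}


def down_provider(text: str) -> dict:
    raise TimeoutError("provider unreachable")


verdict = scan_fail_open(fake_provider, "Call 555-0100 now")
print(mask_spans("Call 555-0100 now", verdict["findings"]))  # Call <PHONE> now
print(scan_fail_open(down_provider, "Call 555-0100 now")["flagged"])  # False
```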

Activity & audit

Every non-off verdict (block / warn / redact / terminate) emits an audit event whose details contain only counts, kinds, categories, and spans — never the raw PII or secret. The Activity panel on the Guardrails page rolls these up per rail over a 24h / 7d / 30d window with a recent-events feed, and each event deep-links to the matching request in Monitor → Logs.
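To illustrate what "only counts, kinds, categories, and spans" could look like in practice, the sketch below builds an event's details from findings while dropping the matched text. The field names and the `value` key on a finding are hypothetical and do not reflect merido's actual event schema.

```python
from collections import Counter
from typing import Dict, List


def audit_details(rail: str, action: str, findings: List[Dict]) -> Dict:
    """Summarize findings without copying any matched text into the event."""
    kinds = Counter(f["label"] for f in findings)
    return {
        "rail": rail,
        "action": action,
        "count": len(findings),
        "kinds": dict(kinds),
        "spans": [[f["start"], f["end"]] for f in findings],
    }


findings = [
    {"label": "EMAIL", "start": 5, "end": 20, "value": "ada@example.com"},
    {"label": "EMAIL", "start": 30, "end": 45, "value": "bob@example.com"},
    {"label": "SSN", "start": 50, "end": 61, "value": "078-05-1120"},
]
details = audit_details("pii", "redact", findings)
print(details["count"], details["kinds"])  # 3 {'EMAIL': 2, 'SSN': 1}
```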

Limitation: prompt injection is Warn-grade

The built-in injection rail is a deterministic, rule-based scorer — fast and zero-dependency, but it can be bypassed by paraphrasing or translation. Treat it as a detection signal, not a hard security boundary. For stronger coverage, layer an external provider (e.g. Lakera Guard) or the moderation rail on top, and keep your trusted system prompt out of untrusted channels (the rail never scans the operator's system prompt, only user / tool content).
