Guardrails: Validate LLM Output Before It Reaches Your Users
Here's a rule that will save you an incident: the output of a language model is untrusted input. You wouldn't render a raw form submission straight into your UI or pass it to your database without validation. An LLM's response deserves exactly the same suspicion — it's just a very fluent, very confident source of untrusted data.
The model that returns perfect JSON 99% of the time will return a markdown code fence, an apology, or a half-finished object on the 1% that hits production at 2 a.m. Guardrails are the layer that catches that before your users do.
What can actually go wrong
Failure modes cluster into a few buckets:
- Schema violations — missing fields, wrong types, an extra prose preamble around the JSON.
- Out-of-policy content — toxic text, PII, or advice your product shouldn't give.
- Injected instructions — a user (or a retrieved document) convinced the model to ignore your system prompt. The OWASP Top 10 for LLM Applications puts prompt injection at #1 for good reason.
- Hallucinated references — links, citations, or API calls that don't exist.
Each needs a different guard. The mistake is assuming a better prompt fixes all of them. Prompts reduce failure rate; they don't make it zero.
Layer 1: Make malformed output impossible at the source
The cheapest guardrail is constraining generation so invalid output can't be produced. Most providers now support structured outputs that conform to a JSON Schema — OpenAI's structured outputs and tool-calling on Anthropic's API both let you pin a response shape. Use them.
But "valid JSON" is not "valid data." A schema can say quantity is a number, but the schema subset a provider enforces may not cover range or pattern constraints, and no schema knows your current stock limit. So you still validate on your side:
    import logging

    from pydantic import BaseModel, Field, ValidationError

    log = logging.getLogger(__name__)

    class Order(BaseModel):
        sku: str = Field(pattern=r"^[A-Z]{3}-\d{4}$")
        quantity: int = Field(gt=0, le=100)

    def parse_order(raw: str) -> Order | None:
        """Return a validated Order, or None if the model's output is unusable."""
        try:
            return Order.model_validate_json(raw)
        except ValidationError as e:
            # Stdlib logging takes %-style args rather than arbitrary keyword fields.
            log.warning("LLM output failed validation: %s", e.errors())
            return None
Pydantic is the workhorse here for Python; Zod plays the same role in TypeScript. The point is that the model's output crosses a validation boundary before it touches anything that matters.
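To make the "valid JSON is not valid data" point concrete, here is a small sketch, assuming Pydantic v2 and redefining the same `Order` model so it runs on its own. It pushes three hypothetical model replies through the boundary: one valid, one that is well-formed JSON with an out-of-range quantity, and one wrapped in prose. The `explain` helper is invented for illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class Order(BaseModel):
    sku: str = Field(pattern=r"^[A-Z]{3}-\d{4}$")
    quantity: int = Field(gt=0, le=100)


def explain(raw: str) -> list[str]:
    """Return the field paths that failed validation (empty list means valid)."""
    try:
        Order.model_validate_json(raw)
        return []
    except ValidationError as e:
        # Each error carries a `loc` tuple naming the offending field;
        # a reply that is not JSON at all yields a single error with an empty path.
        return [".".join(str(part) for part in err["loc"]) for err in e.errors()]


good = '{"sku": "ABC-1234", "quantity": 2}'
out_of_range = '{"sku": "ABC-1234", "quantity": 500}'  # valid JSON, invalid data
wrapped = 'Sure! Here is your order: {"sku": "ABC-1234", "quantity": 2}'

print(explain(good))
print(explain(out_of_range))
print(explain(wrapped))
```

The failing field paths are also a natural thing to feed back to the model if you retry.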
Layer 2: Validate semantics, not just shape
Structural validation catches typos. Semantic validation catches lies. If the model returns a citation, check the URL resolves. If it returns a SQL query to run, parse it and confirm it only touches allowed tables and is read-only. If it returns a price, confirm it exists in your catalog.
    # Claim, KNOWN_SOURCE_URLS, and response come from your application.
    def guard_citation(claim: Claim) -> bool:
        # The model can hallucinate a plausible-looking source; verify it exists.
        return claim.url in KNOWN_SOURCE_URLS

    vetted = [c for c in response.claims if guard_citation(c)]
Never let an LLM's output flow directly into a privileged action (a database write, a shell command, an email send) without a deterministic check in between. The model proposes; your code disposes.
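For the SQL case, one way to get a deterministic read-only, allow-listed check is to let the database's own parser do the work. The sketch below assumes SQLite and uses the standard library's `sqlite3.Connection.set_authorizer` hook. The table names, `ALLOWED_TABLES`, and `run_guarded` are hypothetical, and other databases would need their own mechanism (a read-only role, for example).

```python
import sqlite3

ALLOWED_TABLES = {"products"}  # hypothetical allow-list


def authorizer(action, arg1, arg2, db_name, source):
    """Permit plain SELECTs that read only allow-listed tables; deny the rest."""
    if action == sqlite3.SQLITE_SELECT:
        return sqlite3.SQLITE_OK
    if action == sqlite3.SQLITE_READ and arg1 in ALLOWED_TABLES:
        return sqlite3.SQLITE_OK  # arg1 is the table, arg2 the column
    return sqlite3.SQLITE_DENY  # writes, DDL, PRAGMA, functions, other tables


def run_guarded(conn, sql):
    """Run model-proposed SQL; return rows, or None if the guard rejects it."""
    try:
        return conn.execute(sql).fetchall()
    except (sqlite3.Error, sqlite3.Warning):
        return None  # fail closed: denied, malformed, or multi-statement


# Demo database. Setup happens before the authorizer is installed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT, price REAL)")
conn.execute("CREATE TABLE secrets (token TEXT)")
conn.execute("INSERT INTO products VALUES ('ABC-1234', 9.5)")
conn.commit()
conn.set_authorizer(authorizer)

print(run_guarded(conn, "SELECT sku FROM products"))   # allowed table
print(run_guarded(conn, "DROP TABLE products"))        # not read-only
print(run_guarded(conn, "SELECT token FROM secrets"))  # table not on the list
```

The authorizer runs while the statement is being compiled, so a denied query is rejected before it executes. This sketch is deliberately strict: it also denies SQL function calls, which you may want to allow selectively.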
Layer 3: Content and safety filters
For user-facing text, run output through a moderation step. Provider moderation endpoints (such as OpenAI's moderation API) catch the obvious categories cheaply. For domain rules — "never give medical dosages," "never reveal another tenant's data" — you'll need your own classifiers or rules, because generic moderation doesn't know your policy.
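Domain rules often start as plain pattern checks before they grow into classifiers. The following is a minimal sketch using only the standard library; the two patterns (10 to 12 digit runs as "account numbers", names ending in `.internal` as "internal hostnames") are assumptions chosen for illustration, not a real policy.

```python
import re

# Hypothetical policy patterns; tune these to your own data formats.
BLOCK_PATTERNS = {
    "account_number": re.compile(r"\b\d{10,12}\b"),
    "internal_hostname": re.compile(
        r"\b[\w-]+(?:\.[\w-]+)*\.internal\b", re.IGNORECASE
    ),
}


def policy_violations(text: str) -> list[str]:
    """Return the names of every rule the text trips (empty list means clean)."""
    return [name for name, pattern in BLOCK_PATTERNS.items() if pattern.search(text)]


print(policy_violations("Your account 1234567890 is active."))
print(policy_violations("Connect to db01.prod.internal on port 5432."))
print(policy_violations("Your order ships tomorrow."))
```

Returning rule names rather than a bare boolean makes the result easy to log and easy to turn into a test case later.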
A practical pattern is a small, fast secondary model acting as a judge:
    System: You are a policy checker. Given an assistant reply, return
    { "allowed": boolean, "reason": string }. Block replies that contain
    account numbers, internal hostnames, or instructions to bypass auth.
It's not perfect, but defense in depth means each layer only has to catch what slipped past the last one.
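The judge is itself a model, so its verdict is untrusted input too. A minimal sketch of reading that verdict, assuming the judge replies with the `{ "allowed": ..., "reason": ... }` shape from the prompt above; `judge_allows` is a hypothetical helper name, and anything it cannot parse counts as a block.

```python
import json


def judge_allows(verdict_raw: str) -> bool:
    """Return True only when the judge's reply is a JSON object with allowed == true."""
    try:
        verdict = json.loads(verdict_raw)
    except json.JSONDecodeError:
        return False  # unparseable verdict: fail closed
    # `is True` rejects truthy stand-ins such as "yes" or 1.
    return isinstance(verdict, dict) and verdict.get("allowed") is True


print(judge_allows('{"allowed": true, "reason": "no sensitive data"}'))
print(judge_allows('{"allowed": "yes", "reason": "looks fine"}'))
print(judge_allows("I think this reply is fine."))
```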
Layer 4: Fail closed, then recover
When a guard trips, decide the fallback deliberately:
- Retry with the validation error fed back to the model ("your JSON was missing sku"). One retry fixes most transient schema misses.
- Repair deterministically when you can (strip a markdown fence, coerce a string to a number).
- Degrade to a safe default or a templated "I can't help with that" — never a stack trace.
    # call_model, add_error_feedback, and fallback_response are application helpers.
    def get_order(prompt: str):
        for attempt in range(2):  # first try plus one retry
            raw = call_model(prompt, schema=Order)
            order = parse_order(raw)
            if order:
                return order
            prompt = add_error_feedback(prompt, raw)  # tell it what was wrong
        return fallback_response()  # fail closed
The worst outcome isn't a blocked response; it's a malformed one that looks fine and corrupts state downstream.
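The "repair deterministically" option can be as small as removing a wrapping markdown fence before validation. A sketch using only the standard library; `strip_code_fence` is a hypothetical helper, and it only handles the case where the whole reply is one fenced block. The repaired string should still go through the same validation as any other reply.

```python
import json
import re

FENCE = "`" * 3  # built indirectly so the example can itself live inside a fence

# Opening fence with an optional language tag, the payload, then the closing fence.
FENCE_RE = re.compile(
    rf"^\s*{FENCE}[\w-]*[ \t]*\n(.*?)\n?[ \t]*{FENCE}\s*$",
    re.DOTALL,
)


def strip_code_fence(raw: str) -> str:
    """Return the payload inside a single wrapping fence, else the input stripped."""
    match = FENCE_RE.match(raw)
    return match.group(1).strip() if match else raw.strip()


fenced = f'{FENCE}json\n{{"sku": "ABC-1234", "quantity": 2}}\n{FENCE}'
print(json.loads(strip_code_fence(fenced)))
```

Replies with prose around the fence are left untouched here on purpose, so they fail validation and take the retry path instead of being guessed at.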
Don't forget observability
Guardrails are only useful if you know when they fire. Log every validation failure, every blocked response, and every retry, with enough context to reproduce. Those logs are also your eval set: today's guardrail trip is tomorrow's regression test. If you're tracking quality with a framework, wire guard outcomes into it the same way you'd track test results.
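As a rough sketch of what "log every guardrail event" could look like with only the standard library: one JSON line per event, plus an in-process counter. The function name, field names, and logger name are assumptions; in practice you would send these to whatever logging or metrics stack you already run, and consider redacting the raw output if it may contain sensitive data.

```python
import json
import logging
from collections import Counter

log = logging.getLogger("guardrails")  # hypothetical logger name
guard_counts = Counter()  # (guard, outcome) -> number of events


def record_guard_event(guard: str, outcome: str, raw_output: str, **context) -> dict:
    """Log one guardrail event as a JSON line and count it."""
    event = {"guard": guard, "outcome": outcome, "raw_output": raw_output, **context}
    guard_counts[(guard, outcome)] += 1
    # sort_keys keeps lines stable for diffing; default=str handles odd context values.
    log.warning("guardrail_event %s", json.dumps(event, sort_keys=True, default=str))
    return event


event = record_guard_event("schema", "retry", '{"sku": "abc"}', attempt=1)
print(event["guard"], guard_counts[("schema", "retry")])
```

Because each event keeps the raw output and context, a logged failure can be replayed later as a regression case.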
Takeaways
- Treat LLM output as untrusted input — validate it before it touches anything privileged.
- Constrain generation with structured outputs, then validate shape and semantics in your own code.
- Add content/policy filters for user-facing text; a small judge model is a useful second layer.
- Fail closed: retry with feedback, repair deterministically, or degrade to a safe default.
- Log every guardrail event — it's both your alarm and your future eval set.

