Many AI Agent tutorials propose the same fix for bad output: reflection. Your agent generates garbage JSON? Just add another LLM call to “review” it. The second call critiques the first, the first tries again, and voilà — quality improves. It seems clean, elegant, and academic.
Well, I’ve shipped agents to production at a large-scale web company — systems that generated deployment configs, API payloads, database queries. And I can tell you from painful experience: reflection doesn’t work for structured output. Not reliably, and not when it actually matters.
Here’s what happens in practice. Your agent generates JSON. It’s wrong about a third of the time, with missing fields, wrong types, and violated business rules. You add a reflection step because that’s what the tutorials say. Now it fails one in six times.
This sounds like progress until you realize that those remaining failures are invisible. The reflection step said “looks good!” and waved them through. You’ve built a system that’s confidently wrong, and you won’t know until something breaks in production at 2am on a Saturday.
I spent weeks debugging this loop before I found a pattern that actually works. It’s embarrassingly simple, it gets me near-perfect correctness, and it doesn’t require any clever reflection prompts. Let me show you.
Prerequisites
To get the most out of this article, you should be familiar with:
- Basic Python (functions, dictionaries, type hints)
- How LLM APIs work at a high level (sending a prompt, getting a completion back)
- What a JSON Schema is (you don’t need to be an expert — the code explains itself)
The Problem with Reflection
My take: asking an LLM to critique another LLM’s structured output is like asking someone who’s bad at math to grade someone else who’s bad at math. They’d likely have the same or similar blind spots. The same weights that produced the error are now being asked to detect the error. Why would they suddenly get it right on the second pass?
Think about what you’re actually asking the model to do during a reflection step. “Hey, look at this JSON you just generated. Does timeout_seconds need to be less than interval_seconds? Are the replicas and CPU limits consistent with the business rules I listed in the system prompt?”
The model reads it over, pattern-matches against what “looks right,” and says “yep, all good.” It missed that constraint during generation. It’s going to miss it during review too, because it’s the same model doing the same kind of reasoning.
The failure mode that kept biting me wasn’t wrong output — it was approved wrong output. False positives. The reflection step says “this configuration is correct” when it absolutely isn’t.
A system that says “I failed, try again” is annoying but safe. A system that says “this is correct” when it’s broken? That’s the config that sails through your pipeline and takes down your service. That’s a 2am page.
Reflection works beautifully for open-ended stuff — improving the tone of an email, catching logical gaps in an essay, suggesting a better structure for a blog post. But for structured output with hard constraints? You need something that doesn’t guess. You need something deterministic.
The Fix: Deterministic Validation
The pattern for the fix is dead simple:
Generate → Validate with a real validator → Feed exact errors back → Retry.
That’s it. No second LLM call to “critique.” No chain-of-thought reasoning about correctness. Just a function that returns true or false with specific error strings — the same kind of validator you’d write for a form submission or an API request.
Here’s the key insight, and honestly it’s the whole article in one sentence: LLMs are excellent at fixing errors when you tell them exactly what’s wrong. They’re terrible at finding their own errors.
When you tell a model “your output had these specific errors: timeout_seconds must be < interval_seconds, replicas > 5 requires cpu_limit >= 1.0”, it fixes both on the next try almost every time.
The fixing is trivial. The finding is the hard part. With this technique, you outsource the finding to a deterministic function that applies every rule you wrote, the same way every time, in microseconds. It cannot hallucinate, and it never gives a “confident but wrong” answer about those rules. It just returns pass or fail with an exact reason why.
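Stripped of any framework, the generate, validate, feed back, retry loop fits in a few lines. This is a minimal sketch under stated assumptions, not the LangGraph version shown later: `generate_with_validation`, `fake_generate`, and `check_timeouts` are hypothetical names, and `fake_generate` is a stand-in for a real LLM call so the example runs without an API key.

```python
from typing import Callable

# Hypothetical signatures, chosen for illustration:
#   generate(request, errors) -> candidate config
#   validate(candidate)       -> (is_valid, list of error strings)
Generate = Callable[[str, list[str]], dict]
Validate = Callable[[dict], tuple[bool, list[str]]]


def generate_with_validation(request: str, generate: Generate,
                             validate: Validate, max_attempts: int = 3) -> dict:
    """Generate, validate, feed the exact errors back, retry."""
    errors: list[str] = []
    for _ in range(max_attempts):
        candidate = generate(request, errors)
        ok, errors = validate(candidate)
        if ok:
            return candidate
    # Fail loudly instead of returning an unvalidated result.
    raise ValueError(f"still invalid after {max_attempts} attempts: {errors}")


def fake_generate(request: str, errors: list[str]) -> dict:
    """Stand-in for an LLM call: wrong at first, fixed once it sees the error."""
    if errors:
        return {"timeout_seconds": 5, "interval_seconds": 30}
    return {"timeout_seconds": 60, "interval_seconds": 30}


def check_timeouts(config: dict) -> tuple[bool, list[str]]:
    """A one-rule validator, just enough to drive the loop."""
    errors = []
    if config["timeout_seconds"] >= config["interval_seconds"]:
        errors.append("timeout_seconds must be < interval_seconds")
    return not errors, errors


result = generate_with_validation("health check config", fake_generate, check_timeouts)
print(result)
```

The only thing the generator ever receives from the loop is the validator’s error list, which is the point: nothing in the feedback path is a model’s opinion.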
What the Validator Actually Catches (and Why LLMs Can’t)
A deterministic validator checks errors at three levels, and each one covers something LLMs are fundamentally bad at:
1. Structural errors
Is the output even valid JSON? Are all required fields present? Are types correct (string vs. integer vs. array)? A JSON Schema validator answers all three in microseconds.
An LLM “reviewing” the same output might glance at the structure and say “looks like valid JSON” without actually parsing it. The validator parses it. There’s no “looks like”. It either passes or it doesn’t.
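To make the difference between “looks like JSON” and “parses as JSON” concrete, here is a small sketch using only the standard library. The helper name `parse_object` and the sample strings are illustrative assumptions, not part of the article’s code.

```python
import json
from typing import Optional


def parse_object(raw: str) -> tuple[Optional[dict], Optional[str]]:
    """Return (parsed, None) on success or (None, error message) on failure."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"Invalid JSON: {e.msg} (line {e.lineno}, column {e.colno})"
    if not isinstance(parsed, dict):
        return None, f"Expected a JSON object, got {type(parsed).__name__}"
    return parsed, None


# The first two look like JSON at a glance; neither one parses.
trailing_comma = '{"service_name": "api", "replicas": 3,}'
single_quotes = "{'service_name': 'api', 'replicas': 3}"
valid = '{"service_name": "api", "replicas": 3}'

for raw in (trailing_comma, single_quotes, valid):
    parsed, error = parse_object(raw)
    print("OK" if error is None else error)
```

The error string carries a line and column, which is exactly the kind of specific feedback a model can act on in the next attempt.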
2. Constraint violations
Is replicas within the allowed range of 1–20? Does service_name match the regex ^[a-z][a-z0-9-]*$? Is memory_limit_mb at least 128?
These are boundary checks. LLMs are notoriously bad at precise numerical comparisons and regex matching. They approximate, while a validator evaluates them exactly.
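The three checks named above can be written as plain comparisons. This is a minimal standard-library sketch; in the full example further down, the same bounds live in the JSON Schema instead. The function name `check_constraints` is a hypothetical one chosen for illustration.

```python
import re

# Same character rules as ^[a-z][a-z0-9-]*$; fullmatch() anchors both ends.
SERVICE_NAME_RE = re.compile(r"[a-z][a-z0-9-]*")


def check_constraints(service_name: str, replicas: int, memory_limit_mb: int) -> list[str]:
    """Return one exact error string per violated boundary."""
    errors = []
    if not SERVICE_NAME_RE.fullmatch(service_name):
        errors.append(f"service_name {service_name!r} must match ^[a-z][a-z0-9-]*$")
    if not 1 <= replicas <= 20:
        errors.append(f"replicas={replicas} must be between 1 and 20")
    if memory_limit_mb < 128:
        errors.append(f"memory_limit_mb={memory_limit_mb} must be >= 128")
    return errors


print(check_constraints("api-gateway", 3, 256))
print(check_constraints("API_Gateway", 21, 64))
```

Each comparison is evaluated exactly, so a value of 21 fails a maximum of 20 every single time, no matter how plausible the rest of the config looks.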
3. Cross-field business rules
This is where reflection fails hardest. Rules like “if replicas > 5, then cpu_limit must be >= 1.0” or “timeout_seconds must be strictly less than interval_seconds” require holding two values in mind and applying a specific logical relationship.
These rules don’t exist in the training data as patterns the model can match against. They’re your rules, specific to your system. The LLM has no reason to “know” them beyond what’s in the prompt, and an instruction buried in a long context is easy for the model to lose track of.
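One way to keep cross-field rules explicit is to store each one as data: a message plus a predicate over the whole config. This is a sketch of that idea, not the article’s implementation; `Rule`, `RULES`, and `check_rules` are hypothetical names.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    message: str                   # fed back to the model verbatim on failure
    holds: Callable[[dict], bool]  # True when the config satisfies the rule


RULES = [
    Rule("replicas > 5 requires cpu_limit >= 1.0",
         lambda c: c["replicas"] <= 5 or c["resources"]["cpu_limit"] >= 1.0),
    Rule("timeout_seconds must be < interval_seconds",
         lambda c: c["health_check"]["timeout_seconds"]
                   < c["health_check"]["interval_seconds"]),
]


def check_rules(config: dict) -> list[str]:
    """Run every rule, every time; return the message of each rule that fails."""
    return [rule.message for rule in RULES if not rule.holds(config)]


config = {
    "replicas": 8,
    "resources": {"cpu_limit": 0.5},
    "health_check": {"timeout_seconds": 30, "interval_seconds": 10},
}
print(check_rules(config))
```

Adding a rule means adding one entry to the list, and there is no way for an existing rule to be skipped because a newer one was more prominent.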
Here’s why the validator wins at all three: it doesn’t reason — it executes. There’s no interpretation, attention window, or chance of skipping a constraint because something earlier in the context was more salient. Every rule runs every time, in order, deterministically.
The LLM’s job, by contrast, is to generate: to produce something that looks right based on patterns. That’s a fundamentally different skill from verifying that every constraint in a spec is satisfied. You wouldn’t ask a novelist to proofread a tax return. Don’t ask a generator to validate its own output.
The Code
Here’s the full pattern in LangGraph: the validator, the nodes, and the graph with conditional routing. The complete runnable example — schema, validator, the loop, and tests — is on GitHub.
First, the schema and the validator — this is your real source of truth:
from jsonschema import validate, ValidationError

DEPLOYMENT_CONFIG_SCHEMA = {
    "type": "object",
    "required": ["service_name", "replicas", "resources", "health_check"],
    "properties": {
        "service_name": {"type": "string", "pattern": "^[a-z][a-z0-9-]*$"},
        "replicas": {"type": "integer", "minimum": 1, "maximum": 20},
        "resources": {
            "type": "object",
            "required": ["cpu_limit", "memory_limit_mb"],
            "properties": {
                "cpu_limit": {"type": "number", "minimum": 0.1, "maximum": 8.0},
                "memory_limit_mb": {"type": "integer", "minimum": 128, "maximum": 16384},
            },
        },
        "health_check": {
            "type": "object",
            "required": ["path", "timeout_seconds", "interval_seconds"],
            "properties": {
                "path": {"type": "string", "pattern": "^/"},
                "timeout_seconds": {"type": "integer", "minimum": 1},
                "interval_seconds": {"type": "integer", "minimum": 5},
            },
        },
    },
}


# The validator: your REAL source of truth. This is the hard part.
def validate_config(config: dict) -> tuple[bool, list[str]]:
    """Schema validation + business rules. This IS your spec."""
    errors = []
    try:
        validate(instance=config, schema=DEPLOYMENT_CONFIG_SCHEMA)
    except ValidationError as e:
        errors.append(f"Schema: {e.message} (at {list(e.path)})")
        return False, errors  # bail early: no point checking rules on broken structure

    # Cross-field rules that JSON Schema can't express
    if config["replicas"] > 5 and config["resources"]["cpu_limit"] < 1.0:
        errors.append(f"replicas={config['replicas']} requires cpu_limit >= 1.0")
    if config["health_check"]["timeout_seconds"] >= config["health_check"]["interval_seconds"]:
        errors.append("timeout_seconds must be < interval_seconds")
    return len(errors) == 0, errors
Now the LangGraph loop that wires generation to that validator:
import json
from typing import TypedDict

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

SYSTEM_PROMPT = ("You generate deployment configs as valid JSON. "
                 "Required fields: service_name, replicas, resources, health_check. "
                 "Follow ALL constraints exactly. Return ONLY the JSON object.")


class AgentState(TypedDict):
    user_request: str
    generated_config: dict
    validation_errors: list[str]
    attempts: int
    is_valid: bool


llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


def generate_node(state: AgentState) -> AgentState:
    messages = [SystemMessage(content=SYSTEM_PROMPT)]
    # .get() so the first run works even if the initial state omits these keys
    previous_errors = state.get("validation_errors") or []
    if previous_errors:
        error_feedback = "\n".join(f"- {e}" for e in previous_errors)
        # Include the rejected output so the model can fix it rather than start over
        messages.append(HumanMessage(content=(
            f"Previous attempt failed validation. Fix these specific errors:\n{error_feedback}\n\n"
            f"Previous output: {json.dumps(state.get('generated_config', {}))}\n\n"
            f"Original request: {state['user_request']}"
        )))
    else:
        messages.append(HumanMessage(content=state["user_request"]))
    response = llm.invoke(messages)
    try:
        config = json.loads(response.content)
    except json.JSONDecodeError:
        config = {}  # unparseable output: an empty dict fails the schema and triggers a retry
    return {**state, "generated_config": config, "attempts": state.get("attempts", 0) + 1}


def validate_node(state: AgentState) -> AgentState:
    is_valid, errors = validate_config(state["generated_config"])
    return {**state, "is_valid": is_valid, "validation_errors": errors}


def should_continue(state: AgentState) -> str:
    if state["is_valid"]:
        return "done"
    if state["attempts"] >= 3:
        return "done"  # give up after 3 attempts; the caller must still check is_valid
    return "retry"


# Wire the graph
graph = StateGraph(AgentState)
graph.add_node("generate", generate_node)
graph.add_node("validate", validate_node)
graph.set_entry_point("generate")
graph.add_edge("generate", "validate")
graph.add_conditional_edges("validate", should_continue, {"retry": "generate", "done": END})
app = graph.compile()
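One detail worth handling at the call site: `should_continue` returns "done" after three attempts even when the config is still invalid, so the final state has to be checked before anything uses it. A minimal sketch of that check follows; `ConfigGenerationError` and `unwrap_result` are hypothetical names, not part of LangGraph or of the code above.

```python
class ConfigGenerationError(RuntimeError):
    """Raised when the loop ends without a config that passed validation."""


def unwrap_result(final_state: dict) -> dict:
    """Return the validated config, or raise with the last validation errors."""
    if not final_state.get("is_valid"):
        errors = "; ".join(final_state.get("validation_errors", []))
        attempts = final_state.get("attempts", 0)
        raise ConfigGenerationError(
            f"no valid config after {attempts} attempts: {errors}"
        )
    return final_state["generated_config"]


# A final state shaped like AgentState, written by hand for the example.
failed_state = {
    "user_request": "deploy the billing service",
    "generated_config": {},
    "validation_errors": ["timeout_seconds must be < interval_seconds"],
    "attempts": 3,
    "is_valid": False,
}

try:
    unwrap_result(failed_state)
except ConfigGenerationError as exc:
    print(exc)
```

With the graph above, a call might look like `unwrap_result(app.invoke(initial_state))`, where `initial_state` is a dict with `user_request` set, `attempts` at 0, and an empty `validation_errors` list.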
Why This Works So Well
The loop above solves the problem cleanly for three reasons.
First, errors are precise. Instead of vague feedback like “the config might have some issues,” the model gets replicas=8 requires cpu_limit >= 1.0. It knows exactly what to fix.
Second, the validator never guesses. Every rule executes every time. There is no attention to dilute and no context window to drift out of. A constraint check on attempt 1 is identical to the same check on attempt 3.
Third, the retry loop is short. Most generations are already correct on attempt 1. When errors occur, attempt 2 usually fixes them, because the model has the exact error message. Attempt 3 is a rare safety net. You’re not chaining five LLM calls to do what one validator call does in microseconds.
When Three Attempts Isn’t Enough
Three attempts handle the vast majority of real-world cases, provided the model understands the domain and the errors are fixable. But there are situations where three attempts still won’t converge.
If your schema is genuinely ambiguous, the model will keep oscillating. It fixes one constraint and breaks another because the rules conflict or are underspecified. This isn’t a retry problem — it’s a schema design problem. The fix is to clarify the spec, not add more retries.
If the model simply doesn’t have the domain knowledge to satisfy the constraints — for example, generating valid cryptographic parameters or precise numerical outputs that require lookup tables — no amount of error feedback will help. You need to either simplify the constraints or pre-compute those values and inject them into the prompt.
The tell is when you see the same errors repeating across attempts, or the model fixing error A while reintroducing error B. That’s not a loop count problem. That’s a signal to step back and look at whether your constraints are actually achievable by generation.
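Both tells can be checked mechanically if you keep the validator’s errors from each attempt. The sketch below assumes you log them as a list of lists (one inner list per attempt); `reappeared` and `diagnose_retries` are hypothetical helpers. It compares exact error strings, so messages that embed a changing value (such as `replicas=8` versus `replicas=9`) would count as different errors.

```python
def reappeared(error: str, attempts: list[frozenset]) -> bool:
    """True if the error was present, then absent, then present again."""
    was_present = was_fixed = False
    for errors in attempts:
        if error in errors:
            if was_fixed:
                return True
            was_present = True
        elif was_present:
            was_fixed = True
    return False


def diagnose_retries(error_history: list[list[str]]) -> str:
    """Classify a failed run from the validation errors of each attempt."""
    attempts = [frozenset(errors) for errors in error_history]
    all_errors = frozenset().union(*attempts)
    if any(reappeared(e, attempts) for e in all_errors):
        return "oscillating"  # fixing one error keeps reintroducing another
    if len(attempts) >= 2 and attempts[-1] == attempts[-2]:
        return "stuck"        # the same errors repeated on consecutive attempts
    return "changing"


history = [
    ["timeout_seconds must be < interval_seconds"],
    ["replicas=8 requires cpu_limit >= 1.0"],
    ["timeout_seconds must be < interval_seconds"],
]
print(diagnose_retries(history))  # oscillating
```

Either verdict points at the spec or at the model’s domain knowledge, which is where the fix belongs, not at the attempt limit.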
When to Use This (and When Not To)
This pattern is the right tool when:
- Your output has a defined schema (JSON, XML, structured text)
- There are hard constraints that must be satisfied: types, ranges, required fields, cross-field rules
- Correctness is binary: either the config is valid or it isn’t
- The cost of a bad output getting through is high (deploys, API calls, data writes)
It’s overkill when:
- Your output is open-ended prose, summaries, or creative text
- “Correct” is subjective or requires human judgment
- You’re doing exploratory research where approximate outputs are fine
For open-ended tasks, reflection genuinely helps — a second pass can improve tone, catch logical gaps, or restructure an argument. Use reflection where it works. Use deterministic validation where correctness is non-negotiable.
The Takeaway
The core idea is simple: use the right tool for the job. LLMs are generators. Validators are verifiers. Don’t ask one to do the other’s job.
When your agent’s output has hard constraints, skip the reflection loop. Write a validator that checks exactly what must be true. Feed the exact errors back to the model. Retry up to three times, and treat a run that is still invalid after the last attempt as a failure. You’ll get near-perfect correctness, no false approvals for any rule the validator encodes, and a system you can actually trust in production: one that fails loudly when it needs to, rather than sailing broken configs through to a 2am incident.