Structured Output With Pydantic for LLM Apps

I spent six months parsing LLM output with regex. Six months of re.search(r'"name":\s*"([^"]+)"', response) and praying. Six months of production alerts at 3 AM because GPT decided to wrap its JSON in a markdown code fence, or add a trailing comma, or hallucinate a field called additionalNotes that didn't exist in my schema. My parser handled 17 edge cases. The LLM found the 18th every single week.

Then I rewrote everything with Pydantic and structured outputs. The regex file — 340 lines of brittle pattern matching — became 12 lines of a Pydantic model. The 3 AM alerts stopped. And I realized that half the "AI is unreliable" complaints I'd been hearing were actually "my parsing is unreliable" complaints.

The Old Way Is Dead

Let me be specific about what I mean by "the old way." Before structured outputs, getting reliable data from an LLM looked like this:

import re
import json

prompt = "Extract the user's name, email, and signup date from this text..."
response = call_llm(prompt + "\nRespond in JSON format.")

# Hope and pray
try:
    # Strip markdown code fences the model might add
    cleaned = re.sub(r'^```json?\n?', '', response.strip())
    cleaned = re.sub(r'\n?```$', '', cleaned)
    # Fix trailing commas (GPT loves these)
    cleaned = re.sub(r',\s*}', '}', cleaned)
    cleaned = re.sub(r',\s*]', ']', cleaned)
    data = json.loads(cleaned)
except json.JSONDecodeError:
    # Retry with a sterner prompt
    response = call_llm(prompt + "\nYou MUST respond with valid JSON only. No explanation.")
    data = json.loads(response)  # Still might fail

name = data.get("name", data.get("Name", data.get("user_name", "")))

That last line is the tell. When you're checking three different key variations because you can't trust the model to be consistent, you don't have a structured output — you have a structured hope.

And the reliability numbers back this up. Prompt engineering alone gets you 80-95% valid JSON. Function calling bumps that to 95-99%. But native structured outputs with constrained decoding? 100% schema-valid output, guaranteed. The model literally cannot produce a non-conforming response because invalid tokens are masked before generation.

This isn't an incremental improvement. It's the difference between "usually works" and "always works." And in production, that gap is everything.

How Structured Outputs Actually Work Under the Hood

The magic behind structured outputs is constrained decoding — and it's more elegant than you'd expect.

When an LLM generates text, it produces a probability distribution over its entire vocabulary at each step. Normally, any token can be chosen. With constrained decoding, a logit processor sits between the model's output and the sampling step. It tracks the current position within a target grammar (like a JSON Schema) and masks out tokens that would produce invalid output.

XGrammar, which powers several production implementations in 2025-2026, splits vocabulary tokens into context-independent tokens (~99% of the vocabulary, precomputed into bitmask tables) and context-dependent tokens (~1%, requiring runtime inspection). This means the overhead is near-zero — you get guaranteed structure at practically the same speed as unconstrained generation.

OpenAI's GPT-5.2 uses a Context-Free Grammar engine to enforce 100% compliance. Anthropic's Claude models now support structured outputs as a GA feature with no beta header required. Google Gemini has had it since 2024. The provider ecosystem has converged — structured output is table stakes in 2026.

The Three Approaches (And When to Use Each)

There are three ways to get structured data from LLMs in 2026. They're not interchangeable.

1. Native Provider Structured Outputs

This is the simplest path. You pass a JSON Schema or Pydantic model directly to the API, and the provider handles everything.

OpenAI — the .parse() method:

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class UserProfile(BaseModel):
    name: str
    email: str
    signup_date: str
    plan: str

response = client.responses.parse(
    model="gpt-4o",
    input=[
        {"role": "system", "content": "Extract user profile information from the text."},
        {"role": "user", "content": "John Smith signed up on March 15 with john@example.com on the Pro plan."},
    ],
    text_format=UserProfile,
)

profile = response.output_parsed
print(profile.name)    # "John Smith"
print(profile.email)   # "john@example.com"
print(profile.plan)    # "Pro"

The SDK takes your Pydantic model, converts it to a JSON Schema, sends it to the API, and automatically parses the response back into your model. No regex. No json.loads(). No hope.

Anthropic — structured outputs via output_config:

import anthropic
from pydantic import BaseModel

client = anthropic.Anthropic()

class UserProfile(BaseModel):
    name: str
    email: str
    signup_date: str
    plan: str

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "John Smith signed up on March 15 with john@example.com on the Pro plan."}
    ],
    output_config={
        "format": {
            "type": "json_schema",
            "json_schema": UserProfile.model_json_schema(),
        }
    },
)

Anthropic's implementation moved from beta to GA for Claude Sonnet 4.5, Opus 4.5, and Haiku 4.5. The output_format parameter moved to output_config.format — if you're using old code with the beta header, update it.

2. Tool Use / Function Calling

Tool use (what OpenAI calls "function calling") predates native structured outputs and takes a different approach. Instead of constraining the model's text output, you define tools with parameter schemas, and the model returns a structured tool call.

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "extract_profile",
        "description": "Extract user profile information from text",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "Full name"},
                "email": {"type": "string", "description": "Email address"},
                "signup_date": {"type": "string", "description": "Date of signup in YYYY-MM-DD"},
                "plan": {"type": "string", "enum": ["free", "pro", "enterprise"]},
            },
            "required": ["name", "email", "signup_date", "plan"],
        },
    }
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "tool", "name": "extract_profile"},
    messages=[
        {"role": "user", "content": "John Smith signed up on March 15 with john@example.com on the Pro plan."}
    ],
)

# Access structured data from tool call
tool_input = response.content[0].input
print(tool_input["name"])  # "John Smith"

The difference: tool use was designed for actions — "search the database," "send an email," "create a ticket." Structured output was designed for data extraction — "give me this information in this format." Both produce structured JSON, but the mental model and the API ergonomics are different.

3. Instructor (The Swiss Army Knife)

Instructor is the library that convinced me Pydantic + LLMs was the right abstraction. It has over 3 million monthly downloads and 11k GitHub stars, supports 15+ providers, and does one thing extremely well: it turns any LLM call into a validated Pydantic object with automatic retries.

import instructor
from openai import OpenAI
from pydantic import BaseModel, field_validator

client = instructor.from_openai(OpenAI())

class UserProfile(BaseModel):
    name: str
    email: str
    signup_date: str
    plan: str

    @field_validator("email")
    @classmethod
    def validate_email(cls, v):
        if "@" not in v:
            raise ValueError("Invalid email format")
        return v.lower()

    @field_validator("plan")
    @classmethod
    def validate_plan(cls, v):
        valid_plans = {"free", "pro", "enterprise"}
        if v.lower() not in valid_plans:
            raise ValueError(f"Plan must be one of {valid_plans}")
        return v.lower()

profile = client.chat.completions.create(
    model="gpt-4o",
    response_model=UserProfile,
    messages=[
        {"role": "user", "content": "John Smith signed up on March 15 with john@example.com on the Pro plan."}
    ],
)

print(profile.name)   # "John Smith"
print(profile.email)  # "john@example.com"
print(profile.plan)   # "pro" (lowercased by validator)

The key feature most people miss: automatic retries with validation feedback. When Pydantic validation fails — say the model returns an invalid email — Instructor sends the validation error back to the model as context and asks it to fix the output. This retry loop handles the last 1-5% of failures that even native structured outputs can't prevent (like a model returning "N/A" for a required field).

When to Use What

Scenario	Best Approach	Why
Simple data extraction	Native structured output	Simplest, guaranteed schema compliance
Complex validation rules	Instructor + Pydantic	Validators + retry loop handle edge cases
Agent tool execution	Tool use / function calling	Designed for actions, not just data
Multi-provider app	Instructor	One API across 15+ providers
Streaming structured data	Tool use (Anthropic) or Instructor	Fine-grained streaming support
Ollama / local models	Instructor or Outlines	Native support varies by model

Pydantic Patterns That Actually Matter

Pydantic hit 10 billion downloads in early 2026, growing from 40M monthly downloads in 2023 to over 550M per month. It's the backbone of FastAPI, LangChain, Instructor, and most of the modern Python AI stack. And for LLM structured output, the Pydantic model is the prompt.

Here are the patterns I use constantly.

Pattern 1: Field Order Is Prompt Order

LLMs generate left-to-right. The order of fields in your Pydantic model affects the quality of extraction. Put reasoning fields first so the model thinks before committing:

class SentimentAnalysis(BaseModel):
    """Analyze the sentiment of the given text."""
    reasoning: str  # Model thinks through this FIRST
    sentiment: str   # Then commits to a label
    confidence: float  # Confidence is informed by reasoning

This is the "chain of thought in the schema" pattern. Moving reasoning after sentiment measurably degrades quality because the model has already committed to an answer before justifying it.

Pattern 2: Make Optional Fields Optional

If a field might not exist in the source text, mark it Optional. Forcing required fields when data doesn't exist leads to hallucination — the model will invent data rather than fail validation:

from typing import Optional

class ContactInfo(BaseModel):
    name: str
    email: Optional[str] = None
    phone: Optional[str] = None
    company: Optional[str] = None

Pattern 3: Use Enums for Constrained Choices

When the output must be one of a known set of values, use Python enums or Literal:

from typing import Literal

class TicketClassification(BaseModel):
    priority: Literal["low", "medium", "high", "critical"]
    category: Literal["billing", "technical", "account", "feature_request"]
    summary: str

This is more reliable than asking the model to pick from a list in the prompt. The schema constrains the actual token generation, not just the model's intention.

Pattern 4: Nested Models for Complex Extraction

Don't try to flatten everything into one model. Use composition:

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class OrderItem(BaseModel):
    product_name: str
    quantity: int
    unit_price: float

class OrderExtraction(BaseModel):
    customer_name: str
    shipping_address: Address
    items: list[OrderItem]
    total: float
    order_date: str

Each nested model is validated independently. If the address parsing fails, you get a specific error pointing to the address model, not a vague "JSON parse error."

Pattern 5: One Schema Per Task

Don't try to extract everything in one massive schema. If you need 50+ fields, split into multiple extraction calls. Smaller, focused schemas produce higher-quality results. The model has a finite attention budget — a 50-field schema dilutes its focus across too many targets.

Pattern 6: Descriptions Are Part of the Prompt

Most people leave field descriptions blank. Don't. The field description in your Pydantic model is literally injected into the prompt as part of the JSON Schema. Write them like you're explaining to a junior developer what this field means:

class InvoiceExtraction(BaseModel):
    vendor_name: str = Field(description="Company name of the vendor, not the buyer")
    invoice_number: str = Field(description="Invoice ID/number, usually alphanumeric like INV-2024-001")
    total_amount: float = Field(description="Final total including tax, in USD. Do not include currency symbols.")
    line_items: list[LineItem] = Field(description="Individual items or services billed. Each must have description, quantity, and unit price.")

Good descriptions reduce hallucination because they narrow the model's interpretation. "Company name of the vendor, not the buyer" eliminates a common confusion point that vague field names like company would leave ambiguous. I've seen accuracy improvements of 10-15% just from adding precise field descriptions — no prompt changes needed.

Pattern 7: Streaming Structured Output

For long extractions, you don't want to wait for the entire response. Both Instructor and native provider APIs support streaming structured output, where you get partial objects as they're generated:

import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

for partial in client.chat.completions.create_partial(
    model="gpt-4o",
    response_model=OrderExtraction,
    messages=[{"role": "user", "content": long_document}],
):
    print(f"Customer: {partial.customer_name}")  # Available early
    print(f"Items so far: {len(partial.items or [])}")

Anthropic's fine-grained tool streaming is now GA across all models — you can stream tool parameters without buffering or JSON validation, cutting time-to-first-token significantly for large structured responses.

The Migration Path: From Regex to Structured Output

If you're sitting on a codebase full of regex parsers and json.loads() calls, here's the migration path I followed.

Week 1: Audit your parsing code. Search for re.search, re.findall, json.loads, and any custom parsing functions. List every place you're extracting structured data from LLM responses. I found 23 in our codebase. You'll be surprised how many there are.

Week 2: Define Pydantic models. For each extraction point, create a Pydantic model that represents the expected output. Don't add validators yet — just get the field names and types right. This is your schema library.

Week 3: Switch to Instructor. Replace json.loads(response) calls with client.chat.completions.create(response_model=YourModel). If you're using multiple providers, Instructor handles the abstraction. One by one, endpoint by endpoint.

# Before
response = client.chat.completions.create(model="gpt-4o", messages=[...])
data = json.loads(response.choices[0].message.content)  # Might fail
name = data.get("name", "")  # Might be wrong key

# After
profile = client.chat.completions.create(
    model="gpt-4o",
    response_model=UserProfile,
    messages=[...]
)
name = profile.name  # Type-safe, validated, guaranteed

Week 4: Add validators and monitoring. Now that the basic extraction works, add Pydantic validators for business rules (@field_validator). Add logging to track retry rates — if Instructor is retrying more than 5% of the time, your schema might be too complex or your prompt needs work.

Expected results: In our case, parsing-related production incidents dropped from ~4 per week to zero. Retry rate settled at 2.3%. Total migration time: 3 weeks for a mid-size codebase with 23 extraction points.

What Most Guides Get Wrong

Most structured output guides treat this as a provider feature comparison. "OpenAI does it this way, Anthropic does it that way." That's the wrong framing.

The real insight is that structured output is an architecture pattern, not a provider feature. It changes how you design LLM applications at every layer:

Prompt design changes. Your system prompt no longer needs to beg for JSON. You don't need "You MUST respond in the following format..." paragraphs. The schema handles format enforcement. Your prompt can focus on what to extract and how to think about it, not what shape the output should be.

Error handling changes. Instead of catching JSONDecodeError and retrying blindly, you catch ValidationError from Pydantic and get specific, actionable feedback: "field 'email' is not a valid email address." That error message goes back to the model in the retry loop, and the model fixes the specific problem.

Testing changes. You can test your extraction logic with unit tests on Pydantic models. Feed them known inputs, check the outputs. No mocking LLM responses with carefully crafted JSON strings. The model boundary is clean.

Type safety changes. Your IDE knows the types. Autocomplete works. Refactoring is safe. You're working with Python objects, not dictionaries with string keys that might or might not exist.

Observability changes. When every LLM output is a validated Pydantic object, you can log structured data instead of raw strings. Your monitoring dashboards can track field-level extraction accuracy. You can detect drift — "the model stopped extracting phone numbers correctly last Tuesday" — because you have typed fields to monitor, not blobs of text to eyeball.

Cost changes. Structured output eliminates retry loops caused by malformed JSON. In our old regex-based system, 8% of calls required at least one retry. That's 8% wasted API spend, plus the latency tax on every retried request. With constrained decoding, the retry rate for format issues dropped to zero. Retries still happen for validation failures (wrong data, not wrong format), but those are typically under 3%.

What I Actually Think

Structured output is the single most important pattern in LLM engineering right now. Not RAG. Not fine-tuning. Not agents. Structured output. Because every other pattern depends on it.

RAG systems need to parse retrieved documents into structured formats. Agents need to produce structured tool calls. Multi-step chains need structured intermediate results. If your extraction layer is unreliable, everything built on top of it is unreliable.

I think Instructor is the best tool for most teams. It's provider-agnostic, it works with Pydantic models you probably already have, and the retry-with-validation-feedback loop solves the last-mile reliability problem that even native structured outputs can't handle. The 3 million monthly downloads aren't hype — they reflect genuine utility.

I think native provider structured outputs (OpenAI's .parse(), Anthropic's output_config) are the right choice when you're locked to a single provider and want zero dependencies. They're simpler, faster, and guaranteed at the token level. But the moment you need custom validation or multi-provider support, you're back to Instructor.

I think the biggest mistake teams make is treating structured output as optional. They start with string parsing, plan to "clean it up later," and never do. By the time they have 30 regex parsers scattered across their codebase, the migration feels impossible. Start with structured output on day one. It takes 12 lines of code. There is no good reason to parse LLM output with regex in 2026.

Pydantic AI — the agent framework from the Pydantic team — had 8 million downloads per month in 2025, making it the fastest-growing agent framework by downloads. That tells you where the ecosystem is headed: Pydantic as the interface layer between your application and every LLM. Schema-first development, where the Pydantic model is the contract between human intent and machine output.

The regex parsing era is over. The JSON-and-pray era is over. If you're still doing either, stop. Define a Pydantic model, use Instructor or native structured outputs, and move on to solving actual problems.

Structured Output Changed How I Build LLM Apps — Pydantic, Tool Use, and the End of Regex Parsing

The Old Way Is Dead

How Structured Outputs Actually Work Under the Hood

The Three Approaches (And When to Use Each)

1. Native Provider Structured Outputs

2. Tool Use / Function Calling

3. Instructor (The Swiss Army Knife)

When to Use What

Pydantic Patterns That Actually Matter

Pattern 1: Field Order Is Prompt Order

Pattern 2: Make Optional Fields Optional

Pattern 3: Use Enums for Constrained Choices

Pattern 4: Nested Models for Complex Extraction

Pattern 5: One Schema Per Task

Pattern 6: Descriptions Are Part of the Prompt

Pattern 7: Streaming Structured Output

The Migration Path: From Regex to Structured Output

What Most Guides Get Wrong

What I Actually Think

Sources

Enjoyed this article?

The Old Way Is Dead

How Structured Outputs Actually Work Under the Hood

The Three Approaches (And When to Use Each)

1. Native Provider Structured Outputs

2. Tool Use / Function Calling

3. Instructor (The Swiss Army Knife)

When to Use What

Pydantic Patterns That Actually Matter

Pattern 1: Field Order Is Prompt Order

Pattern 2: Make Optional Fields Optional

Pattern 3: Use Enums for Constrained Choices

Pattern 4: Nested Models for Complex Extraction

Pattern 5: One Schema Per Task

Pattern 6: Descriptions Are Part of the Prompt

Pattern 7: Streaming Structured Output

The Migration Path: From Regex to Structured Output

What Most Guides Get Wrong

What I Actually Think

Sources