LLM Evals Are Broken: How to Test Your AI App

Last March, we shipped a customer support chatbot that passed every test we threw at it. Unit tests? Green. Integration tests? Green. Manual QA with 50 sample queries? Flawless. Two weeks later, it confidently told a customer they could get a full refund on a non-refundable subscription — a policy that never existed. The bot hallucinated a refund policy, cited a fake internal document number, and the customer screenshotted it before we could fix it. That screenshot cost us $47,000 in honored refunds before legal shut it down.

We didn't have a testing problem. We had an eval problem. And if you're building LLM applications in 2026, you probably have one too.

The Testing Gap Nobody Talks About

Here's the uncomfortable truth: 65% of companies now use generative AI in at least one business function (McKinsey). The number of LLM-powered applications hit 750 million worldwide in 2025. But ask those same companies how they test their AI outputs and you'll get a lot of blank stares.

Traditional software testing doesn't work for LLMs. When you call add(2, 3), you expect 5. Every time. That's the deal. But when you ask an LLM to "summarize this contract's termination clause," there are thousands of valid outputs — and a few catastrophically wrong ones that look identical to the good ones.

35% of LLM users identify reliability and inaccurate output as their primary concern (Hostinger). The average hallucination rate across major models dropped from 38% in 2021 to about 8.2% in 2026, with the best systems hitting rates as low as 0.7% (Lakera). Sounds great until you realize that on complex reasoning and summarization tasks, false or fabricated information still appears in 5-20% of outputs.

And the domain-specific numbers are terrifying. Stanford HAI researchers found LLMs hallucinate 69% to 88% of the time on legal queries. Even premium legal AI tools from LexisNexis and Thomson Reuters hallucinate 17-34% of the time.

That's not a rounding error. That's a lawsuit waiting to happen.

Why Traditional Testing Falls Apart

I've shipped production software for years. I know how to write unit tests, integration tests, and end-to-end tests. None of those skills transferred cleanly to LLM testing. Here's why.

Non-determinism is the default. Same input, different output. Every single time. Temperature 0 helps but doesn't eliminate it. You can't assertEqual your way out of this. Two perfectly valid summaries of the same document might share zero words in common.

The failure mode is confidence. When traditional software fails, it usually throws an error. When an LLM fails, it sounds more confident. It fabricates sources, invents policies, and delivers wrong answers with the same authoritative tone as correct ones. In April 2026, Cursor's AI assistant told users they were restricted to "one device per subscription" — a policy that never existed. Users cancelled subscriptions before the company admitted it was a hallucination.

The output space is infinite. For a function that returns a boolean, you test true and false. For an LLM that generates text, the output space is effectively infinite. You can't enumerate every possible failure. You have to evaluate qualities of the output, not exact matches.

Regressions are invisible. You update your system prompt to handle edge case A, and it silently breaks edge case B. There's no compiler to catch it. No type system. Just vibes and user complaints.

This is why the industry converged on a different word: evals. Not tests. Evals. The distinction matters.

Evals vs Tests: The Mental Model Shift

A test is binary. Pass or fail. An eval is a measurement. It gives you a score on a spectrum, across multiple dimensions, with statistical confidence.

Anthropic's engineering team put it best: "An evaluation is a test for an AI system: give an AI an input, then apply grading logic to its output to measure success." But unlike traditional tests, they recommend grading outcomes, not paths — checking what the agent produced rather than how it got there. Because agents regularly find valid approaches that eval designers didn't anticipate.

Here's the mental model that actually works:

Aspect	Traditional Test	LLM Eval
Output	Binary pass/fail	Score on a spectrum
Determinism	Same input = same output	Same input = different output
Grading	Exact match or assertion	Rubric-based, multi-dimensional
Sample size	1 run per test	Multiple trials, averaged
Failure detection	Immediate	Statistical, over time
Who grades	Code	Code, LLMs, and humans

The eval-driven development workflow looks like this: define what "good" means for your use case, build a dataset of real inputs, run your LLM against them, grade the outputs across multiple dimensions, and track scores over time. When you change a prompt, you don't ask "does it still work?" — you ask "did the scores go up or down, and on which dimensions?"

Anthropic recommends starting with just 20-50 simple tasks drawn from real failures. Early changes have large effect sizes, so small sample sizes work fine initially. That's the beauty of it — you don't need 10,000 test cases to start getting value.

The Three Tools That Actually Work

I've tried most of the eval tools on the market. Three stand out, and they serve different needs.

Promptfoo: The CLI-First Workhorse

Promptfoo is the tool I reach for first. It's an open-source CLI for evaluating and red-teaming LLM apps, with 25.6k GitHub stars and used by 300,000+ developers including OpenAI and Anthropic themselves. OpenAI acquired Promptfoo in March 2026 but kept it MIT licensed.

The killer feature is simplicity. You define test cases in YAML:

prompts:
  - "Summarize the following contract clause: {{clause}}"

providers:
  - openai:gpt-4o
  - anthropic:claude-sonnet-4-20250514

tests:
  - vars:
      clause: "The tenant may terminate this lease with 30 days written notice..."
    assert:
      - type: llm-rubric
        value: "Output mentions 30-day notice requirement"
      - type: llm-rubric
        value: "Output does not fabricate additional terms"
      - type: cost
        threshold: 0.01

Run promptfoo eval and you get a comparison matrix across models, prompts, and assertions. Its red team module scans for 50+ vulnerability types including prompt injection, jailbreaks, PII leaks, and toxic content. The CI/CD GitHub Action means you can block merges that degrade eval scores.

Best for: Solo developers or small teams who want fast iteration. YAML config means non-engineers can contribute test cases. Zero cloud dependency.

DeepEval: Pytest for LLMs

DeepEval is what you reach for when your team already thinks in pytest. It's a Python-native framework with 50+ built-in metrics that drops straight into your existing test suite:

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric

def test_chatbot_no_hallucination():
    test_case = LLMTestCase(
        input="What is our refund policy?",
        actual_output=chatbot.respond("What is our refund policy?"),
        context=["Refunds available within 14 days of purchase for unused items."]
    )
    hallucination = HallucinationMetric(threshold=0.7)
    relevancy = AnswerRelevancyMetric(threshold=0.8)
    assert_test(test_case, [hallucination, relevancy])

The metric library is where DeepEval shines. G-Eval, hallucination detection, answer relevancy, contextual recall, faithfulness, toxicity — 60+ metrics across prompt quality, RAG accuracy, chatbot behavior, and safety. The @observe decorator lets you trace and evaluate individual components within complex LLM systems, not just final outputs.

It also handles the CI/CD story with built-in support for caching, parallelized evaluation, and error ignoring in pipelines.

Best for: Python teams with existing test infrastructure. If you already run pytest in CI, DeepEval is the lowest-friction option.

Braintrust: The Full Lifecycle Platform

Braintrust is a different beast. Where Promptfoo and DeepEval are tools, Braintrust is a platform — and it just raised an $80M Series B at an $800M valuation. Customers include Notion, Replit, Cloudflare, Ramp, Dropbox, and Vercel.

The core pitch: evaluation connected to production. You don't just test prompts before deploy — you trace every LLM call in production, score them automatically, detect drift, and get alerted when quality degrades. Notion reported going from fixing 3 issues per day to 30 after adopting the platform.

import { Eval } from "braintrust";

Eval("customer-support-bot", {
  data: () => loadTestCases(),
  task: async (input) => {
    return await myBot.respond(input.query);
  },
  scores: [Factuality, Relevance, Helpfulness],
});

Braintrust's purpose-built database (Brainstore) claims 80x faster query performance than traditional databases for AI application logs. The playground lets you test prompt changes against real production data before deploying.

Best for: Teams shipping to production who need observability alongside evaluation. If you're past the "experimenting with prompts" phase and into "running LLMs that make real business decisions," this is where you end up.

The Comparison Table

Feature	Promptfoo	DeepEval	Braintrust
Type	CLI tool	Python framework	Full platform
License	MIT (open source)	Apache 2.0 (open source)	Commercial + open SDK
Built-in metrics	Basic + red team	60+ research-backed	Built-in + custom
CI/CD integration	GitHub Action	Pytest plugin	SDK + webhooks
LLM-as-judge	Yes	Yes (G-Eval)	Yes
Production tracing	No	Via Confident AI	Yes (core feature)
Pricing	Free	Free (cloud add-on)	Free tier + paid
Best for	Prompt iteration	Python test suites	Production monitoring

The LLM-as-Judge Trap

Every eval framework offers LLM-as-judge — using one LLM to grade another LLM's output. It's the most popular approach because it scales. Human review costs $20-100 per hour. Automated LLM evaluation processes the same volume for $0.03-15 per million tokens. At 10,000 monthly evaluations, that's a $50,000-100,000 savings versus human review.

But LLM-as-judge has serious failure modes that most teams ignore.

Position bias. GPT-4 shows 40% inconsistency based on the order responses are presented. Put the better answer first, and the judge scores it lower. Swap the order, different result.

Verbosity bias. Longer answers get roughly 15% higher scores regardless of quality. Your model learns to be verbose, not accurate.

Domain blindness. In specialized domains, subject matter experts agree with LLM judges only 64-68% of the time. That's barely better than a coin flip in healthcare and mental health contexts.

Non-determinism compounds. G-Eval — the most popular LLM-as-judge metric — isn't deterministic. Run the same eval twice and you might get different scores. So you're using a non-deterministic system to evaluate a non-deterministic system. The error bars multiply.

The fix isn't to abandon LLM-as-judge. It's to use it correctly:

Run multiple trials. Anthropic recommends multiple trials per task and averaging results. Single-run evals are noise.
Calibrate against humans. Score 100 examples with both your LLM judge and a human expert. If agreement falls below 80%, your rubric needs work.
Isolate dimensions. Don't ask one judge to evaluate "overall quality." Use separate judges for factuality, relevance, tone, and completeness. Isolated judges per dimension produce better results.
Track trends, not absolutes. A single eval score is meaningless. A score that drops 15% after a prompt change is a signal.
Build in partial credit. Anthropic explicitly recommends scoring on a spectrum, not binary pass/fail. A bot that gets the problem right but fumbles the resolution is meaningfully better than one that fails immediately.

Building Your First Eval Pipeline (Practical Guide)

Stop reading blog posts about eval theory. Here's what to actually do, this week.

Step 1: Collect 20 real failures. Go through your logs, support tickets, and Slack messages. Find 20 cases where your LLM app produced bad output. These are your first eval cases. Not synthetic data. Not hypothetical scenarios. Real failures from real users.

Step 2: Define your rubric. For each failure, write down why it was bad. Was it factually wrong? Irrelevant? Tone-deaf? Incomplete? Group these into 3-5 dimensions. Those are your eval criteria.

Step 3: Pick a tool. If you're a solo developer or small team, start with Promptfoo. Write YAML, run from CLI, done. If you're a Python shop with existing pytest infrastructure, use DeepEval. If you're already in production with real traffic and need monitoring, go straight to Braintrust.

Step 4: Write your first eval.

Here's a minimal Promptfoo setup:

# promptfooconfig.yaml
prompts:
  - file://prompts/support-bot.txt

providers:
  - openai:gpt-4o

tests:
  - vars:
      query: "Can I get a refund on my annual plan?"
      context: "Refunds within 14 days only. Annual plans non-refundable after 14 days."
    assert:
      - type: llm-rubric
        value: "Response accurately states refund policy without fabricating terms"
      - type: llm-rubric
        value: "Response does not promise refunds that violate the policy"
      - type: not-contains
        value: "full refund"

Or the DeepEval equivalent:

import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric

@pytest.fixture
def refund_test_case():
    return LLMTestCase(
        input="Can I get a refund on my annual plan?",
        actual_output=bot.respond("Can I get a refund on my annual plan?"),
        retrieval_context=[
            "Refunds within 14 days only. Annual plans non-refundable after 14 days."
        ]
    )

def test_refund_faithfulness(refund_test_case):
    metric = FaithfulnessMetric(threshold=0.8)
    assert_test(refund_test_case, [metric])

def test_refund_relevancy(refund_test_case):
    metric = AnswerRelevancyMetric(threshold=0.8)
    assert_test(refund_test_case, [metric])

Step 5: Run it in CI. With Promptfoo, add the GitHub Action to your pipeline. With DeepEval, it's just pytest tests/evals/ in your CI config. Block merges that drop scores below your threshold.

Step 6: Add cases every week. Every production failure becomes a new eval case. Your eval suite should grow organically from real problems, not be designed upfront. After three months, you'll have 80+ cases covering your actual failure modes — worth more than 1,000 synthetic ones.

RAG Evals: The Special Case

If you're building RAG (Retrieval Augmented Generation) applications, you need specialized metrics. RAGAS is the de facto standard here, and DeepEval also implements the core RAGAS metrics natively.

The four metrics that matter for RAG:

Faithfulness: Does the answer stick to what the retrieved documents actually say? This catches hallucination — the model inventing facts not present in the context. This is the metric that would have caught our refund policy disaster.

Answer Relevancy: Does the answer actually address the question? A factually correct answer about the wrong topic is still a failure.

Context Precision: Did the retriever pull the right documents? If your top-3 retrieved chunks don't contain the answer, even a perfect LLM can't help.

Context Recall: Did the retriever find all the relevant documents? Missing a critical document means missing critical information.

from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
)

metrics = [
    FaithfulnessMetric(threshold=0.8),
    AnswerRelevancyMetric(threshold=0.7),
    ContextualPrecisionMetric(threshold=0.7),
    ContextualRecallMetric(threshold=0.7),
]

One gotcha with RAGAS-style metrics: NaN scores appear when the LLM judge returns invalid JSON during metric calculation, with no graceful fallback. One malformed judge response tanks your entire eval run. Always wrap RAGAS evals in retry logic and handle NaN explicitly.

Red Teaming: The Eval You're Definitely Skipping

Most teams build evals for happy-path quality. Almost nobody evals for adversarial inputs. That's a mistake.

Promptfoo's red team module is the easiest way to start. It scans for 50+ vulnerability types automatically:

promptfoo redteam init
promptfoo redteam run

This generates adversarial prompts that test for prompt injection, jailbreaks, PII leakage, toxic content generation, and tool misuse. It's the equivalent of running a security scanner, but for your LLM's behavior.

At minimum, every production LLM app should be tested for:

Prompt injection: Can a user override your system prompt?
Information leakage: Can a user extract your system prompt, API keys, or internal context?
Policy violation: Can a user get the bot to say something that violates your company's policies?
Hallucination under pressure: Does adversarial framing increase the hallucination rate?

If you're not testing for these, you're waiting for a security researcher (or a journalist) to find them for you.

The Cost of Not Evaluating

In October 2025, Deloitte submitted a report containing hallucinated academic sources and a fake quote from a federal court judgement. They had to issue a revised report and provide a partial refund. The reputational cost was worse than the financial one.

The worldwide spending on generative AI hit $644 billion in 2025, a 76.4% jump from 2024. The enterprise LLM market alone is projected to reach $71.1 billion by 2034. Companies are pouring money into building LLM apps. But the investment in testing those apps is a fraction of a fraction.

The math is simple. Building evals costs engineering time upfront. Not building evals costs customer trust, legal fees, and the kind of PR disaster that no amount of marketing can undo.

What I Actually Think

The LLM eval space is a mess. There are too many tools, too many metrics, and not enough practical guidance. Most teams I talk to are still doing "vibe checks" — manually spot-checking a few outputs and calling it tested.

Here's my honest take: you need exactly one eval tool, 20 test cases drawn from real failures, and the discipline to add a new case every time something breaks in production. That's it. That's the whole strategy.

Don't start with 60 metrics. Start with faithfulness and relevancy. Don't build a custom eval platform. Pick Promptfoo or DeepEval and ship something this week. Don't hire an "AI Quality" team. Train your existing engineers to think in evals instead of tests.

The companies that will win the AI application race aren't the ones with the fanciest models. They're the ones that know — with data, not vibes — whether their AI is actually working. Eval-driven development isn't optional. It's the difference between shipping an AI product and shipping an AI liability.

I think Promptfoo becoming part of OpenAI is a net positive for the ecosystem — it validates that eval tooling matters enough to acquire. I think DeepEval's pytest integration is the right abstraction for most Python teams. I think Braintrust is building the right product but at an $800M valuation, I wonder how they'll monetize without locking up core eval functionality behind a paywall.

And I think the biggest problem isn't tooling. It's culture. Engineers who've spent careers writing deterministic tests have to rewire their brains for probabilistic evaluation. That shift is harder than learning any framework. But it's the shift that matters.

Start with 20 failures. Build from there. Your users are already running evals on your product — they're just doing it in production, with their wallets.

LLM Evals Are Broken — How to Actually Test Your AI App Before Users Do

The Testing Gap Nobody Talks About

Why Traditional Testing Falls Apart

Evals vs Tests: The Mental Model Shift

The Three Tools That Actually Work

Promptfoo: The CLI-First Workhorse

DeepEval: Pytest for LLMs

Braintrust: The Full Lifecycle Platform

The Comparison Table

The LLM-as-Judge Trap

Building Your First Eval Pipeline (Practical Guide)

RAG Evals: The Special Case

Red Teaming: The Eval You're Definitely Skipping

The Cost of Not Evaluating

What I Actually Think

Sources

Enjoyed this article?

The Testing Gap Nobody Talks About

Why Traditional Testing Falls Apart

Evals vs Tests: The Mental Model Shift

The Three Tools That Actually Work

Promptfoo: The CLI-First Workhorse

DeepEval: Pytest for LLMs

Braintrust: The Full Lifecycle Platform

The Comparison Table

The LLM-as-Judge Trap

Building Your First Eval Pipeline (Practical Guide)

RAG Evals: The Special Case

Red Teaming: The Eval You're Definitely Skipping

The Cost of Not Evaluating

What I Actually Think

Sources