RAG Is Not As Simple As They Tell You
RAG tutorials teach the easy 20%. Here are the five production problems they skip — and how to actually solve them.
I built my first RAG system in about two hours. Followed a LangChain tutorial, chunked some documents, shoved them into Chroma, wired up a retrieval chain, and asked it questions. It worked. I was thrilled.
Then I tried it on real documents — messy PDFs with tables, headers, footers, and inconsistent formatting from a decade of scanned bank reports. The system confidently told me that the 2023 Q3 revenue was $4.2 million. The actual number was $42 million. It had parsed a table cell wrong, chunked the relevant row across two different chunks, and the LLM filled in the gap with a hallucinated decimal point.
That's RAG. Easy to demo. Brutal to productionize. And almost every tutorial skips the hard parts.
The RAG market is booming. Grand View Research estimates it at $1.2 billion in 2024, projected to hit $11 billion by 2030 at a 49.1% compound annual growth rate. Gartner forecasts that 40% of enterprise applications will embed task-specific AI agents by 2026, up from under 5% in 2025. Most of those agents will use RAG.
But here's the disconnect. According to LangChain's 2026 State of AI Agents report, 57% of organizations have agents in production, but quality is cited as the top barrier to deployment by 32% of respondents. Not cost. Not infrastructure. Quality.
McKinsey's latest survey shows 71% of organizations report regular use of GenAI, but only 17% attribute more than 5% of EBIT to it. That gap between adoption and value? A lot of it lives in the RAG pipeline — specifically, in the parts that tutorials don't cover.
The problem isn't that RAG doesn't work. It does. The problem is that tutorials teach the 20% of RAG that's easy and skip the 80% that's hard. Let me walk through the parts they skip.
Every RAG tutorial starts the same way: "First, split your documents into chunks." Then they use a recursive text splitter with 512 tokens and 50-token overlap, and move on.
That default works fine for blog posts and clean documentation. It breaks catastrophically on real-world documents.
Here's why. Naive fixed-size chunking doesn't respect the semantic structure of text. It can cut a sentence in half, split a table row across two chunks, or combine the end of one section with the beginning of an unrelated one. The embedding model then creates a vector that represents a Frankenstein chunk — half about revenue, half about employee headcount — and that vector matches poorly against any reasonable query.
This is where it gets interesting. You'd think semantic chunking — splitting text by meaning rather than character count — would be the obvious fix. Recent benchmarks say otherwise.
A peer-reviewed study from Vectara (NAACL 2025) found that on realistic document sets, fixed-size chunking consistently outperformed semantic chunking across document retrieval, evidence retrieval, and answer generation tasks. A February 2026 benchmark of seven strategies across 50 academic papers placed recursive 512-token splitting first at 69% accuracy, with semantic chunking at 54%.
Why? Semantic chunking produced fragments averaging just 43 tokens — clean semantically, but too short to give the LLM enough context to generate correct answers. The chunks were "relevant" but not "useful."
| Strategy | Best For | Accuracy (benchmark) | Trade-off |
|---|---|---|---|
| Recursive 512 tokens, 10-20% overlap | General text, starting point | 69% | Cuts mid-sentence sometimes |
| Document-structure-aware | PDFs with headers/sections | Varies, often best on structured docs | Needs document parsing |
| Parent-child retrieval | Complex multi-section docs | Higher for multi-hop questions | More complex indexing |
| Semantic chunking | Highly varied documents | 54% | Expensive, fragments too small |
My recommendation: start with recursive 512-token chunking and 10-20% overlap. It's boring, it's not clever, and it works better than the alternatives in most benchmarks. Move to structure-aware chunking only when you've measured that retrieval quality is the bottleneck — not the LLM, not the prompt, not the embedding model.
```python
# The boring default that actually works
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note: chunk_size here counts characters; use
# RecursiveCharacterTextSplitter.from_tiktoken_encoder for true token counts.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,  # ~12% overlap
    separators=["\n\n", "\n", ". ", " ", ""],
)

# NOT this (unless you've benchmarked it):
# from langchain_experimental.text_splitter import SemanticChunker
# splitter = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")
```
The second thing tutorials skip: real documents are messy. I mean genuinely, catastrophically messy.
PDFs are the most common enterprise document format and the worst format for RAG. A PDF is a set of instructions for placing text and shapes on a page. It doesn't have paragraphs, headings, or tables in a semantic sense. It has coordinates.
A table that looks perfectly aligned to your eyes is, to a PDF parser, a collection of text fragments positioned at specific x,y coordinates on a page. Extracting that table correctly requires reconstructing the spatial relationships — figuring out which text fragments belong in which cells. Most parsers get this wrong some of the time.
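To make that concrete, here's a toy sketch of the row-reconstruction step: grouping word fragments into visual rows by vertical position. The dict fields (`text`, `x0`, `top`) follow pdfplumber's `extract_words` output, and the tolerance value is illustrative, not tuned:

```python
def group_rows(words, y_tol=3):
    """Group PDF word fragments into table rows by vertical position.
    Each word is a dict with "text", "x0" (left edge), and "top"
    (distance from page top) -- the fields pdfplumber emits."""
    rows = []  # list of (row_top, [words]) pairs
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if rows and abs(w["top"] - rows[-1][0]) <= y_tol:
            rows[-1][1].append(w)  # same visual line: join current row
        else:
            rows.append((w["top"], [w]))  # new row starts
    return [" ".join(w["text"] for w in row) for _, row in rows]
```

Real parsers do far more (column boundaries, merged cells, rotated text), but this is the core spatial inference every one of them has to get right.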
OCR noise is a documented problem: semantic noise from prediction errors and formatting noise from diverse document layouts. Headers and footers that repeat on every page get chunked alongside actual content, polluting retrieval results. Scanned documents add OCR errors on top of everything else.
If you're building RAG over corporate documents, you'll spend more time on document parsing than on any other part of the pipeline. That's not a bug — it's the reality of the problem.
```python
# What tutorials show:
from langchain_community.document_loaders import PyPDFLoader

docs = PyPDFLoader("report.pdf").load()

# What production requires:
# 1. Extract text with a layout-aware parser
# 2. Detect and separately parse tables
# 3. Remove headers, footers, page numbers
# 4. Reconstruct section hierarchy
# 5. Handle multi-column layouts
# 6. Deal with scanned pages (OCR)
# 7. Validate extraction quality

# Tools that actually handle this:
# - Unstructured.io (best open-source option)
# - LlamaParse (good for complex PDFs)
# - Amazon Textract (best for tables)
# - Azure Document Intelligence (enterprise)
```
I've seen teams spend three weeks building a RAG prototype and three months fixing document parsing. The parsing isn't the glamorous part, but it determines whether your system returns "$4.2 million" or "$42 million." Kind of important.
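One of the production steps above, stripping repeated headers and footers, does yield to a simple frequency heuristic: a line that appears on most pages is boilerplate. A minimal sketch; the 0.6 threshold is a starting guess, and page numbers that vary per page ("Page 1", "Page 2") need extra pattern matching on top:

```python
from collections import Counter

def strip_repeated_lines(pages, threshold=0.6):
    """Drop any line that appears on at least `threshold` of pages.
    `pages` is a list of per-page text strings."""
    counts = Counter()
    for page in pages:
        for line in set(page.splitlines()):  # count once per page
            counts[line.strip()] += 1
    cutoff = threshold * len(pages)
    boilerplate = {line for line, n in counts.items() if line and n >= cutoff}
    return [
        "\n".join(l for l in page.splitlines() if l.strip() not in boilerplate)
        for page in pages
    ]
```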
Tables deserve special attention because they're everywhere in enterprise documents and they break every naive RAG approach. A financial report has tables. A technical specification has tables. An HR policy document has tables. And standard chunking treats table content as regular text, destroying the row-column relationships that give the data meaning.
The fix: extract tables separately, serialize them to a structured format (markdown tables or JSON), and chunk each table as an atomic unit with its caption and surrounding context. It's more work upfront, but it prevents the class of errors where your system gives a number from the wrong row or the wrong column — which in my experience accounts for about 40% of factual errors in document-heavy RAG systems.
```python
# Table-aware chunking pattern
from unstructured.partition.auto import partition

def process_document(doc_path):
    # Step 1: extract with layout awareness
    elements = partition(filename=doc_path, strategy="hi_res")
    chunks = []
    for element in elements:
        if element.category == "Table":
            # Keep tables atomic with their title
            chunks.append({
                "content": element.metadata.text_as_html,
                "type": "table",
                "page": element.metadata.page_number,
            })
        else:
            # Regular text goes through normal chunking
            # (text_splitter: e.g. the recursive splitter from earlier)
            chunks.extend(text_splitter.split_text(element.text))
    return chunks
```
Most RAG tutorials use pure vector similarity search: embed the query, find the k nearest chunks, pass them to the LLM. This works for straightforward factual questions. It falls apart in at least three common scenarios.
Exact match queries: A user asks "What was the revenue in Q3 2023?" Vector search returns chunks that are semantically similar to "revenue" and "Q3 2023" — but might miss the exact chunk with that specific number because a chunk about "Q2 2023 revenue growth" has a higher cosine similarity.
Keyword-dependent queries: Legal documents, technical specifications, product codes. "What does clause 4.3.2(b) say?" Vector embeddings don't encode exact clause numbers well.
Negation and comparison queries: "Which products did NOT meet the threshold?" Vector search can't distinguish between chunks that discuss meeting vs. not meeting a threshold — the embeddings are nearly identical.
Multi-hop questions: "How did the company's revenue trend compare to the industry average?" This requires finding revenue data AND industry benchmarks AND connecting them. Single-stage retrieval typically grabs chunks about one or the other, not both. Query decomposition — breaking the question into sub-queries — helps, but it adds complexity and latency.
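The decomposition pattern itself is small; the work lives in the pieces it delegates. A minimal sketch, where the `decompose` and `retrieve` callables stand in for an LLM prompt and your vector store (both deliberately left abstract here):

```python
def decompose_and_retrieve(question, decompose, retrieve, k=5):
    """Multi-hop retrieval: split the question into sub-queries,
    retrieve for each, then merge with deduplication.
    `decompose(question) -> list[str]` and `retrieve(query, k) ->
    list[dict]` are caller-supplied (e.g. an LLM and a vector store)."""
    sub_queries = decompose(question) or [question]
    seen, merged = set(), []
    for sq in sub_queries:
        for chunk in retrieve(sq, k=k):
            if chunk["id"] not in seen:  # skip chunks already collected
                seen.add(chunk["id"])
                merged.append(chunk)
    return merged
```

The latency cost is real: each sub-query is a separate retrieval round trip, plus the LLM call to decompose in the first place.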
I've found that roughly 30-40% of real user questions in enterprise settings fall into these failure categories. If your evaluation only tests straightforward factual lookups, you're measuring the easy cases and missing the ones where your system actually fails.
Hybrid search combines dense vector embeddings with sparse keyword indices — typically BM25 for lexical matching and vector search for semantic matching. You get the best of both: semantic understanding for "What's our customer retention strategy?" and exact matching for "What does section 4.3.2 say?"
After initial retrieval (whether vector, keyword, or hybrid), reranking reorders the results so the most relevant chunks rise to the top before they reach the LLM. Rerankers like Cohere Rerank (a cross-encoder) or ColBERT (late interaction) score query-document pairs far more accurately than the initial retrieval; they're just too slow to run against your entire document collection. So you retrieve broadly (top 20-50 chunks) and rerank precisely (top 3-5).
The improvement from adding reranking is often larger than switching embedding models, changing chunk sizes, or any other single optimization. If you change one thing about your RAG pipeline, add a reranker.
People agonize over which embedding model to use. OpenAI's text-embedding-3-large vs. Cohere's embed-v3 vs. open-source options like BGE or E5. Here's the honest truth: the embedding model matters less than your chunking strategy, less than your retrieval method, and much less than your document parsing quality.
That said, a few practical guidelines. For English-only production systems, OpenAI's text-embedding-3-small is a solid default — good quality, low cost, fast. For multilingual or specialized domains, Cohere embed-v3 or fine-tuned BGE models outperform. For cost-sensitive or privacy-sensitive deployments, open-source models like BGE-large-en-v1.5 run locally and perform surprisingly well.
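If you do benchmark embedding models, do it on your own golden dataset with a retrieval metric rather than a public leaderboard. A minimal recall@k sketch over precomputed embeddings (function and variable names are illustrative, and the embeddings are assumed L2-normalized so that dot product equals cosine similarity):

```python
import numpy as np

def recall_at_k(query_embs, doc_embs, relevant_idx, k=5):
    """Fraction of queries whose known-relevant doc lands in the top k.
    `relevant_idx[i]` is the index of the correct doc for query i."""
    sims = query_embs @ doc_embs.T               # cosine via dot product
    topk = np.argsort(-sims, axis=1)[:, :k]      # best-k doc indices per query
    hits = [relevant_idx[i] in topk[i] for i in range(len(relevant_idx))]
    return sum(hits) / len(hits)
```

Run this once per candidate model on the same question set; if two models are within a few points of each other, the difference almost certainly matters less than your chunking and parsing.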
The mistake I see: teams spend two weeks benchmarking embedding models while their PDF parser is silently dropping every third table. Fix the fundamentals first.
```python
# Naive RAG: single-stage vector retrieval
results = vectorstore.similarity_search(query, k=5)

# Better: hybrid search + reranking
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_cohere import CohereRerank

# Stage 1: broad retrieval from two sources
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
bm25_retriever = BM25Retriever.from_documents(docs, k=20)
ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    weights=[0.5, 0.5],
)

# Stage 2: rerank to find the actual best chunks
reranker = CohereRerank(top_n=5)
results = reranker.compress_documents(ensemble.invoke(query), query)
```
This is the biggest problem, and it's the one I see in almost every RAG project I review.
Teams build their pipeline, eyeball a few queries, say "looks good," and ship it. No systematic evaluation. No baseline metrics. No way to know if changes improve or degrade quality.
RAG systems have dual failure points: retrieval can miss relevant documents, and generation can hallucinate or ignore context entirely. If you're not measuring both, you're flying blind.
RAGAS (Retrieval Augmented Generation Assessment) is the most widely used evaluation framework. The core metrics:
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Faithfulness | Are claims in the answer supported by retrieved context? | Catches hallucinations |
| Answer Relevancy | Does the answer actually address the question? | Catches off-topic responses |
| Context Precision | Are the retrieved chunks actually relevant? | Measures retrieval quality |
| Context Recall | Did retrieval find all the relevant information? | Catches missed context |
Faithfulness measures the hallucination rate of your system: the fraction of claims in the answer that can be supported by the retrieved context. If your faithfulness score is 0.7, that means 30% of the claims in your answers aren't grounded in the source material.
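The arithmetic behind that score is simple enough to sketch. In RAGAS the per-claim verdict comes from an LLM judge; the `judge` callable stands in for it here, so any classifier works for illustration:

```python
def faithfulness_score(answer_claims, context, judge):
    """Faithfulness = supported claims / total claims.
    `judge(claim, context) -> bool` decides whether a single claim
    is entailed by the retrieved context (an LLM call in RAGAS)."""
    if not answer_claims:
        return 0.0
    supported = sum(1 for c in answer_claims if judge(c, context))
    return supported / len(answer_claims)
```

The hard part RAGAS handles for you is upstream of this: splitting a free-form answer into individual checkable claims before any of them can be judged.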
You don't need a massive evaluation suite to start. You need these three things:
```python
# 1. A golden dataset (even 50 question-answer pairs)
eval_set = [
    {
        "question": "What was Q3 2023 revenue?",
        "ground_truth": "$42 million",
        "source_doc": "annual_report_2023.pdf",
    },
    # ... at least 50 examples covering edge cases
]

# 2. RAGAS evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# eval_dataset = eval_set plus your pipeline's answers and retrieved
# contexts, wrapped in the dataset format ragas expects
results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)

# 3. A tracking system (even a spreadsheet works)
# Log: date, pipeline config, faithfulness, relevancy, precision
# This lets you measure the impact of every change
```
The golden dataset is the hard part. Building 50 high-quality question-answer pairs with source references takes a full day of work. Most teams skip this because it's tedious. But without it, every change you make to your pipeline is a coin flip — you don't know if you improved or regressed.
The final mistake I see constantly: using RAG when it's not the right tool.
IBM's framework for choosing between RAG and fine-tuning is straightforward: RAG is for knowledge (facts that change), fine-tuning is for behavior (style, format, policy adherence). But people use RAG for everything.
```yaml
# When to use what
RAG:
  best_for: "Dynamic knowledge, Q&A over documents, fact lookup"
  failure_mode: "Missing or stale facts in responses"
  example: "Customer support over documentation that updates weekly"

Fine-tuning:
  best_for: "Behavior, tone, format consistency, classification"
  failure_mode: "Wrong style, format errors, policy violations"
  example: "Model that always responds in JSON with company voice"

Long context (no RAG):
  best_for: "Small, static knowledge bases under 100K tokens"
  failure_mode: "Over-engineering a simple problem"
  example: "FAQ bot with 200 questions and answers"

Hybrid (RAG + fine-tuning):
  best_for: "Production systems needing both accuracy and consistency"
  failure_mode: "Nothing; this is the 2026 production default"
  example: "Enterprise assistant with brand voice and live data"
```
The 2025 LaRA benchmark (ICML/PMLR) found no silver bullet: the better choice depends on task type, model behavior, context length, and retrieval setup. In 2026, hybrid systems — RAG for facts, fine-tuning for behavior — are the practical default for production-grade quality.
If you're building RAG for production (not a demo), here's the checklist I use. Every item I've learned the hard way.

Document Processing:
- Layout-aware parsing (Unstructured, LlamaParse, Textract), not naive text extraction
- Tables extracted separately and chunked as atomic units with their captions
- Headers, footers, and page numbers stripped before chunking
- Extraction quality spot-checked on your ugliest real documents

Chunking:
- Recursive 512-token splitting with 10-20% overlap as the baseline
- Structure-aware chunking only after measuring that retrieval is the bottleneck

Retrieval:
- Hybrid search (BM25 + vectors), not vector-only
- A reranker between retrieval and generation

Generation:
- Answers grounded in retrieved context, with sources cited
- An explicit "I don't know" path when retrieval comes back empty or weak

Evaluation:
- Golden dataset of at least 50 question-answer pairs with source references
- Faithfulness, relevancy, and precision measured before and after every change

Monitoring:
- Production queries and retrieved chunks logged
- Quality metrics tracked over time so regressions are caught, not discovered
RAG is being sold as a simple pattern: chunk, embed, retrieve, generate. Four steps. Follow this tutorial. Ship it in a weekend.
That's a lie. Not an intentional one — most tutorial authors genuinely believe their approach works. And it does, on their clean demo data. On clean markdown files with consistent formatting and straightforward factual questions, RAG is almost trivially easy.
Production data isn't clean. Enterprise documents are PDFs from 2009 with scanned tables and inconsistent fonts. User questions are ambiguous. The knowledge base has contradictory information across different document versions. And the system needs to say "I don't know" exactly when it should — no more, no less.
The real RAG skill stack isn't "call LangChain." It's document engineering, information retrieval theory, evaluation methodology, and system design. Those are deep disciplines. The fact that you can build a demo in two hours doesn't mean you can build a production system in two weeks.
My strongest take: the most important part of any RAG system is evaluation, and it's the part everyone skips. Without a golden dataset and systematic metrics, you're guessing. You'll spend weeks tweaking chunk sizes and embedding models when the actual problem is that your PDF parser is mangling tables. I've seen this exact pattern at least a dozen times.
The RAG market is going to $11 billion by 2030 for a reason — the pattern works when done right. But "done right" means treating it as a serious engineering problem with testing, monitoring, and iteration. Not a tutorial you copy-paste on a Friday afternoon.
I'll go further: the teams that win at RAG in 2026 won't be the ones with the fanciest retrieval algorithms. They'll be the ones with the best document processing pipelines and the most rigorous evaluation suites. The boring infrastructure work. The stuff that doesn't make for good Twitter threads.
Here's a useful mental model. Think of RAG like search engineering, not like AI research. Google didn't win search by having the best neural network (they didn't have one when they started). They won by having the best crawling, indexing, and ranking infrastructure. RAG is the same. The LLM is impressive, but the engineering around it — how you parse documents, how you index them, how you retrieve and rank results, how you measure quality — that's what separates systems that work from systems that hallucinate $4.2 million instead of $42 million.
Start boring. Measure everything. Fix what's actually broken, not what feels broken. Your users won't know or care whether you used semantic chunking or recursive splitting. They'll care whether the answer was right.