Llama 4 Scout claims a 10-million-token context window. That's roughly 15,000 pages of text, 500,000 lines of code, or the entire Harry Potter series fifteen times over. When Meta announced it in April 2025, the number dominated every headline. What didn't make headlines: on Fiction.LiveBench -- a test that measures whether a model actually understands what it reads -- Scout scored 15.6% at just 128K tokens. Gemini 2.5 Pro scored 90.6% on the same test.
The model that claims to read 10 million tokens can barely comprehend 128 thousand.
This is the story of the context window arms race: where the marketing is enormous, the reality is complicated, and "more context" is almost never the answer you think it is.
The Context Window Landscape in 2026
Most major AI labs now ship at least 1 million tokens of context in their flagship models. The race has produced a clear hierarchy:
| Model | Context Window | Provider | Type | Release |
|---|---|---|---|---|
| LTM-2-Mini | 100M tokens | Magic.dev | Research/private | Aug 2024 |
| Llama 4 Scout | 10M tokens | Meta | Open-weight | Apr 2025 |
| Grok 4.20 | 2M tokens | xAI | Proprietary | Mar 2026 |
| Claude Opus 4.6 | 1M tokens | Anthropic | Proprietary | Mar 2026 |
| GPT-5.4 | 1.05M tokens | OpenAI | Proprietary | Mar 2026 |
| Gemini 3.1 Pro | 1M tokens | Google | Proprietary | Feb 2026 |
| Llama 4 Maverick | 1M tokens | Meta | Open-weight | Apr 2025 |
| Qwen 3.6 Plus | 1M tokens | Alibaba | Proprietary | Mar 2026 |
| Mistral Small 4 | 256K tokens | Mistral | Open-weight | Mar 2026 |
| DeepSeek V3.2 | 128K tokens | DeepSeek | Open-weight | 2025 |
Two years ago, 128K was impressive. Now seven production models offer 1M+, two offer 2M+, and Meta claims 10M. Context size has become table stakes. The question isn't "how big" anymore. It's "how well."
How Meta Actually Built a 10M Context Window
The engineering behind Scout's context window is genuinely clever, even if the end result disappoints.
The core innovation is iRoPE -- Interleaved Rotary Position Embeddings. Here's how it works:
Every four transformer layers follow a repeating pattern:
- Layers 1-3: Standard RoPE positional encoding with chunked attention -- each token can only attend to other tokens within its local 8,192-token chunk. This is cheap. It's basically sliding-window attention.
- Layer 4: No positional encoding (NoPE) with full causal attention -- this layer sees the entire context. All 10 million tokens. Temperature scaling prevents the softmax attention from collapsing to near-zero at extreme lengths.
So 75% of the model's layers only look at 8K tokens at a time. The other 25% periodically sweep the full context. This is how you make 10M tokens computationally feasible -- you cheat, elegantly. Most of the model operates locally. A minority of layers operate globally.
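To make the interleaving concrete, here's a minimal sketch of the layer plan. The 48-layer count is illustrative, not Scout's published depth, and a real implementation swaps attention kernels rather than building a list -- but the 3:1 local-to-global ratio is the point:

```python
def layer_attention_plan(num_layers, chunk_size=8192):
    """Sketch of the iRoPE pattern: three chunked-RoPE layers,
    then one global NoPE layer, repeating."""
    plan = []
    for i in range(1, num_layers + 1):
        if i % 4 == 0:
            # NoPE layer: full causal attention over the entire context
            plan.append(("nope_global", None))
        else:
            # RoPE layer: attention restricted to a local 8K chunk
            plan.append(("rope_chunked", chunk_size))
    return plan

plan = layer_attention_plan(48)
global_layers = sum(1 for kind, _ in plan if kind == "nope_global")
# 12 of 48 layers (25%) sweep the full context; the other 36 see 8K chunks
```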
The model was pre-trained on over 30 trillion tokens across 200 languages, but at a 256K context length -- not 10M. The jump from 256K to 10M happens through length generalization during instruction tuning. The model was never trained on 10M-token sequences. It extrapolates.
This matters. A lot.
What 10M Tokens Actually Looks Like
Before we talk about whether it works, let's ground the number in reality.
| Content Type | Approximate Token Count | Fits in 10M? |
|---|---|---|
| One page of text | ~500 tokens | Yes (20,000 pages) |
| Average novel (80K words) | ~100K tokens | Yes (~100 novels) |
| War and Peace + entire Harry Potter | ~1.5M tokens | Yes, 6x over |
| Full codebase (500K lines) | ~2M tokens | Yes, 5x over |
| 10 hours of audio transcription | ~1M tokens | Yes, 10x over |
| One hour of video at 1fps | ~1M tokens | Yes, 10x over |
| Entire English Wikipedia | ~4B tokens | No |
Source: estimates from Digital Applied and Gemini 1.5 technical report
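A quick back-of-envelope check on these numbers. The 1.25 tokens-per-word ratio is a rough English-prose heuristic chosen to match the table's novel row; real tokenizers vary by a fair margin:

```python
def estimate_tokens(words, tokens_per_word=1.25):
    """Rough token estimate from a word count. The ratio is a common
    English-prose heuristic; actual tokenizer output varies."""
    return int(words * tokens_per_word)

novel = estimate_tokens(80_000)         # ~100K tokens, as in the table
novels_in_window = 10_000_000 // novel  # ~100 novels fit in 10M tokens
```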
The promise is tantalizing. Dump your entire codebase, all your legal contracts, or a semester's worth of lecture transcripts into a single prompt. No chunking, no RAG pipeline, no retrieval step. Just... throw everything in.
The reality is different.
The Performance Gap: Advertised vs. Effective Context
Here's the data that context window marketing doesn't want you to see.
The 60-70% Rule
Research consistently shows that models' effective context -- the range where they maintain reliable performance -- is roughly 60-70% of their advertised maximum. A 200K model becomes unreliable around 130K. A 1M model starts degrading around 600-700K.
For Scout's 10M, applying this rule generously gives an effective range of 6-7M for simple retrieval. For synthesis and reasoning? Much lower.
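As a rule of thumb in code, assuming the 60-70% figure holds for your workload:

```python
def effective_context(advertised_tokens):
    """Apply the rough 60-70% rule: the range where performance
    typically stays reliable, per the research cited above."""
    return round(advertised_tokens * 0.60), round(advertised_tokens * 0.70)

effective_context(200_000)    # (120000, 140000) -- unreliable around 130K
effective_context(1_000_000)  # (600000, 700000)
```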
Context Rot Is Universal
Chroma Research tested 18 frontier models -- including GPT-4.1, Claude Opus 4, and Gemini 2.5 -- and found that every single model exhibited continuous performance degradation as input length increased. Not a cliff. A slope. It starts early and gets worse.
Their most surprising finding: models performed worse on logically structured documents than on randomly shuffled text. Coherent structure triggers recency bias -- the model leans on the last few paragraphs rather than synthesizing the whole document. More structure, paradoxically, means worse long-context performance.
The Lost-in-the-Middle Problem
Stanford and UC Berkeley researchers demonstrated that LLMs exhibit a U-shaped performance curve: they recall information well from the beginning and end of the context, but accuracy drops 30%+ when relevant information is buried in the middle. This isn't a Scout-specific problem. It affects every model. And it gets worse as context grows, because there's more "middle" to get lost in.
Scout's Actual Benchmarks
Meta released almost no long-context evaluations beyond needle-in-a-haystack. Nathan Lambert (Interconnects.ai) noted that the absence of RULER benchmark results and NoLiMa evaluations was conspicuous. NIAH is the easiest possible long-context test -- find one specific sentence in a pile of text. It's necessary but not sufficient.
Independent evaluations told a grim story:
| Test | Scout Score | Competitor Score | Gap |
|---|---|---|---|
| Fiction.LiveBench (128K) | 15.6% | Gemini 2.5 Pro: 90.6% | -75 points |
| GPQA Diamond | 57.2% | Maverick: 69.8% | -12.6 points |
| LiveCodeBench | 32.8% | Maverick: 43.4% | -10.6 points |
| ARC-AGI-2 | 0.0% | (baseline) | Failed |
| Aider Polyglot (coding) | 16% | Qwen 2.5 Coder: higher | Poor |
At 300K tokens, one independent tester reported the model "collapsed completely" -- failing to identify hidden test sentences and instead generating responses from pre-training knowledge rather than analyzing the provided text.
Maverick (Scout's sibling model with a 1M-token window) outperforms Scout on all 11 standard benchmarks. The model with a tenth of the context is better at everything.
The Hardware Reality Check
Let's say you believe in the 10M promise and want to use it. What does it cost?
The KV cache -- the memory structure that stores information about all previous tokens -- grows linearly with context length. At 10M tokens, one analysis calculated the KV cache alone requires approximately 32TB of memory. A single H100 has 80GB. You'd need roughly 400 H100 GPUs just for the cache.
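The linear growth is easy to verify. The sketch below uses a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, fp8 cache) chosen to land near the conservative estimates; Scout's real shapes with a fp16 cache push the totals far higher, toward the 32TB figure:

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """KV cache size: two tensors (K and V), each [kv_heads, head_dim],
    stored per layer per token -- hence linear growth in context length."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

# Hypothetical shape for illustration: 32 layers, 8 KV heads,
# head_dim 128, fp8 cache (1 byte per element)
GIB = 2**30
at_1m = kv_cache_bytes(1_000_000, 32, 8, 128, 1) / GIB    # ~61 GiB
at_10m = kv_cache_bytes(10_000_000, 32, 8, 128, 1) / GIB  # ~610 GiB
```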
More conservative estimates still require massive hardware:
| Context Length | KV Cache Memory | Hardware Needed | Approximate Cost |
|---|---|---|---|
| 32K tokens | ~2 GB | 1x H100 (with model) | $20K |
| 128K tokens | ~8 GB | 1x H100 | $20K |
| 1M tokens | ~64 GB | 8x H100 | $160K |
| 3.6M tokens | ~230 GB | 8x H200 | $280K |
| 10M tokens | ~410-32,000 GB | 7-400x H100 | $140K-$8M |
Source: vLLM Llama 4 blog, hardware analysis, APXML
The range depends on quantization and optimization. But even the optimistic end -- 7 H100s at ~$90/hour in the cloud -- means a single 10M-token query costs real money. Contrast that with RAG, where retrieving 5-10K relevant tokens costs fractions of a penny.
In practice, most API providers cap Scout at 128K to 1M tokens for consistent performance and economic viability.
What You Can Actually Do With It
Enough doom. Here are the use cases where long context genuinely works -- and where it doesn't.
Where Long Context Wins
Entire-codebase analysis. Loading 40-50K lines of code into a 1M context window lets you ask "if I change this function, what breaks?" without building a retrieval pipeline. Sourcegraph testing showed improvements in recall and helpfulness when full codebases fit in context. This is the killer app for long context in 2026.
Bounded document analysis. A set of 50-100 legal contracts. A full quarterly financial report. A patient's complete medical history. When the corpus is fixed, bounded, and fits in context, dumping it all in works. No chunking artifacts. No retrieval misses.
Long-running agent sessions. AI agents that make 20-30+ tool calls accumulate significant context. A 1M window means the agent remembers everything it's done in a session without summarization tricks that lose detail.
Cross-document reasoning. Finding contradictions across five research papers. Comparing clauses across a dozen contracts. Tasks where the answer depends on information scattered across multiple documents and you need the model to see all of them simultaneously.
Where Long Context Fails
Synthesis over massive corpora. Asking "summarize the themes across these 500 documents" sounds like a context window problem. It's not. Beyond ~1-2M tokens, synthesis quality degrades severely. The model can retrieve from 10M tokens (find a specific fact), but it can't reason across 10M tokens (connect ideas from beginning, middle, and end).
Dynamic or frequently updated data. If your knowledge base changes weekly, stuffing it into context every query is expensive and wasteful. RAG with an updated index is orders of magnitude more efficient.
Permission-sensitive data. Long context has no concept of access control. If User A shouldn't see Document B, you can't dump everything into one context. RAG systems can filter by permissions.
Cost-sensitive applications. Processing 10M tokens costs $1.10-$120 per query depending on model and provider. A well-optimized RAG system retrieving the relevant 5-10K tokens costs ~$0.00008. Even at the cheap end, that's a cost gap of more than 13,000x.
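The cost gap follows directly from per-token pricing. The rates here are the article's figures, not a current price sheet:

```python
# Cheap end for a 10M-token query, at an illustrative $0.11 per million
# input tokens (rates vary widely by model and provider)
long_context_cost = (10_000_000 / 1_000_000) * 0.11  # $1.10 per query
rag_cost = 0.00008  # well-optimized retrieval of the relevant 5-10K tokens
ratio = long_context_cost / rag_cost  # >13,000x even at the cheap end
```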
RAG Is Not Dead. Stop Saying RAG Is Dead.
Every time a larger context window ships, someone writes "RAG is dead." It's been declared dead after Gemini 1.5's 1M, after Llama 4's 10M, and probably will be after Magic.dev's 100M.
RAG is not dead. The math doesn't support it.
| Dimension | Long Context (10M) | RAG |
|---|---|---|
| Cost per query | $1.10-$120 | ~$0.00008 |
| Latency | 30-60 seconds at high token counts | ~1 second |
| Compute overhead | 260% overhead vs 2K context at 128K | Minimal per-query |
| Access control | None | Permission-aware filtering |
| Citations | Weak unless post-processed | Built into pipeline |
| Corpus size limit | 10M tokens (~15K pages) | Unlimited |
| Data freshness | Stale after context creation | Updated with index |
The 2026 consensus, and I agree with it: use both. RAG retrieves the most relevant documents from your entire knowledge base. Long context reasons over those retrieved documents. RAG handles scale (millions of documents). Long context handles depth (hundreds of pages). Andrej Karpathy calls this "context engineering" -- the art and science of filling the context window with just the right information for the next step.
The hybrid approach outperforms either method alone. Use RAG to narrow 10 million documents to 100 relevant ones. Load those 100 into a 1M context window. Let the model reason across them. That's the architecture that works.
The Benchmark Controversy You Should Know About
Meta's Llama 4 launch was marred by a credibility crisis that colors everything about the 10M context claim.
Meta submitted a specially crafted, non-public version of Llama 4 Maverick to LM Arena -- one "optimized for conversationality" with verbose, emoji-filled outputs. The public release version was nothing like it. When LM Arena tested the actual release, Maverick dropped from #2 to #32. Scout fell out of the top 100.
LM Arena stated that Meta's interpretation of their policy "did not match what we expect from model providers." A departing Meta AI Chief confirmed: "Results were fudged."
AI commentator Zvi Mowshowitz called it "by far the most negative reaction I have seen to a model release." The LocalLlama subreddit described Scout as "severely underwhelming on all fronts."
When a company manipulates benchmarks on one dimension, it's reasonable to scrutinize their other claims. Including the 10M number.
A Practical Decision Framework
Here's what I'd actually recommend based on the data:
For Codebase Analysis (Your Best Use Case)
Use a 1M context model (Maverick, Claude Opus 4.6, Gemini 3.1 Pro) with your full codebase loaded. Don't bother with Scout's 10M -- Maverick performs better on every coding benchmark at 1M context. If your codebase exceeds 1M tokens, use RAG to select the relevant modules first.
For Document Processing
Under 100 documents: Load them all into 1M context. Simple and effective.
100-1,000 documents: Hybrid approach. RAG retrieves the top 20-50 most relevant. Load those into context for cross-document reasoning.
Over 1,000 documents: Pure RAG. No context window is large enough to hold thousands of documents effectively, and even if it were, the lost-in-the-middle problem would destroy accuracy.
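The framework above reduces to a few lines. The thresholds are the ones from the text, not universal constants:

```python
def choose_strategy(num_docs):
    """Route a corpus to a strategy by document count (thresholds above)."""
    if num_docs < 100:
        return "long-context"  # load everything into a ~1M window
    if num_docs <= 1_000:
        return "hybrid"        # RAG picks the top 20-50; context reasons over them
    return "rag"               # pure retrieval; no window holds thousands of docs
```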
For Choosing a Long-Context Model
| Priority | Best Choice | Why |
|---|---|---|
| Maximum recall quality | Claude Opus 4.6 | 76% on MRCR 8-needle at 1M -- best in class |
| Cost-efficient bulk processing | Gemini 3.1 Pro | $2/MTok input, strong 1M performance |
| Self-hosted / private | Llama 4 Maverick | Apache-like license, 1M context, runs on 8x H100 |
| Budget self-hosted | Qwen 3.5-122B | MoE, fits on 1x H100 quantized, 262K context |
| Pure retrieval at scale | Llama 4 Scout | If you literally only need "find X in 10M tokens" |
Context Engineering Best Practices
- Place critical information at the beginning or end of your context. The middle is where facts go to die.
- Pre-summarize long documents before loading them. A 5-page summary is more useful in context than 500 raw pages.
- Use structured separators -- clear headers, section markers, and document boundaries help the model navigate large contexts.
- Monitor your effective token usage. If you're loading 500K tokens but the model only needs 50K of them, you're wasting 10x on compute.
- Test at your actual context length. A model that benchmarks well at 32K may fall apart at 256K. Evaluate on your data at your scale.
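The first practice -- keeping critical material out of the middle -- can be automated when you know each item's priority. A minimal sketch, alternating ranked items between the front and back of the context:

```python
def order_for_position_bias(items):
    """Place the highest-priority items at the edges of the context,
    where the lost-in-the-middle research says recall is strongest.
    `items` is a list of (priority, text); higher = more critical."""
    ranked = sorted(items, key=lambda it: it[0], reverse=True)
    front, back = [], []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]

docs = [(1, "boilerplate"), (9, "key clause"),
        (5, "background"), (8, "critical fact")]
ordered = order_for_position_bias(docs)
# priorities now run 9, 5, 1, 8 -- the two most critical items sit at the edges
```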
What I Actually Think
The 10M context window is a marketing number. Not a lie -- the architecture supports it -- but a number designed to generate headlines rather than solve problems.
Here's my evidence. The model was trained at 256K context and extrapolates to 10M through length generalization. The KV cache at 10M tokens requires hardware that most organizations don't have. Independent evaluations show the model collapsing at 300K tokens for synthesis tasks. Meta released no serious long-context benchmarks beyond needle-in-a-haystack. And the company was caught manipulating benchmarks on the same model release.
The useful Scout is a 256K-1M retrieval model. In that range, it's a solid open-weight option for finding specific information in large document sets. It's surprisingly good at low-resource language translation -- Swahili, Georgian, languages that other models struggle with. That's a real strength that Meta barely marketed.
But 10M tokens of genuine comprehension? No. Not with current architecture. Not with length generalization from 256K training. The gap between "can technically accept 10M tokens as input" and "can reason meaningfully over 10M tokens" is enormous.
More broadly, I think the context window arms race has reached diminishing returns. We went from 4K (GPT-3.5) to 128K (GPT-4) to 1M (five models in 2026) in three years. Each jump mattered less than the one before. The 4K-to-128K jump was transformative -- suddenly you could fit entire documents instead of paragraphs. The 128K-to-1M jump was useful for codebases and large document sets. The 1M-to-10M jump is... theoretical. I can't name a production use case where 10M tokens of context is the right solution and RAG + 1M isn't.
The future isn't bigger context windows. It's smarter context engineering. Filling the window with the right 100K tokens is worth more than filling it with 10M irrelevant ones.
That's the skill that matters. Not how big your window is. How well you use it.
Sources
- Meta AI -- Llama 4: Open, Multimodal Intelligence
- HuggingFace -- Welcome Llama 4
- Prompt Injection -- Llama 4 Scout: Total Disaster or Misunderstood Workhorse?
- Interconnects.ai -- Llama 4: Did Meta Just Push the Panic Button?
- Digital Applied -- AI Context Window Comparison 2026
- Chroma Research -- Context Rot
- Liu et al. -- Lost in the Middle (MIT Press)
- Stanford -- Lost in the Middle (arXiv)
- Sander Ali Khowaja -- Analysis of Llama 4's 10M Token Claim
- vLLM -- Llama 4 in vLLM
- Composio -- Notes on Llama 4: Hits, Misses, and Disasters
- Unwind AI -- RAG Is Not Dead with Llama 4's 10M Context
- Redis -- RAG vs Large Context Window
- Andrej Karpathy -- Context Engineering (X/Twitter)
- NVIDIA Developer -- Accelerating Inference on Llama 4
- Magic.dev -- 100M Token Context Windows
- Elvex -- Context Length Comparison 2026
- Morph -- LLM Context Window Comparison 2026
- LLM Stats -- Llama 4 Maverick vs Scout
- APXML -- Llama 4 System Requirements
- Andri -- Llama 4 Context: Bigger Isn't Always Better for Legal AI
- TorontoStarts -- Unpacking the Llama 4 Benchmark Controversy
- Neowin -- Unmodified Llama 4 Maverick Ranks #32
- Slashdot -- Meta AI Chief Confirms Benchmark Manipulation
- Google Cloud -- Supercharging AI Coding Assistants with Massive Context
- Gemini 1.5 Technical Report (arXiv)
- Claude Context Windows (Anthropic Docs)
- OpenAI -- GPT-5.4 Model Docs
- Google DeepMind -- Gemini 3.1 Pro Model Card
- xAI Developer Docs -- Models
- AwesomeAgents -- Long Context Benchmarks Leaderboard
- Uplatz -- Llama 4 Scout Technical Analysis