Ismat Samadov

© 2026 Ismat Samadov


Llama 4 Scout's 10M Token Context Window: What You Can Actually Do With It

Meta shipped 10M-token context. The model scores 15.6% at 128K tokens. Here's what actually works and what doesn't.

AI · LLM · Opinion · Machine Learning



Llama 4 Scout claims a 10-million-token context window. That's roughly 15,000 pages of text, 500,000 lines of code, or the entire Harry Potter series fifteen times over. When Meta announced it in April 2025, the number dominated every headline. What didn't make headlines: on Fiction.LiveBench -- a test that measures whether a model actually understands what it reads -- Scout scored 15.6% at just 128K tokens. Gemini 2.5 Pro scored 90.6% on the same test.

The model that claims to read 10 million tokens can barely comprehend 128 thousand.

This is the story of the context window arms race: where the marketing is enormous, the reality is complicated, and "more context" is almost never the answer you think it is.


The Context Window Landscape in 2026

Every major AI lab now ships at least 1 million tokens of context. The race has produced a clear hierarchy:

| Model | Context Window | Provider | Type | Release |
| --- | --- | --- | --- | --- |
| LTM-2-Mini | 100M tokens | Magic.dev | Research/private | Aug 2024 |
| Llama 4 Scout | 10M tokens | Meta | Open-weight | Apr 2025 |
| Grok 4.2 | 2M tokens | xAI | Proprietary | Mar 2026 |
| Claude Opus 4.6 | 1M tokens | Anthropic | Proprietary | Mar 2026 |
| GPT-5.4 | 1.05M tokens | OpenAI | Proprietary | Mar 2026 |
| Gemini 3.1 Pro | 1M tokens | Google | Proprietary | Feb 2026 |
| Llama 4 Maverick | 1M tokens | Meta | Open-weight | Apr 2025 |
| Qwen 3.6 Plus | 1M tokens | Alibaba | Proprietary | Mar 2026 |
| Mistral Small 4 | 256K tokens | Mistral | Open-weight | Mar 2026 |
| DeepSeek V3.2 | 128K tokens | DeepSeek | Open-weight | 2025 |

Two years ago, 128K was impressive. Now five models offer 1M+, two offer 2M+, and Meta claims 10M. Context size has become table stakes. The question isn't "how big" anymore. It's "how well."


How Meta Actually Built a 10M Context Window

The engineering behind Scout's context window is genuinely clever, even if the end result disappoints.

The core innovation is iRoPE -- Interleaved Rotary Position Embeddings. Here's how it works:

Every block of four transformer layers follows a repeating pattern:

  • Layers 1-3: Standard RoPE positional encoding with chunked attention -- each token can only attend to other tokens within its local 8,192-token chunk. This is cheap. It's basically sliding-window attention.
  • Layer 4: No positional encoding (NoPE) with full causal attention -- this layer sees the entire context. All 10 million tokens. Temperature scaling prevents the softmax attention from collapsing to near-zero at extreme lengths.

So 75% of the model's layers only look at 8K tokens at a time. The other 25% periodically sweep the full context. This is how you make 10M tokens computationally feasible -- you cheat, elegantly. Most of the model operates locally. A minority of layers operate globally.
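The interleaving described above can be sketched as a layer schedule. This is a minimal sketch of the pattern, not Meta's implementation; the function name and string labels are my own:

```python
CHUNK_SIZE = 8_192  # local attention window for the RoPE layers

def attention_plan(num_layers: int) -> list[str]:
    """Return the attention type per layer: three local layers, then one global."""
    plan = []
    for layer in range(num_layers):
        if (layer + 1) % 4 == 0:
            # Every 4th layer: NoPE + full causal attention over the whole context.
            plan.append("global-nope")
        else:
            # Other layers: RoPE + attention restricted to an 8,192-token chunk.
            plan.append("local-rope")
    return plan

plan = attention_plan(12)
assert plan.count("local-rope") == 9          # 75% of layers stay local
assert plan[3] == plan[7] == plan[11] == "global-nope"
```

Scale the schedule to any depth and the 3:1 ratio holds, which is why the cost of attention stays dominated by the cheap chunked layers.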

The model was pre-trained on over 30 trillion tokens across 200 languages, but at a 256K context length -- not 10M. The jump from 256K to 10M happens through length generalization during instruction tuning. The model was never trained on 10M-token sequences. It extrapolates.

This matters. A lot.


What 10M Tokens Actually Looks Like

Before we talk about whether it works, let's ground the number in reality.

| Content Type | Approximate Token Count | Fits in 10M? |
| --- | --- | --- |
| One page of text | ~500 tokens | Yes (20,000 pages) |
| Average novel (80K words) | ~100K tokens | Yes (~100 novels) |
| War and Peace + entire Harry Potter | ~1.5M tokens | Yes, 6x over |
| Full codebase (500K lines) | ~2M tokens | Yes, 5x over |
| 10 hours of audio transcription | ~1M tokens | Yes, 10x over |
| One hour of video at 1fps | ~1M tokens | Yes, 10x over |
| Entire English Wikipedia | ~4B tokens | No |

Source: estimates from Digital Applied and Gemini 1.5 technical report

The promise is tantalizing. Dump your entire codebase, all your legal contracts, or a semester's worth of lecture transcripts into a single prompt. No chunking, no RAG pipeline, no retrieval step. Just... throw everything in.

The reality is different.


The Performance Gap: Advertised vs. Effective Context

Here's the data that context window marketing doesn't want you to see.

The 60-70% Rule

Research consistently shows that models' effective context -- the range where they maintain reliable performance -- is roughly 60-70% of their advertised maximum. A 200K model becomes unreliable around 130K. A 1M model starts degrading around 600-700K.

For Scout's 10M, applying this rule generously gives an effective range of 6-7M for simple retrieval. For synthesis and reasoning? Much lower.

Context Rot Is Universal

Chroma Research tested 18 frontier models -- including GPT-4.1, Claude Opus 4, and Gemini 2.5 -- and found that every single model exhibited continuous performance degradation as input length increased. Not a cliff. A slope. It starts early and gets worse.

Their most surprising finding: models performed worse on logically structured documents than on randomly shuffled text. Coherent structure triggers recency bias -- the model leans on the last few paragraphs rather than synthesizing the whole document. More structure, paradoxically, means worse long-context performance.

The Lost-in-the-Middle Problem

Stanford and UC Berkeley researchers demonstrated that LLMs exhibit a U-shaped performance curve: they recall information well from the beginning and end of the context, but accuracy drops 30%+ when relevant information is buried in the middle. This isn't a Scout-specific problem. It affects every model. And it gets worse as context grows, because there's more "middle" to get lost in.
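You can probe this on your own model with a simple needle-placement harness: generate filler text, insert a known fact at varying depths, and plot recall against position. A minimal sketch of the prompt construction (the filler, needle, and tokens-per-sentence figure are illustrative; the model call is left to you):

```python
FILLER = "The sky was a flat gray and nothing of note happened. "
NEEDLE = "The secret deployment code is OSPREY-7."

def build_haystack(total_tokens: int, needle_depth: float) -> str:
    """Place NEEDLE at a fractional depth (0.0 = start, 1.0 = end) in filler text."""
    n_sentences = total_tokens // 12  # rough token count per filler sentence
    insert_at = int(n_sentences * needle_depth)
    sentences = [FILLER] * n_sentences
    sentences.insert(insert_at, NEEDLE + " ")
    return "".join(sentences)

# Sweep the depth; send each prompt to the model and check whether it can
# recover "OSPREY-7". A U-shaped accuracy curve is the lost-in-the-middle effect.
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_haystack(total_tokens=8_000, needle_depth=depth)
    assert NEEDLE in prompt
```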

Scout's Actual Benchmarks

Meta released almost no long-context evaluations beyond needle-in-a-haystack. Nathan Lambert (Interconnects.ai) noted that the absence of RULER benchmark results and NoLiMa evaluations was conspicuous. NIAH is the easiest possible long-context test -- find one specific sentence in a pile of text. It's necessary but not sufficient.

Independent evaluations told a grim story:

| Test | Scout Score | Competitor Score | Gap |
| --- | --- | --- | --- |
| Fiction.LiveBench (128K) | 15.6% | Gemini 2.5 Pro: 90.6% | -75 points |
| GPQA Diamond | 57.2% | Maverick: 69.8% | -12.6 points |
| LiveCodeBench | 32.8% | Maverick: 43.4% | -10.6 points |
| ARC-AGI-2 | 0.0% | (baseline) | Failed |
| Aider Polyglot (coding) | 16% | Qwen 2.5 Coder: higher | Poor |
At 300K tokens, one independent tester reported the model "collapsed completely" -- failing to identify hidden test sentences and instead generating responses from pre-training knowledge rather than analyzing the provided text.

Maverick -- the sibling model with a 1M context window -- outperforms Scout on all 11 standard benchmarks. The model with 10x less context is better at everything.


The Hardware Reality Check

Let's say you believe in the 10M promise and want to use it. What does it cost?

The KV cache -- the memory structure that stores information about all previous tokens -- grows linearly with context length. At 10M tokens, one analysis calculated the KV cache alone requires approximately 32TB of memory. A single H100 has 80GB. You'd need roughly 240 H100 GPUs just for the cache.
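The arithmetic is easy to reproduce: KV cache bytes = 2 (K and V) × layers × KV heads × head dimension × bytes per value × tokens. The sketch below uses illustrative, Scout-like parameters (48 layers, 8 KV heads under grouped-query attention, head dimension 128, bf16); these are assumptions for the calculation, not published specs, and varying them is exactly where the wide range in the table below comes from:

```python
def kv_cache_gb(tokens: int, layers: int = 48, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """KV cache size in GB: 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens.
    Default parameters are illustrative Scout-like values, not published specs."""
    return 2 * layers * kv_heads * head_dim * bytes_per_val * tokens / 1e9

print(round(kv_cache_gb(128_000)))     # -> 25 (GB) under these assumptions
print(round(kv_cache_gb(10_000_000)))  # -> 1966 (GB), i.e. ~2 TB with GQA
```

Swap grouped-query attention for full multi-head attention, or bf16 for fp32, and the 10M figure climbs by an order of magnitude -- which is how different analyses arrive at everything from hundreds of gigabytes to tens of terabytes.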

More conservative estimates still require massive hardware:

| Context Length | KV Cache Memory | Hardware Needed | Approximate Cost |
| --- | --- | --- | --- |
| 32K tokens | ~2 GB | 1x H100 (with model) | $20K |
| 128K tokens | ~8 GB | 1x H100 | $20K |
| 1M tokens | ~64 GB | 8x H100 | $160K |
| 3.6M tokens | ~230 GB | 8x H200 | $280K |
| 10M tokens | ~410-32,000 GB | 7-240x H100 | $140K-$4.8M |

Source: vLLM Llama 4 blog, hardware analysis, APXML

The range depends on quantization and optimization. But even the optimistic end -- 7 H100s at ~$90/hour in the cloud -- means a single 10M-token query costs real money. Contrast that with RAG, where retrieving 5-10K relevant tokens costs fractions of a penny.

In practice, most API providers cap Scout at 128K to 1M tokens for consistent performance and economic viability.


What You Can Actually Do With It

Enough doom. Here are the use cases where long context genuinely works -- and where it doesn't.

Where Long Context Wins

Entire-codebase analysis. Loading 40-50K lines of code into a 1M context window lets you ask "if I change this function, what breaks?" without building a retrieval pipeline. Sourcegraph testing showed improvements in recall and helpfulness when full codebases fit in context. This is the killer app for long context in 2026.

Bounded document analysis. A set of 50-100 legal contracts. A full quarterly financial report. A patient's complete medical history. When the corpus is fixed, bounded, and fits in context, dumping it all in works. No chunking artifacts. No retrieval misses.

Long-running agent sessions. AI agents that make 20-30+ tool calls accumulate significant context. A 1M window means the agent remembers everything it's done in a session without summarization tricks that lose detail.

Cross-document reasoning. Finding contradictions across five research papers. Comparing clauses across a dozen contracts. Tasks where the answer depends on information scattered across multiple documents and you need the model to see all of them simultaneously.

Where Long Context Fails

Synthesis over massive corpora. Asking "summarize the themes across these 500 documents" sounds like a context window problem. It's not. Beyond ~1-2M tokens, synthesis quality degrades severely. The model can retrieve from 10M tokens (find a specific fact), but it can't reason across 10M tokens (connect ideas from beginning, middle, and end).

Dynamic or frequently updated data. If your knowledge base changes weekly, stuffing it into context every query is expensive and wasteful. RAG with an updated index is orders of magnitude more efficient.

Permission-sensitive data. Long context has no concept of access control. If User A shouldn't see Document B, you can't dump everything into one context. RAG systems can filter by permissions.

Cost-sensitive applications. Processing 10M tokens costs $1.10-$120 per query depending on model and provider. A well-optimized RAG system retrieving the relevant 5-10K tokens costs ~$0.00008. Even at the cheap end, that's a cost difference of four orders of magnitude.


RAG Is Not Dead. Stop Saying RAG Is Dead.

Every time a larger context window ships, someone writes "RAG is dead." It's been declared dead after Gemini 1.5's 1M, after Llama 4's 10M, and probably will be after Magic.dev's 100M.

RAG is not dead. The math doesn't support it.

| Dimension | Long Context (10M) | RAG |
| --- | --- | --- |
| Cost per query | $1.10-$120 | ~$0.00008 |
| Latency | 30-60 seconds at high token counts | ~1 second |
| Compute overhead | 260% overhead vs 2K context at 128K | Minimal per-query |
| Access control | None | Permission-aware filtering |
| Citations | Weak unless post-processed | Built into pipeline |
| Corpus size limit | 10M tokens (~15K pages) | Unlimited |
| Data freshness | Stale after context creation | Updated with index |

The 2026 consensus, and I agree with it: use both. RAG retrieves the most relevant documents from your entire knowledge base. Long context reasons over those retrieved documents. RAG handles scale (millions of documents). Long context handles depth (hundreds of pages). Andrej Karpathy calls this "context engineering" -- the art and science of filling the context window with just the right information for the next step.

The hybrid approach outperforms either method alone. Use RAG to narrow 10 million documents to 100 relevant ones. Load those 100 into a 1M context window. Let the model reason across them. That's the architecture that works.
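A toy version of that hybrid architecture, using bag-of-words cosine similarity as a stand-in for a real embedding retriever (the function names, scoring scheme, and separator format are illustrative, not from any cited system):

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_then_stuff(query: str, docs: list[str], top_k: int = 100) -> str:
    """RAG step narrows the corpus; the survivors get stuffed into one context."""
    q = Counter(query.lower().split())
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    # Long-context step: concatenate the retrieved docs with clear separators
    # and hand the result to a 1M-context model for cross-document reasoning.
    return "\n\n--- DOCUMENT ---\n\n".join(ranked[:top_k])

docs = [
    "Invoice processing pipeline and payment terms.",
    "Llama 4 Scout context window benchmark results.",
    "Cafeteria menu for the week.",
]
context = retrieve_then_stuff("Scout context benchmark", docs, top_k=1)
assert "benchmark" in context
```

In production you would swap the bag-of-words scorer for a vector database, but the shape of the pipeline -- rank everything, keep the top slice, reason over the slice -- is the same.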


The Benchmark Controversy You Should Know About

Meta's Llama 4 launch was marred by a credibility crisis that colors everything about the 10M context claim.

Meta submitted a specially crafted, non-public version of Llama 4 Maverick to LM Arena -- one "optimized for conversationality" with verbose, emoji-filled outputs. The public release version was nothing like it. When LM Arena tested the actual release, Maverick dropped from #2 to #32. Scout fell out of the top 100.

LM Arena stated that Meta's interpretation of their policy "did not match what we expect from model providers." A departing Meta AI Chief confirmed: "Results were fudged."

AI commentator Zvi Mowshowitz called it "by far the most negative reaction I have seen to a model release." The LocalLlama subreddit described Scout as "severely underwhelming on all fronts."

When a company manipulates benchmarks on one dimension, it's reasonable to scrutinize their other claims. Including the 10M number.


A Practical Decision Framework

Here's what I'd actually recommend based on the data:

For Codebase Analysis (Your Best Use Case)

Use a 1M context model (Maverick, Claude Opus 4.6, Gemini 3.1 Pro) with your full codebase loaded. Don't bother with Scout's 10M -- Maverick performs better on every coding benchmark at 1M context. If your codebase exceeds 1M tokens, use RAG to select the relevant modules first.

For Document Processing

Under 100 documents: Load them all into 1M context. Simple and effective.

100-1,000 documents: Hybrid approach. RAG retrieves the top 20-50 most relevant. Load those into context for cross-document reasoning.

Over 1,000 documents: Pure RAG. No context window is large enough to hold thousands of documents effectively, and even if it were, the lost-in-the-middle problem would destroy accuracy.
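The three tiers above reduce to a small routing function. The thresholds mirror the text and are rules of thumb, not hard limits:

```python
def choose_strategy(num_docs: int) -> str:
    """Map corpus size to the document-processing strategy recommended above."""
    if num_docs < 100:
        return "long-context: load everything into a 1M window"
    if num_docs <= 1_000:
        return "hybrid: RAG retrieves the top 20-50, long context reasons over them"
    return "pure RAG: retrieve a small relevant slice per query"

assert choose_strategy(40).startswith("long-context")
assert choose_strategy(500).startswith("hybrid")
assert choose_strategy(50_000).startswith("pure RAG")
```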

For Choosing a Long-Context Model

| Priority | Best Choice | Why |
| --- | --- | --- |
| Maximum recall quality | Claude Opus 4.6 | 76% on MRCR 8-needle at 1M -- best in class |
| Cost-efficient bulk processing | Gemini 3.1 Pro | $2/MTok input, strong 1M performance |
| Self-hosted / private | Llama 4 Maverick | Apache-like license, 1M context, runs on 8x H100 |
| Budget self-hosted | Qwen 3.5-122B | MoE, fits on 1x H100 quantized, 262K context |
| Pure retrieval at scale | Llama 4 Scout | If you literally only need "find X in 10M tokens" |

Context Engineering Best Practices

  1. Place critical information at the beginning or end of your context. The middle is where facts go to die.
  2. Pre-summarize long documents before loading them. A 5-page summary is more useful in context than 500 raw pages.
  3. Use structured separators -- clear headers, section markers, and document boundaries help the model navigate large contexts.
  4. Monitor your effective token usage. If you're loading 500K tokens but the model only needs 50K of them, you're wasting 10x on compute.
  5. Test at your actual context length. A model that benchmarks well at 32K may fall apart at 256K. Evaluate on your data at your scale.
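Practices 1 and 3 can be folded into a single prompt-assembly helper: pin the instructions and the most critical facts at the edges of the window, and wall off each document with explicit markers. A minimal sketch -- the separator format here is my own, not a standard:

```python
def assemble_context(instructions: str, critical: str, docs: list[str]) -> str:
    """Edges get the important content; the middle gets the bulk documents."""
    parts = [f"## INSTRUCTIONS\n{instructions}", f"## KEY FACTS\n{critical}"]
    for i, doc in enumerate(docs, 1):
        # Explicit document boundaries help the model navigate a large context.
        parts.append(f"## DOCUMENT {i}\n{doc}")
    # Repeat the instructions at the end: recency bias works in our favor here.
    parts.append(f"## REMINDER\n{instructions}")
    return "\n\n".join(parts)

prompt = assemble_context(
    "Summarize the risks.",
    "Contract expires 2026-03-01.",
    ["doc A text", "doc B text"],
)
assert prompt.startswith("## INSTRUCTIONS")
assert prompt.rstrip().endswith("Summarize the risks.")
```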

What I Actually Think

The 10M context window is a marketing number. Not a lie -- the architecture supports it -- but a number designed to generate headlines rather than solve problems.

Here's my evidence. The model was trained at 256K context and extrapolates to 10M through length generalization. The KV cache at 10M tokens requires hardware that most organizations don't have. Independent evaluations show the model collapsing at 300K tokens for synthesis tasks. Meta released no serious long-context benchmarks beyond needle-in-a-haystack. And the company was caught manipulating benchmarks on the same model release.

The useful Scout is a 256K-1M retrieval model. In that range, it's a solid open-weight option for finding specific information in large document sets. It's surprisingly good at low-resource language translation -- Swahili, Georgian, languages that other models struggle with. That's a real strength that Meta barely marketed.

But 10M tokens of genuine comprehension? No. Not with current architecture. Not with length generalization from 256K training. The gap between "can technically accept 10M tokens as input" and "can reason meaningfully over 10M tokens" is enormous.

More broadly, I think the context window arms race has reached diminishing returns. We went from 4K (GPT-3.5) to 128K (GPT-4) to 1M (five models in 2026) in three years. Each jump mattered less than the one before. The 4K-to-128K jump was transformative -- suddenly you could fit entire documents instead of paragraphs. The 128K-to-1M jump was useful for codebases and large document sets. The 1M-to-10M jump is... theoretical. I can't name a production use case where 10M tokens of context is the right solution and RAG + 1M isn't.

The future isn't bigger context windows. It's smarter context engineering. Filling the window with the right 100K tokens is worth more than filling it with 10M irrelevant ones. Andrej Karpathy said it best: context engineering is "the delicate art and science of filling the context window with just the right information for the next step."

That's the skill that matters. Not how big your window is. How well you use it.


Sources

  1. Meta AI -- Llama 4: Open, Multimodal Intelligence
  2. HuggingFace -- Welcome Llama 4
  3. Prompt Injection -- Llama 4 Scout: Total Disaster or Misunderstood Workhorse?
  4. Interconnects.ai -- Llama 4: Did Meta Just Push the Panic Button?
  5. Digital Applied -- AI Context Window Comparison 2026
  6. Chroma Research -- Context Rot
  7. Liu et al. -- Lost in the Middle (MIT Press)
  8. Stanford -- Lost in the Middle (arXiv)
  9. Sander Ali Khowaja -- Analysis of Llama 4's 10M Token Claim
  10. vLLM -- Llama 4 in vLLM
  11. Composio -- Notes on Llama 4: Hits, Misses, and Disasters
  12. Unwind AI -- RAG Is Not Dead with Llama 4's 10M Context
  13. Redis -- RAG vs Large Context Window
  14. Andrej Karpathy -- Context Engineering (X/Twitter)
  15. NVIDIA Developer -- Accelerating Inference on Llama 4
  16. Magic.dev -- 100M Token Context Windows
  17. Elvex -- Context Length Comparison 2026
  18. Morph -- LLM Context Window Comparison 2026
  19. LLM Stats -- Llama 4 Maverick vs Scout
  20. APXML -- Llama 4 System Requirements
  21. Andri -- Llama 4 Context: Bigger Isn't Always Better for Legal AI
  22. TorontoStarts -- Unpacking the Llama 4 Benchmark Controversy
  23. Neowin -- Unmodified Llama 4 Maverick Ranks #32
  24. Slashdot -- Meta AI Chief Confirms Benchmark Manipulation
  25. Google Cloud -- Supercharging AI Coding Assistants with Massive Context
  26. Gemini 1.5 Technical Report (arXiv)
  27. Claude Context Windows (Anthropic Docs)
  28. OpenAI -- GPT-5.4 Model Docs
  29. Google DeepMind -- Gemini 3.1 Pro Model Card
  30. xAI Developer Docs -- Models
  31. AwesomeAgents -- Long Context Benchmarks Leaderboard
  32. Uplatz -- Llama 4 Scout Technical Analysis