Ismat Samadov

GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro: Same Benchmarks, Different Strengths

All three score ~57 on the Intelligence Index. Claude leads coding quality, Gemini leads math, GPT leads speed. Which to use when.

AI · LLM · Tools · Opinion


On this page

  • The Benchmark Reality: Closer Than You Think (And Further Apart)
  • What the Benchmarks Don't Capture
  • The Pricing Gap Is Bigger Than You Think
  • The Coding Showdown: It's Not What You'd Expect
  • The Multimodal Gap Nobody Talks About
  • Context Windows: The Numbers Lie
  • Speed: The Uncomfortable Trade-Off
  • The Decision Framework: Which Model for Which Job
  • The Multi-Model Setup That Actually Works
  • What Other Articles Get Wrong
  • What I Actually Think
  • Sources


Three frontier AI models launched within 30 days of each other in early 2026. Claude Opus 4.6 on February 5. Gemini 3.1 Pro on February 19. GPT-5.4 on March 5. All three score within 4 points of each other on the Artificial Analysis Intelligence Index --- the composite benchmark that's become the industry's closest thing to an IQ test for LLMs. And yet, using them feels completely different.

I've been running all three daily for the past month --- writing code, analyzing documents, building projects. The benchmarks say they're equals. In practice, they're specialists wearing generalist clothing. Here's what nobody tells you about picking between them.


The Benchmark Reality: Closer Than You Think (And Further Apart)

Let's start with the numbers, because they're genuinely surprising.

| Benchmark | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Intelligence Index | 53 | 57 | 57 |
| SWE-bench Verified | 80.8% | 78.2% | 80.6% |
| GPQA Diamond | 91.3% | 92.8% | 94.3% |
| MATH-500 | 94.1% | 97.2% | 95.1% |
| HumanEval | 90.4% | 95.1% | 89.2% |
| ARC-AGI-2 | 75.2% | 73.3% | 77.1% |
| MMMU Pro (Vision) | 85.1% | 81.2% | 80.5% |

Sources: Artificial Analysis, MindStudio, Google DeepMind Model Card, OpenAI

On SWE-bench Verified --- the coding benchmark everyone obsesses over --- the top three models are within 0.6 percentage points of each other. That's noise, not signal. If you're picking a model based on SWE-bench alone, you're optimizing for the wrong thing.

But look closer. The differences emerge in the categories that benchmarks measure poorly.

Claude Opus 4.6 scores 8.6 out of 10 on prose quality. GPT-5.4 scores 7.4. Gemini 3.1 Pro scores 6.9. That's a 25% gap between best and worst on something most comparison articles completely ignore.

Gemini leads GPQA Diamond at 94.3% --- a benchmark that tests PhD-level science questions. Claude trails at 91.3%. Three percentage points might not sound like much, but on graduate-level physics and chemistry, that gap is real.

GPT-5.4 hits 97.2% on MATH-500 --- the highest score from any model on this benchmark. Nearly perfect math. And on HumanEval (Python code generation), it leads at 95.1% while Gemini trails at 89.2%.

The Intelligence Index says they're tied. The details say they're playing different games.


What the Benchmarks Don't Capture

Here's what I've learned from actually using these models, not just reading comparison tables.

Claude Opus 4.6 remembers constraints. Give it a system prompt with 15+ specific instructions --- formatting rules, tone requirements, technical constraints, edge cases to handle. Claude follows all of them. GPT-5.4 and Gemini start dropping constraints around instruction 10. This isn't in any benchmark, but it determines whether your AI-assisted workflow actually works.

GPT-5.4 is the structured output king. When I need JSON, function calls, or tool-use responses, GPT-5.4 rarely produces malformed output. Claude occasionally wraps things in extra markdown. Gemini sometimes hallucinates fields. For production pipelines where parsing reliability matters, GPT-5.4 is the safest bet.
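Whichever model you pick, production pipelines should parse defensively rather than trust the happy path. A minimal sketch, assuming replies arrive as plain strings and using only the standard library:

```python
import json

def parse_json_reply(raw: str):
    """Parse a model reply that should be JSON, tolerating the common
    failure mode of the payload being wrapped in a markdown code fence.
    Returns the parsed object, or None if the reply is malformed."""
    text = raw.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        # Drop the opening fence (with optional language tag) and, if
        # present, the closing fence.
        if lines and lines[-1].strip() == "```":
            lines = lines[1:-1]
        else:
            lines = lines[1:]
        text = "\n".join(lines)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

On a malformed reply, the caller can retry with a stricter prompt instead of crashing the pipeline.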

Gemini 3.1 Pro processes absurd amounts of context. I fed it an entire codebase --- 600K tokens --- and asked it to identify architectural inconsistencies. It found things that took me two hours to verify manually. Claude handles large context well too (its 1M context went GA on March 13, 2026), but Gemini's 2M token window gives it room that nothing else matches.


The Pricing Gap Is Bigger Than You Think

Let's talk money, because this is where the "they're all equal" narrative falls apart completely.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Long-context surcharge |
|---|---|---|---|
| Gemini 3.1 Pro | $2.00 | $12.00 | 2x above 200K |
| GPT-5.4 | $2.50 | $15.00 | 2x above 272K |
| Claude Opus 4.6 | $5.00 | $25.00 | 2x input / 1.5x output above 200K |

Sources: Anthropic pricing, OpenAI pricing, Google AI pricing

Claude Opus 4.6 costs 2.5x more on input and 2.1x more on output than Gemini 3.1 Pro. That's not a rounding error. On a production workload processing 100 million input tokens and 100 million output tokens per month, the difference is:

  • Gemini 3.1 Pro: $1,400
  • GPT-5.4: $1,750
  • Claude Opus 4.6: $3,000
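Those figures assume 100M input plus 100M output tokens at the base prices in the table above. A quick sketch to reproduce them (the model keys are shorthand labels, not official API identifiers):

```python
# (input $/1M tokens, output $/1M tokens), from the pricing table above
PRICES = {
    "gemini-3.1-pro":  (2.00, 12.00),
    "gpt-5.4":         (2.50, 15.00),
    "claude-opus-4.6": (5.00, 25.00),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Base monthly cost in USD for a workload measured in millions of
    tokens each way (no surcharges, caching, or batch discounts)."""
    in_price, out_price = PRICES[model]
    return input_mtok * in_price + output_mtok * out_price
```

`monthly_cost("claude-opus-4.6", 100, 100)` reproduces the $3,000 figure above.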

And those are base prices. The long-context surcharges make the gap even wider. Claude's input price doubles to $10 per million tokens above 200K context. GPT-5.4's doubles above 272K. Gemini doubles above 200K --- but starts at half the price, so doubled Gemini is still cheaper than standard Claude.

Here's the kicker: all three offer batch API discounts around 50%. And Claude's prompt caching can cut input costs by up to 90%. So the effective cost depends heavily on your access pattern. If you're sending the same system prompt repeatedly (which most production apps do), Claude's caching makes it much more competitive than the sticker price suggests.
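As a sketch of how those discounts compound, assume a flat 90% discount on cached input reads and 50% off for batch, as quoted above. Real provider pricing is more granular (cache writes, for instance, are typically billed separately), so treat this as an approximation:

```python
def effective_input_price(base: float, cached_fraction: float,
                          cache_discount: float = 0.90,
                          batch_discount: float = 0.50,
                          batched: bool = False) -> float:
    """Blended input price per 1M tokens after optional batch pricing and
    prompt caching applied to cached_fraction of input tokens."""
    price = base * (1 - batch_discount) if batched else base
    cached_price = price * (1 - cache_discount)
    return cached_fraction * cached_price + (1 - cached_fraction) * price
```

With an 80% cache-hit rate, Claude's $5.00 input drops to roughly $1.40 per million tokens --- under Gemini's $2.00 sticker price.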


The Coding Showdown: It's Not What You'd Expect

I've seen dozens of "best AI for coding" articles that pick a winner based on a single benchmark. That's lazy. Here's what the data actually shows.

SWE-bench tells one story. SWE-bench Pro tells another.

Claude Opus 4.6 leads SWE-bench Verified at 80.8%. But on SWE-bench Pro --- the harder, more ambiguous variant --- GPT-5.4 scores 57.7% while Claude drops to roughly 45%. That's a massive 13-point swing. On standard bugs, Claude wins. On harder, more ambiguous problems, GPT takes over.

Terminal-Bench paints a different picture. GPT-5.4 leads at 75.1%, with Gemini at 68.5% and Claude at 65.4%. Terminal-Bench tests real command-line interactions --- creating files, running scripts, configuring environments. GPT-5.4's computer-use capabilities give it a clear edge here.

What developers actually report: Claude Code with Opus 4.6 achieves roughly 95% first-pass correctness on standard tasks --- meaning generated code works without modification. Most developers surveyed say they use Sonnet for 70-80% of daily coding and escalate to Opus for the complex 20-30%, saving about 60% on API costs.

The emerging pattern among productive developers: use both GPT-5.4 and Claude Code. GPT for quick prototypes and structured output. Claude for deep refactoring and multi-file reasoning. This "dual-wielding" approach is becoming the norm, not the exception.

| Task | Best Model | Why |
|---|---|---|
| Standard bug fixes | Claude Opus 4.6 | 80.8% SWE-bench Verified, best first-pass accuracy |
| Ambiguous / hard problems | GPT-5.4 | 57.7% SWE-bench Pro, 13 points above Claude |
| Terminal / CLI automation | GPT-5.4 | 75.1% Terminal-Bench, native computer use |
| Multi-file refactoring | Claude Opus 4.6 | Agent Teams, 128K output, deep context reasoning |
| Quick prototyping | GPT-5.4 | Faster, cheaper, good enough |
| Infrastructure scripts | Gemini 3.1 Pro | Fast, cheap, large context for config files |

The Multimodal Gap Nobody Talks About

This one surprised me. The three models have radically different multimodal capabilities, and most comparison articles gloss over it.

| Capability | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| Text input | Yes | Yes | Yes |
| Image input | Yes | Yes | Yes |
| Audio input | Yes | No | Yes (up to 8.4 hours) |
| Video input | Limited | No | Yes (up to 1 hour) |
| PDF processing | Yes | Yes (up to 600 pages) | Yes (up to 900 pages) |
| Computer use | Yes (native) | Yes | No |

Sources: DataStudios comparison, Google DeepMind

Claude Opus 4.6 cannot process audio or video. At all. If your workflow involves transcribing meetings, analyzing video content, or processing podcast audio, Claude is out. Gemini 3.1 Pro is the only model that handles text, image, audio, and video natively in a single API call.

GPT-5.4 introduced something genuinely new: a Computer Use API that lets the model see your screen, move the cursor, click, type, and interact with desktop applications. It scores 75% on OSWorld --- the desktop automation benchmark --- which actually surpasses human performance at 72.4%. That's a first for any general-purpose model.

If you're building anything that needs to see, hear, and interact with the real world, the model choice is constrained before you even look at text benchmarks.


Context Windows: The Numbers Lie

Every model now advertises a 1M+ token context window. Here's why you shouldn't trust those numbers.

In practice, effective context runs at only about 60-70% of the advertised figure: a model claiming 200K tokens typically becomes unreliable around 130K. The "lost in the middle" effect --- where information buried in the center of long contexts is harder to retrieve than at the beginning or end --- still hasn't been fully solved.
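That rule of thumb is trivial to encode; the 65% factor below is just the midpoint of the 60-70% range, not a measured constant:

```python
def effective_context(advertised_tokens: int, reliability: float = 0.65) -> int:
    """Rough effective window: advertised size scaled by the fraction of
    it that stays reliable in practice (rule of thumb, not a benchmark)."""
    return int(advertised_tokens * reliability)
```

`effective_context(200_000)` gives 130,000 tokens, matching the example above.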

Gemini 3.1 Pro advertises 2M tokens, the largest window of the three. But its performance on the MRCR v2 benchmark (which tests long-context recall) drops from 84.9% at 128K to just 26.3% at 1M tokens. That's a 69% quality drop when you actually use the full window.

Claude Opus 4.6? It scores 76% on MRCR v2 at both 128K and 1M tokens. The recall barely degrades as context grows. That's a genuinely important architectural difference that the "2M vs 1M" headline completely obscures.

GPT-5.4's 1M context requires tier 4+ API access --- not available to most developers. And practical quality reportedly degrades above 800K tokens. The default window is 272K.

So the real comparison is:

| Model | Advertised Window | Effective Window | Long-Context Quality |
|---|---|---|---|
| Gemini 3.1 Pro | 2M | Best at 128K, degrades at 1M | 26.3% recall at 1M |
| Claude Opus 4.6 | 1M | Consistent to 1M | 76% recall at 1M |
| GPT-5.4 | 1M (272K default) | Good to ~800K | Not publicly benchmarked at 1M |

The model with the biggest advertised window has the worst recall at full capacity; the model advertising half that size has the best recall at its maximum length. Size isn't everything.


Speed: The Uncomfortable Trade-Off

Here's the thing nobody wants to admit: the smartest models are the slowest.

Gemini 3.1 Pro is the fastest of the three frontier models at approximately 109.5 tokens per second. GPT-5.4 is consistently reported as faster than Claude at equivalent quality levels. Claude Opus 4.6 is the slowest of the three, though exact t/s figures vary by provider.

For context, smaller models like Mercury 2 hit 870 tokens per second. Llama 4 Scout reaches 2,600 t/s. The frontier models operate at a fraction of that speed.
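Decode speed translates directly into wall-clock latency for long replies. A back-of-envelope estimator using the throughput figures quoted here (it ignores time-to-first-token, which adds noticeably to perceived latency):

```python
# Approximate decode throughput in tokens/second, from the figures above.
THROUGHPUT_TPS = {
    "gemini-3.1-pro": 109.5,
    "mercury-2": 870.0,
    "llama-4-scout": 2600.0,
}

def generation_seconds(model: str, output_tokens: int) -> float:
    """Seconds to stream a reply of output_tokens at steady decode speed."""
    return output_tokens / THROUGHPUT_TPS[model]
```

A 500-token reply takes roughly 4.6 seconds on Gemini 3.1 Pro versus about 0.2 seconds on Llama 4 Scout.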

This matters more than benchmarks for certain use cases. If you're building a chatbot that needs sub-second responses, Claude Opus 4.6 is probably the wrong choice. If you need an inline code completion that feels instantaneous, you want GPT-5.4 or even better, a smaller model fine-tuned for speed.

The practical solution? Most developers I've talked to use a tiered approach:

  1. Fast model (GPT-5.4 Mini, Gemini Flash, Claude Sonnet) for 70-80% of interactions
  2. Frontier model for the remaining 20-30% where quality matters
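The savings from that split are simple arithmetic. As an illustration (the one-fifth price ratio below is an assumption, not a quoted figure):

```python
def blended_cost(frontier_price: float, fast_price: float,
                 frontier_share: float) -> float:
    """Average per-token price when frontier_share of traffic hits the
    frontier model and the remainder goes to the fast tier."""
    return frontier_share * frontier_price + (1 - frontier_share) * fast_price
```

If the fast tier costs one fifth as much and handles 75% of traffic, the blend comes to 40% of all-frontier cost --- consistent with the roughly 60% API savings developers report.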

The 73% of engineering teams using AI daily aren't running Opus on every keystroke. They're routing.


The Decision Framework: Which Model for Which Job

Stop looking for "the best model." Start asking "best for what?"

Use Claude Opus 4.6 when:

  • You need deep, multi-file code analysis or refactoring
  • You're processing long documents and need consistent recall at 500K+ tokens
  • Writing quality matters (blog posts, documentation, reports)
  • You have complex system prompts with 15+ constraints
  • You need the largest output window (128K tokens, 300K via batch API)
  • Legal, compliance, or nuanced reasoning tasks (90.2% BigLaw Bench)

Use GPT-5.4 when:

  • You need reliable structured output (JSON, function calls)
  • Speed matters more than maximum quality
  • You're building desktop automation or computer-use workflows
  • You want the cheapest frontier model per token ($2.50/$15)
  • Math-heavy tasks (97.2% MATH-500)
  • Prototyping where "good enough fast" beats "perfect slow"

Use Gemini 3.1 Pro when:

  • You need audio or video processing (only model with native support)
  • You're working with massive documents (2M token window)
  • Cost is the primary constraint ($2/$12 per million tokens)
  • PhD-level science reasoning (94.3% GPQA Diamond)
  • You need web search integrated into responses
  • High-volume batch processing where per-token cost compounds

Use multiple models when:

  • You're a production team. Honestly? If you're not routing between models in 2026, you're leaving performance and money on the table.

The Multi-Model Setup That Actually Works

Here's the pattern I see the most productive teams using.

```yaml
# Model routing configuration (pseudocode)
routing:
  default: gpt-5.4-mini              # Fast, cheap, handles 80% of requests
  code_review: claude-opus-4.6       # Deep analysis, catches edge cases
  code_generation: gpt-5.4           # Fast prototyping, reliable output
  refactoring: claude-opus-4.6       # Multi-file reasoning
  document_analysis: gemini-3.1-pro  # Huge context, cheap
  audio_video: gemini-3.1-pro        # Only option with native support
  writing: claude-opus-4.6           # Best prose quality
  structured_output: gpt-5.4         # Most reliable JSON/function calls
  math_science: gpt-5.4              # 97.2% MATH-500
```
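A dispatcher over a table like this takes only a few lines. A sketch (the model IDs are this article's placeholders, not confirmed API identifiers, and the table is abbreviated):

```python
ROUTING = {
    "default": "gpt-5.4-mini",
    "code_review": "claude-opus-4.6",
    "document_analysis": "gemini-3.1-pro",
    "structured_output": "gpt-5.4",
}

def pick_model(task_type: str) -> str:
    """Return the configured model for a task, falling back to the default."""
    return ROUTING.get(task_type, ROUTING["default"])
```

`pick_model("code_review")` returns the Claude entry; any unrecognized task type falls back to the fast default tier.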

This isn't theoretical. Developer surveys show that the "dual-wielding" pattern is dominant among the most productive engineers. Model routing is becoming a core skill, not a nice-to-have.

The tools to support this are maturing fast. Portkey, OpenRouter, and similar gateways let you route requests to different models based on task type, cost constraints, or latency requirements. Set it up once, save 40-60% on API costs while getting better results.


What Other Articles Get Wrong

Most "GPT vs Claude vs Gemini" articles make three mistakes.

Mistake 1: Treating benchmarks as the whole story. SWE-bench Verified has a 0.6-point spread across the top three models. The difference is statistically meaningless. But the writing quality gap is 25%. The multimodal gap is binary (Claude can't process audio/video at all). The real differences are in the dimensions that benchmarks don't measure well.

Mistake 2: Ignoring effective context versus advertised context. Gemini's 2M window sounds amazing until you learn that recall drops to 26.3% at 1M tokens. Claude's 1M window with 76% recall at full capacity is more useful in practice. The bigger number isn't always the better number.

Mistake 3: Picking a single winner. There is no single best model. The 73% of developers using AI daily aren't loyal to one provider. GPT-5.4 holds 82% overall usage, but Claude sits at 44% for complex tasks. The same developers use both. Model monogamy is dead.


What I Actually Think

I use Claude Opus 4.6 as my primary model. I'll be upfront about that bias.

But it's not because Claude wins on benchmarks --- the margins are too thin for that to be the reason. It's because Claude handles complexity better than anything else I've used. When I give it a 2,000-word system prompt with specific formatting rules, tone requirements, and edge-case handling, it follows all of them. GPT-5.4 follows most of them. Gemini follows some of them.

For code, I think the SWE-bench plateau is telling. The top models are within 0.6 points of each other on standard bug-fixing. The differentiation has moved to harder problems --- and here, GPT-5.4's 57.7% on SWE-bench Pro versus Claude's ~45% is a real gap. If you're working on genuinely novel, ambiguous problems, GPT-5.4 might be the better tool.

Gemini 3.1 Pro is the dark horse. It's the cheapest, it has the best multimodal support, and its 94.3% GPQA Diamond score is genuinely impressive for scientific reasoning. If Google tightens up instruction-following and long-context recall, Gemini becomes a serious threat to both.

Here's my honest prediction: by the end of 2026, the model you use won't matter as much as how you route between them. The winning strategy isn't picking the best model. It's building a system that picks the right model for each task automatically. The developers who figure this out first will have a compounding advantage over everyone still arguing about which model is "the best."

The benchmark convergence isn't a sign that AI progress is slowing. It's a sign that the era of one model to rule them all is over. The future is multi-model, task-specific, and cost-optimized. The sooner you accept that, the better your results will be.


Sources

  1. Artificial Analysis --- Intelligence Index
  2. MindStudio --- GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro Benchmarks
  3. Google DeepMind --- Gemini 3.1 Pro Model Card
  4. OpenAI --- Introducing GPT-5.4
  5. DigitalApplied --- Best Frontier Model Comparison
  6. Claude5 Hub --- Opus 4.6 Deep Dive
  7. Vals.ai --- SWE-bench Leaderboard
  8. Faros.ai --- Best AI Model for Coding 2026
  9. Evolink --- Developer Comparison 2026
  10. NxCode --- GPT-5.4 Complete Guide
  11. Anthropic --- Claude API Pricing
  12. OpenAI --- API Pricing
  13. Google AI --- Gemini API Pricing
  14. DataStudios --- Full Report and Comparison
  15. Elvex --- Context Length Comparison 2026
  16. SiliconANGLE --- Claude Opus 4.6 1M Context
  17. Artificial Analysis --- LLM Leaderboard
  18. Developer Survey 2026 --- AI Coding Tool Adoption
  19. Chandler Nguyen --- Dual-Wielding AI Coding Tools
  20. Portkey --- GPT-5.4 vs Claude Opus 4.6 Guide
  21. Anthropic --- Claude Models Overview
  22. NxCode --- GPT-5.4 vs Claude Opus 4.6 Coding Comparison