
The METR Study: AI Tools Made Experienced Developers 19% Slower

A rigorous RCT found AI coding tools slowed down experienced developers by 19%. The developers themselves believed they were 20% faster. The perception-reality gap changes everything.





In July 2025, a small research organization called METR published a 51-page randomized controlled trial that detonated across the tech industry. The headline: experienced open-source developers using AI tools completed tasks 19% slower than without them. Not junior developers. Not unfamiliar codebases. Seasoned maintainers, working on their own repositories, with the best AI tools money could buy.

Before the study, those same developers predicted AI would make them 24% faster. After the study, they still believed AI had made them 20% faster. The actual measurement said the opposite.

This is the most important study about AI-assisted development published to date. Not because it proves AI tools are useless -- they aren't. But because it reveals something far more uncomfortable: the gap between how productive we feel and how productive we are might be the defining challenge of the AI coding era.


The Study Design

METR (Model Evaluation & Threat Research) ran a pre-registered randomized controlled trial from February to June 2025. The design was rigorous enough to survive peer scrutiny:

Participants: 16 experienced open-source developers from large, established repositories. These weren't random GitHub users. The repositories averaged 22,000+ stars and over 1 million lines of code. Developers averaged 5 years of experience and 1,500 commits on their respective projects. They had extensive prior experience with LLM-based tools -- dozens to hundreds of hours of prompting.

Tasks: 246 real issues across those repositories -- bug fixes, features, and refactors averaging roughly 2 hours each. Critically, these were tasks the developers themselves identified as valuable. No synthetic benchmarks. No toy problems. Real work.

Method: Each task was randomly assigned as AI-allowed or AI-disallowed. When AI was allowed, developers primarily used Cursor Pro with Claude 3.5/3.7 Sonnet -- the frontier tools at the time. Developers recorded their screens and self-reported implementation times.

Compensation: $150 per hour. METR wasn't skimping on incentives, and developers had every reason to perform well in both conditions.

The result: allowing AI increased task completion time by 19%, with a 95% confidence interval of +2% to +39%. Multiple alternative statistical estimators were tested. The result held.
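For intuition about what that interval means, here is a minimal bootstrap sketch. The per-task times are invented for illustration, and METR's actual estimators were regression-based rather than this simple percentile bootstrap:

```python
import random

# Hypothetical per-task completion times in hours -- invented for
# illustration, NOT METR's data.
ai_times = [2.5, 1.8, 3.1, 2.2, 4.0, 2.9, 1.6, 3.4]
no_ai_times = [2.1, 1.5, 2.8, 1.9, 3.2, 2.4, 1.7, 2.6]

def mean(xs):
    return sum(xs) / len(xs)

def slowdown(ai, base):
    """Relative change in completion time when AI is allowed."""
    return mean(ai) / mean(base) - 1.0

def bootstrap_ci(ai, base, n=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for the slowdown."""
    stats = sorted(
        slowdown(random.choices(ai, k=len(ai)),
                 random.choices(base, k=len(base)))
        for _ in range(n)
    )
    return stats[int(n * alpha / 2)], stats[int(n * (1 - alpha / 2))]

lo, hi = bootstrap_ci(ai_times, no_ai_times)
print(f"point estimate: {slowdown(ai_times, no_ai_times):+.0%}")
print(f"95% CI: {lo:+.0%} to {hi:+.0%}")
```

With METR's real data the lower bound landed at +2% -- barely above zero, which is exactly the situation where the choice of estimator matters, and why the authors cross-checked the result with several.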


The Perception-Reality Gap

Here's the number that should keep every engineering leader awake at night:

| Prediction source | Expected AI speedup | Actual result |
|---|---|---|
| Developers (before study) | +24% faster | -19% slower |
| Developers (after study) | +20% faster | -19% slower |
| Expert economists | +39% faster | -19% slower |
| ML researchers | +38% faster | -19% slower |

Everyone -- developers, economists, AI researchers -- predicted substantial speedups. The developers who lived through the study still believed AI had helped them, even after seeing their own screen recordings. The gap between perceived performance (+20%) and actual performance (-19%) represents a 39-percentage-point disconnect in self-assessment accuracy.

Sean Goedecke, an engineer at GitHub, proposed an explanation: "Coding with AI requires less cognitive effort." Watching code generate feels productive. Less typing creates a sensation of less work. The reduced fatigue masks the time spent reviewing, debugging, and correcting AI output. You feel faster because you're less tired, even though the clock says otherwise.

This isn't a minor calibration error. This is the kind of systematic bias that corrupts every "AI saved us X hours" metric in every quarterly business review. If developers can't accurately assess their own productivity even when explicitly asked, what are self-reported survey numbers actually measuring?


Five Factors Behind the Slowdown

METR investigated 20 potential explanations and identified five that likely contributed:

1. Over-Optimism About AI Usefulness

Developers had unrealistic expectations about where AI would help. They used AI on tasks that would have been faster by hand, burning time on prompt engineering, reviewing output, and debugging generated code that a few minutes of focused manual coding would have resolved.

2. Extreme Developer Familiarity

With an average of 5 years and 1,500 commits on their repos, these developers were already operating near peak efficiency. They knew every convention, every architectural quirk, every shortcut. AI had almost no room to add value when the developer could already write the code from muscle memory.

3. Large, Complex Codebases

Repositories with 1M+ lines of code, rigorous documentation requirements, comprehensive test suites, and strict linting rules created environments where AI consistently missed implicit context. The AI could generate code that looked correct but violated unstated conventions that any experienced contributor would know instinctively.

4. Low Code Acceptance Rates

Developers accepted fewer than 44% of AI-generated suggestions. More than half of what the AI produced was discarded after review. The time spent generating, reading, evaluating, and rejecting suggestions was pure overhead -- a cost model sketched after this list makes the math concrete.

5. Missing Tacit Knowledge

One participant described AI as operating "like a new contributor who doesn't yet understand the codebase." The AI lacked the contextual understanding of why certain patterns existed, which trade-offs had been made, and what the implicit quality bar looked like. It could produce code. It couldn't produce code that belonged in the project.
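Factors 1 and 4 compound, and a back-of-the-envelope expected-time model (the one promised in factor 4 above) shows how. Every minute value below is an invented assumption; only the sub-44% acceptance rate comes from the study:

```python
def expected_ai_time(t_prompt, t_review, t_manual, p_accept, t_fix):
    """Expected minutes per task when reaching for the AI first.

    Accepted suggestions still need some fixing (t_fix); rejected
    ones fall back to doing the work by hand after the overhead.
    """
    overhead = t_prompt + t_review
    return overhead + p_accept * t_fix + (1 - p_accept) * t_manual

# Illustrative minutes; only the <44% acceptance echoes the study.
t_manual = 30
ai_first = expected_ai_time(t_prompt=5, t_review=8,
                            t_manual=t_manual, p_accept=0.44, t_fix=10)
print(f"manual: {t_manual} min, AI-first: {ai_first:.1f} min")
# manual: 30 min, AI-first: 34.2 min
```

Under these assumptions the AI-first path loses even though an accepted suggestion cuts the work by two-thirds; acceptance would need to reach roughly 65% before the two paths break even. Factor 1 is, in effect, developers mis-estimating these inputs before they prompt.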


Inside the Study: One Developer's Account

Domenic Denicola, maintainer of jsdom -- a JavaScript implementation of web standards with over 1 million lines of code -- published a detailed account of his participation. He completed 19 tasks over four weeks: 9 with AI, 10 without.

His primary tools were Cursor's agent mode, Claude Sonnet 3.5/3.7, and Gemini 2.5 Pro. His findings were granular and damning:

  • Hallucinated constants. The model fabricated a CSSRule.LAYER_STATEMENT_RULE constant that doesn't exist in any specification.
  • Missed simple conventions. Despite explicit instructions, models repeatedly failed to include issue links in test headers -- a basic repository convention.
  • Poor file navigation. AI struggled to find relevant files in a large directory structure, a task Denicola could do from memory in seconds.
  • Linter struggles. The AI got stuck on simple formatting fixes, making repetitive single-line edits when a developer would have fixed the pattern in one pass.
  • Outdated training data. Models relied on stale knowledge rather than reading the actual specifications Denicola pointed them to.

His one positive finding: test generation. After careful prompt refinement, the AI could produce "three or four tests in a row with no changes needed." But getting to that point took substantial setup time, and the total time savings were marginal.

His conclusion was blunt: the problem wasn't unfamiliarity with the tools. He'd used them extensively. The problem was fundamental AI limitations on complex, established codebases.


The Counter-Evidence: Studies Showing AI Helps

The METR study doesn't exist in a vacuum. Multiple studies show significant productivity gains:

| Study | Year | Sample | Finding | Context |
|---|---|---|---|---|
| Google Internal RCT | 2024 | 96 engineers | 21% faster | Unfamiliar enterprise codebase |
| Microsoft/Accenture RCT | 2024 | 4,867 devs | 26% more tasks | Copilot code completions |
| GitHub Research | 2022-23 | 4,800 devs | 55% faster | Controlled coding task |
| Harvard/BCG Study | 2023 | 758 consultants | 12-25% faster inside AI frontier | Knowledge work tasks |
| DORA Report | 2025 | ~5,000 pros | 80%+ report gains | Self-reported survey |

How do you reconcile "19% slower" with "55% faster"?

The answer is context. Every study that shows large gains shares specific conditions that differ from METR:

Unfamiliar codebases. Google's RCT tested developers on code they didn't know well. AI is genuinely helpful for navigation and comprehension in unfamiliar territory. METR tested developers on code they knew intimately.

Simpler tasks. GitHub's 55% number came from a self-contained coding exercise. METR tested complex, multi-file issues in production repositories with testing, documentation, and linting requirements.

Code completions vs. agents. Microsoft's study measured basic autocomplete suggestions, not agentic AI writing entire functions. Autocomplete has a much lower failure cost -- you glance, accept or reject, move on.

Self-reported vs. measured. DORA's 80%+ number is what developers believe. METR showed that developers' beliefs about their own productivity can be systematically wrong.

Harvard's "jagged technological frontier" concept provides the best framework: AI dramatically helps with tasks inside its capabilities and actively hurts with tasks outside them. The frontier isn't uniform -- it's jagged, unpredictable, and highly context-dependent.


The Quality Problem

Speed isn't the only dimension where AI introduces problems.

Uplevel (2024) studied 800 developers and found that Copilot users produced 41% more bugs with no measurable improvement in development speed. The tool was a net negative.

GitClear (2024-2025) analyzed 211 million changed lines from repositories at Google, Microsoft, Meta, and enterprise companies. They found:

  • Code duplication: 4x more code cloning -- copy-pasted lines rose from 8.3% to 12.3% between 2021 and 2024
  • Code churn: New code revised within two weeks grew from 3.1% in 2020 to 5.7% in 2024
  • Refactoring collapsed: From 25% of changed lines in 2021 to less than 10% in 2024

Developers were writing more code. They were maintaining less of it. They were copying instead of abstracting.
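GitClear's methodology is proprietary, so you can't reproduce their numbers exactly, but a crude proxy for the duplication trend is easy to compute on your own repository. A sketch that counts repeated non-trivial lines -- a much blunter instrument than GitClear's commit-level clone tracking:

```python
import sys
from collections import Counter
from pathlib import Path

def duplicate_line_ratio(root, exts=(".py", ".js", ".ts")):
    """Share of non-trivial source lines that appear more than once.

    A crude proxy for copy-paste duplication -- NOT GitClear's
    methodology, which tracks cloned blocks across commits.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in exts:
            for line in path.read_text(errors="ignore").splitlines():
                stripped = line.strip()
                if len(stripped) > 20:  # skip braces, blanks, imports
                    counts[stripped] += 1
    total = sum(counts.values())
    dupes = sum(c for c in counts.values() if c > 1)
    return dupes / total if total else 0.0

if __name__ == "__main__":
    print(f"duplicate-line ratio: {duplicate_line_ratio(sys.argv[1]):.1%}")
```

Tracked over time, even a blunt number like this will show whether your repository is drifting the way GitClear's aggregate data did.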

Faros AI (2025) studied 10,000+ developers across 1,255 teams and found the throughput paradox at organizational scale: high AI adoption teams merged 98% more PRs and completed 21% more tasks. But PR review time ballooned 91%, PR sizes grew 154%, and bugs increased 9%. Individual productivity rose. Organizational delivery speed didn't.

The pattern is consistent: AI accelerates code production while degrading code quality and overwhelming downstream processes. You write faster. You debug longer. The system ships at the same pace or slower.


The 2026 Update: METR Tries Again

In February 2026, METR published a follow-up study -- and was upfront that this second round came with significant problems of its own.

The new round expanded to 57 developers (10 returning, 47 new), 143 repositories, and 800+ tasks. Compensation dropped to $50/hour. The results were inconclusive:

  • Returning developers: -18% speedup (95% CI: -38% to +9%)
  • New developers: -4% speedup (95% CI: -15% to +9%)

Both confidence intervals cross zero, so neither result is statistically significant. The point estimates still lean toward a slowdown -- far less so for new developers -- but each interval leaves room for a genuine speedup. The data can't settle it either way.

More telling was the selection bias METR discovered: 30-50% of developers told researchers they were choosing not to submit tasks because they didn't want to do them without AI. Others refused to participate at all without guaranteed AI access, even at $50/hour. The study was systematically missing the developers and tasks with the highest expected AI benefit.
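That kind of attrition biases the estimate in a predictable direction. A toy simulation with an invented effect distribution makes the mechanism visible: if the highest-benefit tasks are the ones withheld, the measured average understates the true effect.

```python
import random

random.seed(0)

def avg(xs):
    return sum(xs) / len(xs)

# Toy model: each task has a true AI effect (positive = speedup).
# The distribution is invented purely for illustration.
true_effects = sorted(random.gauss(0.05, 0.25) for _ in range(1000))

# Selective attrition: developers withhold the 40% of tasks where AI
# would help most (METR's follow-up reported 30-50% opting out).
observed = true_effects[:int(len(true_effects) * 0.6)]

print(f"true mean effect:     {avg(true_effects):+.1%}")
print(f"measured mean effect: {avg(observed):+.1%}")
```

With these made-up numbers the measured mean comes out negative even though the true average effect is positive. Attrition alone can flip the sign of a headline result -- in this case hiding a benefit rather than a harm.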

METR's own conclusion was candid: their new data "gives an unreliable signal" and they believe "developers are more sped up from AI tools now -- in early 2026 -- compared to estimates from early 2025."

This matters. The tools improved. Claude 3.5 became Claude 4. Cursor shipped better context management. The 19% slowdown was a snapshot of a specific moment with specific tools. It shouldn't be treated as a permanent verdict.


What the Experts Say

The study generated reactions from across the industry:

David Cramer (Sentry) experimented with AI agents building a real service and concluded the output was "absolutely unmaintainable." He could not ship what AI produced.

Armin Ronacher found 95% of agentic workflows failed and concluded the most useful AI interaction was simple conversation -- asking questions, not generating code.

Simon Willison, one of the most prominent AI developer advocates, claims LLMs make him "2-5x more productive for the coding portions" of his work. But he immediately qualifies: coding is only a fraction of his job. And responsible AI-assisted development requires operating "at the top of your game."

Addy Osmani (Google Chrome team) emphasizes that "almost everything that makes someone a senior engineer -- designing systems, managing complexity -- is what yields the best outcomes with AI." The tool amplifies expertise. It doesn't replace it.

Kent Beck (creator of TDD and Extreme Programming) distinguishes between "augmented coding" and "vibe coding" -- the former maintains strict quality enforcement with AI doing the typing, the latter abandons oversight entirely.

The consensus among practitioners who've worked deeply with AI tools: the benefit is real but narrow, the measurement is broken, and the marketing is light-years ahead of the evidence.


The Bottleneck Migration Problem

Here's the structural issue that most productivity discussions miss entirely.

When coding accelerates, the bottleneck doesn't disappear. It migrates. PR volume increases, so review queues grow. Review queues grow, so merge cycles slow. More code ships faster, so QA becomes saturated. Security validation lags behind the increased surface area.

Gradle's analysis of this "developer productivity paradox" frames it through Amdahl's Law: if coding is 30% of the total delivery time and you make coding 50% faster, you've improved total delivery by 15%. But that 15% gain evaporates if the other 70% slows down because it's now handling more volume.
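The arithmetic deserves to be explicit, since it caps every "AI makes coding N% faster" claim. A one-function version with the numbers above:

```python
def delivery_time_saved(coding_share, coding_speedup):
    """Fraction of total delivery time saved when only coding gets faster.

    coding_share:   coding's share of total delivery time (0..1)
    coding_speedup: fraction by which coding time shrinks (0..1)
    """
    return coding_share * coding_speedup

# Coding is 30% of delivery; AI halves coding time.
print(f"{delivery_time_saved(0.30, 0.50):.0%} of total delivery time saved")
# -> 15%, and only if review, QA, and security absorb the extra volume
```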

Faros AI's data confirms this at scale. Individual developers produced more. Teams didn't deliver faster. The bottleneck moved from writing to reviewing, and nobody had accelerated reviewing.

Suhail Patel (Monzo Bank) identified a measurement problem compounding this: AI tool vendors "measure what they can, and that's the number you get!" They report code completion acceptance rates and lines generated. They don't report downstream bug rates, review burden, or maintenance costs. The metrics they sell are a fraction of the story.


What This Means for You

The METR study isn't a verdict on AI tools. It's a calibration event. Here's what the full body of evidence actually tells us:

If you're an experienced developer on a familiar codebase: AI tools may slow you down on complex tasks. Use them selectively -- test generation, boilerplate, exploring unfamiliar APIs -- not as a default for everything.

If you're working on unfamiliar code: AI tools genuinely help with navigation, comprehension, and initial implementation. The Google study's 21% gain was real and significant.

If you're a manager measuring "AI productivity": Stop trusting self-reported surveys. The perception-reality gap is enormous. Measure cycle time, defect rates, and review burden, not developer sentiment (a sketch of pulling cycle time from version control follows below).

If you're choosing tools: Autocomplete (Copilot-style suggestions) has a better evidence base than agentic coding (AI writing entire features). The former has lower failure costs and higher acceptance rates. The latter introduces the kind of overhead METR measured.

If you're early in your career: The Harvard/BCG study found junior developers gained the most from AI assistance. But GitClear's data suggests those gains come with a quality cost that may slow skill development. Use AI as a learning aid, not a crutch. Understand what it generates before you ship it.
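On the measurement point above: cycle time can come straight from version-control metadata instead of a survey. A minimal sketch against GitHub's public REST API -- the endpoint and fields are the standard ones, but pagination, error handling, and rate limits are omitted:

```python
from datetime import datetime
from statistics import median

import requests  # third-party: pip install requests

def median_pr_cycle_time_hours(owner, repo, token):
    """Median hours from PR creation to merge, last 100 closed PRs."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/pulls",
        params={"state": "closed", "per_page": 100},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    hours = [
        (datetime.fromisoformat(pr["merged_at"].rstrip("Z"))
         - datetime.fromisoformat(pr["created_at"].rstrip("Z"))
         ).total_seconds() / 3600
        for pr in resp.json()
        if pr.get("merged_at")  # closed-but-unmerged PRs have null here
    ]
    return median(hours)

# Track this number (plus defect and review metrics) across an AI
# rollout -- it moves independently of how fast developers *feel*.
```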

The most honest assessment comes from Addy Osmani's comprehensive review: AI provides "modest, uneven boosts" rather than transformative gains. Controlled studies show steady 20-30% improvements in specific tasks, not the 10x revolution the marketing promises.


The Uncomfortable Truth

The METR study revealed something the industry doesn't want to hear: we can't yet measure AI's impact on software development with any confidence. The best-designed studies contradict each other. Developers can't accurately assess their own productivity. Vendor metrics are incomplete. Organizational throughput gains are eaten by downstream bottlenecks.

The Stack Overflow 2025 survey found that positive views of AI tools declined from over 70% in 2023 to roughly 60% in 2025. Forty-six percent of developers say they don't trust AI output accuracy. Sixty-six percent cite "almost right, but not quite" as their top frustration. Seventy-two percent said "vibe coding" is not part of their professional work.

The hype cycle is correcting. That's not a failure. That's maturity.

AI coding tools will get better. Claude 4.6 is substantially more capable than Claude 3.5. Context windows have expanded. Tool integration has deepened. METR's own follow-up suggests the slowdown is narrowing. The trajectory is toward genuine productivity gains.

But today, in April 2026, the honest answer is: it depends. On your experience. On your codebase. On your task. On your tools. On what you're measuring. And on whether you trust the measurement at all.

The 19% matters. Not as a final answer, but as a reminder that feeling productive and being productive are not the same thing -- and that the difference might be costing us more than we realize.


This article synthesizes findings from the METR RCT, Google's internal study, Microsoft's field experiments, GitHub's productivity research, Harvard/BCG's jagged frontier study, GitClear's code quality analysis, Faros AI's organizational study, METR's 2026 follow-up, Domenic Denicola's participant account, and Addy Osmani's practitioner review. Thirty-one sources total.