Your Team's AI Metrics Are Lying to You
Your engineering team adopted AI coding tools six months ago. Deployment frequency is up. Lead time is down. PRs are flying through the pipeline. Everyone feels faster.
But are they?
I've been digging into the data across multiple client engagements, and there's a growing gap between what AI-assisted engineering teams perceive and what the numbers actually show. The metrics most teams celebrate are painting an incomplete picture, and the metrics that would tell the real story are the ones nobody's watching.
The Perception Problem
METR's randomized controlled trial found that developers using AI tools felt 20% faster. When actually measured, they were 19% slower.
That's not a small miss. That's a perception gap of nearly 40 percentage points. Developers genuinely believed they were getting more done, and the opposite was true.
This isn't an isolated finding. It's a pattern I'm seeing across organizations that have rolled out AI coding assistants without rethinking how they measure success.
The Metrics That Lie
DORA metrics have become the default scorecard for engineering teams. Deployment frequency, lead time for changes, change failure rate, mean time to recovery. On the surface, AI tools make the first two look great.
More code gets written. More PRs get opened. More merges happen. Deployment frequency goes up. Lead time drops because individual changes move faster through the pipeline.
But Faros AI analyzed data from over 10,000 developers and found something troubling: 98% more PRs were being merged, while review time increased by 91%. The net result? Organizational delivery actually declined by 1.5%. Developers using AI also interacted with 47% more pull requests daily, compounding the cognitive load on reviewers.
More PRs. More review burden. Less actual delivery. The volume metrics went up while the outcome metrics went sideways or down.
The Metrics That Matter
Code churn is climbing. GitClear's analysis of 211 million lines of code found that duplicated code blocks rose 8x after AI tool adoption, while refactored lines dropped from 24% to 9.5%. Code is being generated faster, but it's lower-quality code that needs more rework. At the client engagements I've analyzed, code churn has risen from 5.5% to 7.9%, a trajectory that matches these broader findings.
AI-generated code creates more issues. CodeRabbit's analysis of 470 repositories showed that AI-generated code creates 1.7x more issues than human-written code, with performance inefficiencies appearing nearly 8x more often. More code, more bugs, more time spent fixing what shouldn't have broken in the first place.
Change failure rate is rising. This is the DORA metric that tells you whether your changes actually work. When AI generates more code faster but that code is buggier and more duplicative, change failure rate climbs. Teams ship more, but they also break more.
Rework rate is increasing. When developers accept AI suggestions without deep review, code that looked right at generation time comes back as a bug, a performance issue, or a maintenance headache. The time "saved" in writing gets spent twice over in fixing.
Why This Happens
The core issue is that AI tools optimize for the wrong thing. They optimize for code generation speed. But software development has never been bottlenecked by typing speed. The hard parts are understanding the problem, designing the right solution, and validating that the solution actually works.
When you make the easy part faster without improving the hard parts, you get exactly what the data shows: more output, lower quality, more rework. This is the quality in, quality out problem applied to the entire development pipeline.
There's also a review problem. Stack Overflow's 2025 Developer Survey found that 46% of developers distrust AI-generated output, up from 31% the prior year. The top frustration? 66% of developers cited dealing with AI solutions that are "almost right, but not quite." Trust is eroding even among the people using these tools daily. But the volume of AI-generated code still gets merged because teams haven't adjusted their review processes to match the new reality.
The Intent Gap at the Metrics Level
This is what I call the Intent Gap operating at the organizational level. The intent is to ship faster and build more. The reality is that you're generating more code while delivering less value per line.
The gap between what you intend to measure (engineering productivity) and what you actually measure (code volume) creates a blind spot. Teams celebrate the metrics going up without noticing that the metrics going down are the ones that actually matter.
It's the same pattern I see in every domain where measurement goes wrong: you optimize for what's easy to count and miss what's hard to measure but actually important. I've written about this dynamic with dashboards that become artifacts of safety rather than actual safety, and the same logic applies to DORA metrics in an AI-assisted world.
What to Watch Instead
If your team has adopted AI coding tools, here's where to focus your measurement:
Track change failure rate alongside deployment frequency. If deployments are up but failures are too, you're not moving faster. You're moving louder.
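Here's a minimal sketch of computing the two together, assuming you can export deployment records with a date and a flag for whether the change triggered an incident, rollback, or hotfix (the record shape here is mine, not a standard):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Deployment:
    day: date      # when the change hit production
    failed: bool   # did it trigger an incident, rollback, or hotfix?

def deployment_frequency(deployments: list[Deployment], weeks: int) -> float:
    """Average deployments per week over the reporting window."""
    return len(deployments) / weeks

def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that caused a failure in production."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)

# Read these as a pair: more deployments with a flat or falling failure
# rate is real speed. More deployments with a rising failure rate is noise.
```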
Monitor code churn and rework rates. Measure the percentage of code that gets modified within 14 days of being written. If this number is climbing post-AI adoption, your "faster" code is creating downstream drag.
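One rough way to get at this from git history alone is a file-level proxy: of all the times a file gets modified, how often does the modification land within 14 days of that file's previous change? It's coarser than true line-level churn, but it trends the same way. A sketch, assuming a local clone at `repo_path`:

```python
import subprocess

REWORK_WINDOW = 14 * 24 * 3600  # 14 days, in seconds

def rework_rate(repo_path: str, since: str = "90 days ago") -> float:
    """Fraction of file modifications that land within 14 days of that
    file's previous modification (a coarse proxy for code churn)."""
    # One commit timestamp per entry, followed by the files it touched.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}",
         "--name-only", "--pretty=format:%ct", "--reverse"],
        capture_output=True, text=True, check=True,
    ).stdout

    last_touched: dict[str, int] = {}
    timestamp = 0
    total = reworked = 0
    for line in log.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.isdigit():              # a commit timestamp line
            timestamp = int(line)
            continue
        # Otherwise it's a path touched by the current commit.
        if line in last_touched:
            total += 1
            if timestamp - last_touched[line] <= REWORK_WINDOW:
                reworked += 1
        last_touched[line] = timestamp
    return reworked / total if total else 0.0
```

If that number climbs quarter over quarter after AI adoption, the time "saved" writing code is showing up again as rework.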
Measure review quality, not just review speed. How many comments per review? How many review cycles before merge? If reviews are getting rubber-stamped because the volume is overwhelming, you have a quality gate that's no longer gating.
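If your PRs live on GitHub, both numbers are available from the REST API. A minimal sketch, assuming a token with read access in the GITHUB_TOKEN environment variable (owner, repo, and PR number are placeholders; pagination and rate limits are ignored for brevity):

```python
import os
import requests

API = "https://api.github.com"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def review_depth(owner: str, repo: str, pr_number: int) -> tuple[int, int]:
    """Return (diff comments, review rounds) for a single pull request."""
    base = f"{API}/repos/{owner}/{repo}/pulls/{pr_number}"

    # Line-level comments reviewers left on the diff.
    comments = requests.get(f"{base}/comments", headers=HEADERS).json()

    # Submitted reviews; each CHANGES_REQUESTED review is one more cycle
    # the author went through before merge.
    reviews = requests.get(f"{base}/reviews", headers=HEADERS).json()
    rounds = 1 + sum(1 for r in reviews if r.get("state") == "CHANGES_REQUESTED")

    return len(comments), rounds
```

A stream of merged PRs averaging zero comments and a single round isn't a sign of clean code. At AI-era volumes, it's usually a rubber stamp.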
Watch net delivery, not gross output. Features shipped and validated by users, not PRs merged. Outcomes, not activity.
The teams that will get real value from AI coding tools are the ones that rethink their measurement to match the new reality. Understanding where your team sits on the AI adoption ladder is the first step. The second is making sure your metrics actually measure what you think they do.
Stop celebrating volume. Start measuring what matters.

