Your Team's AI Metrics Are Lying to You
Your engineering team adopted AI coding tools six months ago. Deployment frequency is up. Lead time is down. PRs are flying through the pipeline. Everyone feels faster.
But are they?
I've been digging into the data across multiple client engagements, and there's a growing gap between what AI-assisted engineering teams perceive and what the numbers actually show. The metrics most teams celebrate are painting an incomplete picture, and the metrics that would tell the real story are the ones nobody's watching.
The Perception Problem
METR's randomized controlled trial found that developers using AI tools felt 20% faster. When actually measured, they were 19% slower.
That's not a small miss. That's a perception gap of nearly 40 percentage points. Developers genuinely believed they were getting more done, and the opposite was true.
This isn't an isolated finding. It's a pattern I'm seeing across organizations that have rolled out AI coding assistants without rethinking how they measure success.
The Metrics That Lie
DORA metrics have become the default scorecard for engineering teams. Deployment frequency, lead time for changes, change failure rate, mean time to recovery. On the surface, AI tools make the first two look great.
More code gets written. More PRs get opened. More merges happen. Deployment frequency goes up. Lead time drops because individual changes move faster through the pipeline.
But Faros AI analyzed data from over 10,000 developers and found something troubling: 98% more PRs were being merged, while review time increased by 91%. The net result? Organizational delivery actually declined by 1.5%. Developers using AI also interacted with 47% more pull requests daily, compounding the cognitive load on reviewers.
More PRs. More review burden. Less actual delivery. The volume metrics went up while the outcome metrics went sideways or down.
The Metrics That Matter
Code churn is climbing. GitClear's analysis of 211 million lines of code found that duplicated code blocks rose 8x after AI tool adoption, while refactored lines dropped from 24% to 9.5%. Code is being generated faster, but it's lower-quality code that needs more rework. At the client engagements I've analyzed, code churn has risen from 5.5% to 7.9%, a trajectory that matches these broader findings.
AI-generated code creates more issues. CodeRabbit's analysis of 470 repositories showed that AI-generated code creates 1.7x more issues than human-written code, with performance inefficiencies appearing nearly 8x more often. More code, more bugs, more time spent fixing what shouldn't have broken in the first place.
Change failure rate is rising. This is the DORA metric that tells you whether your changes actually work. When AI generates more code faster but that code is buggier and more duplicative, change failure rate climbs. Teams ship more, but they also break more.
Rework rate is increasing. When developers accept AI suggestions without deep review, code that looked right at generation time comes back as a bug, a performance issue, or a maintenance headache. The time "saved" in writing gets spent twice over in fixing.
Why This Happens
The core issue is that AI tools optimize for the wrong thing. They optimize for code generation speed. But software development has never been bottlenecked by typing speed. The hard parts are understanding the problem, designing the right solution, and validating that the solution actually works.
When you make the easy part faster without improving the hard parts, you get exactly what the data shows: more output, lower quality, more rework. This is the quality in, quality out problem applied to the entire development pipeline.
There's also a review problem. Stack Overflow's 2025 Developer Survey found that 46% of developers distrust AI-generated output, up from 31% the prior year. The top frustration? 66% of developers cited dealing with AI solutions that are "almost right, but not quite." Trust is eroding even among the people using these tools daily. But the volume of AI-generated code still gets merged because teams haven't adjusted their review processes to match the new reality.
The Intent Gap at the Metrics Level
This is what I call the Intent Gap operating at the organizational level. The intent is to ship faster and build more. The reality is that you're generating more code while delivering less value per line.
The gap between what you intend to measure (engineering productivity) and what you actually measure (code volume) creates a blind spot. Teams celebrate the metrics going up without noticing that the metrics going down are the ones that actually matter.
It's the same pattern I see in every domain where measurement goes wrong: you optimize for what's easy to count and miss what's hard to measure but actually important. I've written about this dynamic with dashboards that become artifacts of safety rather than actual safety, and the same logic applies to DORA metrics in an AI-assisted world.
What to Watch Instead
If your team has adopted AI coding tools, here's where to focus your measurement:
Track change failure rate alongside deployment frequency. If deployments are up but failures are too, you're not moving faster. You're moving louder.
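Here's a minimal sketch of computing the two together, assuming you can export deployment records with a date and a flag for whether the change triggered an incident, rollback, or hotfix (the record shape here is mine, not a standard):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Deployment:
    day: date      # when the change hit production
    failed: bool   # did it trigger an incident, rollback, or hotfix?

def deployment_frequency(deployments: list[Deployment], weeks: int) -> float:
    """Average deployments per week over the reporting window."""
    return len(deployments) / weeks

def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that caused a failure in production."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)

# Read these as a pair: more deployments with a flat or falling failure
# rate is real speed. More deployments with a rising failure rate is noise.
```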
Monitor code churn and rework rates. Measure the percentage of code that gets modified within 14 days of being written. If this number is climbing post-AI adoption, your "faster" code is creating downstream drag.
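One rough way to get at this from git history alone is a file-level proxy: of all the times a file gets modified, how often does the modification land within 14 days of that file's previous change? It's coarser than true line-level churn, but it trends the same way. A sketch, assuming a local clone at `repo_path`:

```python
import subprocess

REWORK_WINDOW = 14 * 24 * 3600  # 14 days, in seconds

def rework_rate(repo_path: str, since: str = "90 days ago") -> float:
    """Fraction of file modifications that land within 14 days of that
    file's previous modification (a coarse proxy for code churn)."""
    # One commit timestamp per entry, followed by the files it touched.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}",
         "--name-only", "--pretty=format:%ct", "--reverse"],
        capture_output=True, text=True, check=True,
    ).stdout

    last_touched: dict[str, int] = {}
    timestamp = 0
    total = reworked = 0
    for line in log.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.isdigit():              # a commit timestamp line
            timestamp = int(line)
            continue
        # Otherwise it's a path touched by the current commit.
        if line in last_touched:
            total += 1
            if timestamp - last_touched[line] <= REWORK_WINDOW:
                reworked += 1
        last_touched[line] = timestamp
    return reworked / total if total else 0.0
```

If that number climbs quarter over quarter after AI adoption, the time "saved" writing code is showing up again as rework.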
Measure review quality, not just review speed. How many comments per review? How many review cycles before merge? If reviews are getting rubber-stamped because the volume is overwhelming, you have a quality gate that's no longer gating.
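If your PRs live on GitHub, both numbers are available from the REST API. A minimal sketch, assuming a token with read access in the GITHUB_TOKEN environment variable (owner, repo, and PR number are placeholders; pagination and rate limits are ignored for brevity):

```python
import os
import requests

API = "https://api.github.com"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def review_depth(owner: str, repo: str, pr_number: int) -> tuple[int, int]:
    """Return (diff comments, review rounds) for a single pull request."""
    base = f"{API}/repos/{owner}/{repo}/pulls/{pr_number}"

    # Line-level comments reviewers left on the diff.
    comments = requests.get(f"{base}/comments", headers=HEADERS).json()

    # Submitted reviews; each CHANGES_REQUESTED review is one more cycle
    # the author went through before merge.
    reviews = requests.get(f"{base}/reviews", headers=HEADERS).json()
    rounds = 1 + sum(1 for r in reviews if r.get("state") == "CHANGES_REQUESTED")

    return len(comments), rounds
```

A stream of merged PRs averaging zero comments and a single round isn't a sign of clean code. At AI-era volumes, it's usually a rubber stamp.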
Watch net delivery, not gross output. Features shipped and validated by users, not PRs merged. Outcomes, not activity.
The teams that will get real value from AI coding tools are the ones that rethink their measurement to match the new reality. Understanding where your team sits on the AI adoption ladder is the first step. The second is making sure your metrics actually measure what you think they do.
Stop celebrating volume. Start measuring what matters.

