Tests as Ceremony: When AI Breaks the Safety Net

April 7, 2026

AI-generated tests pass. That's the problem.

Passing is not a useful correctness criterion. Mark Seemann makes this argument sharply: AI-generated tests have "little epistemological content." They skip the critical step of seeing a test fail before writing code. The test exists, the coverage number goes up, and everyone moves on. But the test never proved anything. It never caught a bug, because it was never designed to catch one.

The Ceremony Problem

There's a distinction between tests that verify behavior and tests that perform verification. The first kind exists because an engineer thought about what could go wrong, wrote a test that would fail if it did, then wrote the code to make it pass. The second kind exists because a coverage tool flagged a gap, an AI generated something that filled it, and the CI pipeline turned green.

The second kind is ceremony. It looks like safety. It acts like safety. It is not safety.
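A minimal sketch of the difference, in Python. The function `apply_discount` and all three tests are hypothetical, invented for illustration. The first test executes the code and raises coverage; the other two encode a claim about behavior that a wrong implementation would violate.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test: reduce a price by a percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


def test_ceremony():
    # Runs the code, so the coverage number goes up, but any implementation
    # that returns something other than None passes.
    result = apply_discount(100.0, 10.0)
    assert result is not None


def test_behavior():
    # Pins down expected values, so a wrong formula fails here.
    assert apply_discount(100.0, 10.0) == 90.0
    assert apply_discount(80.0, 25.0) == 60.0


def test_rejects_out_of_range_percent():
    # Encodes a failure mode someone thought about: a discount above 100%.
    try:
        apply_discount(100.0, 150.0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for percent > 100")
```

Both styles produce the same coverage report. Only the second and third would turn red if the formula or the validation changed.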

IBM's developers found this out the hard way. They rejected 70% of AI-generated tests, describing them as "robotic," lacking intent and flow. The tests technically worked, but they didn't communicate anything about the system's expected behavior. They were artifacts, not assertions.

Where LLMs Cluster

LLMs are good at obvious cases. Given a function, they'll test the happy path thoroughly. They'll verify that valid inputs produce expected outputs. This is the easiest kind of test to write and the least valuable kind to have.

The valuable tests live in the branches, the loops, the edge cases. What happens when the input is null? What happens when two concurrent requests hit the same resource? What happens when the third-party API returns a 200 with an error body? LLMs cluster around happy paths and leave these gaps uncovered, precisely the scenarios where production systems actually break.
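Here is a rough sketch of what testing one of those gaps can look like, using the "200 with an error body" case. The parser, the `PaymentError` type, and the response shape (`charge_id`, `error`) are all assumptions made up for this example, not any real provider's API.

```python
import json
from typing import Optional


class PaymentError(Exception):
    """Hypothetical error raised when a charge did not succeed."""


def parse_charge_response(status_code: int, body: Optional[str]) -> str:
    """Return the charge id, or raise PaymentError on any kind of failure."""
    if status_code != 200:
        raise PaymentError(f"HTTP {status_code}")
    if not body:
        raise PaymentError("empty response body")
    payload = json.loads(body)
    # The edge case: the transport succeeded, the operation did not.
    if "error" in payload:
        raise PaymentError(str(payload["error"]))
    return payload["charge_id"]


def raises_payment_error(status_code: int, body: Optional[str]) -> bool:
    """Small helper so the tests do not depend on a test framework."""
    try:
        parse_charge_response(status_code, body)
    except PaymentError:
        return True
    return False


def test_happy_path():
    # The test an LLM will almost always write.
    assert parse_charge_response(200, '{"charge_id": "ch_1"}') == "ch_1"


def test_200_with_error_body():
    # The test that matters in production.
    assert raises_payment_error(200, '{"error": "card_declined"}')


def test_missing_body():
    assert raises_payment_error(200, None)
    assert raises_payment_error(200, "")
```

Delete the `"error" in payload` check and the happy-path test still passes. Only the second test notices.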

GitClear's analysis of 211 million lines of code found that code duplication increased eightfold during 2024, while refactoring collapsed from 25% of changed lines to less than 10%. The same quantity-over-quality pattern plays out in test suites.

This creates a dangerous illusion. Coverage reports show 85%, 90%, even higher. Leadership sees green dashboards. But the coverage is concentrated in the wrong places: thick on the paths that rarely fail, thin on the paths that always do.

When the Safety Net Becomes Theater

I've seen this play out in practice. Engineers use AI to write tests that achieve high coverage numbers but verify nothing meaningful. The CI pipeline passes. The PR gets approved. The safety net looks intact.

Then something breaks in production, and the team discovers that none of their tests covered the failure mode. The coverage was there. The confidence was there. The actual protection was not.

Kent Beck has observed that AI agents actively fight constraints. When tests fail, some agents don't fix the code. They delete or weaken the tests to make them "pass." The agent optimizes for the metric (green CI) rather than the goal (correct software). This is the testing equivalent of unplugging a smoke detector because it keeps going off.
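One way to make that behavior visible is a check that compares a test file before and after a change. What follows is a minimal sketch, not a hardened tool: the function names are hypothetical, it only counts plain `assert` statements (so it misses `pytest.raises` blocks and `unittest` assertion methods), and it assumes the check runs somewhere the agent cannot edit.

```python
import ast
from typing import Dict, List


def assertion_counts(source: str) -> Dict[str, int]:
    """Map each test function name to the number of assert statements in it."""
    counts = {}
    for node in ast.walk(ast.parse(source)):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.name.startswith("test_"):
            counts[node.name] = sum(
                isinstance(child, ast.Assert) for child in ast.walk(node)
            )
    return counts


def weakened_tests(before: str, after: str) -> List[str]:
    """List tests that were removed or lost assertions between two versions."""
    old, new = assertion_counts(before), assertion_counts(after)
    problems = []
    for name, old_count in sorted(old.items()):
        if name not in new:
            problems.append(f"{name}: removed")
        elif new[name] < old_count:
            problems.append(
                f"{name}: assertions dropped from {old_count} to {new[name]}"
            )
    return problems


BEFORE = '''
def test_total():
    assert total([1, 2]) == 3
    assert total([]) == 0

def test_negative():
    assert total([-1]) == -1
'''

AFTER = '''
def test_total():
    assert total([1, 2]) == 3
'''

for problem in weakened_tests(BEFORE, AFTER):
    print(problem)
# prints:
# test_negative: removed
# test_total: assertions dropped from 2 to 1
```

A flagged change is not automatically wrong. Tests get deleted for good reasons. The point is that the deletion becomes a decision a human reviews rather than a side effect of turning CI green.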

CI that lies is worse than no CI at all. With no CI, the team knows it is flying blind. Lying CI gives false confidence. It tells you the system is safe when it isn't, and you make deployment decisions based on that false signal. It's the same dynamic I've described with dashboard proliferation: the artifact of safety replaces actual safety.

Blast Radius Determines Ceremony

Not all code needs the same level of verification rigor. I've written about risk evaluation tiers before: the blast radius of a failure determines how much ceremony your development process should carry.

A prototype UI component that's behind a feature flag? AI-generated tests covering the happy path might be sufficient. The blast radius is small. If it breaks, the impact is contained.

A payment processing pipeline? A database migration? Authentication logic? These demand tests written with intent, tests where an engineer thought about what could go wrong and wrote assertions that would catch it. The blast radius is large enough that ceremony isn't overhead. It's insurance.

The problem is that AI-generated tests apply the same shallow coverage everywhere, regardless of blast radius. They treat a checkout flow the same as a tooltip component. The ceremony is uniform when it should be calibrated.
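Calibration can be written down explicitly. The sketch below maps file paths to a verification tier. The path patterns, tier names, and requirement flags are all placeholders I made up; a real team would pick its own.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from path patterns to blast-radius tiers.
# The first matching pattern wins.
TIER_RULES = [
    ("payments/*", "critical"),
    ("auth/*", "critical"),
    ("migrations/*", "critical"),
    ("ui/experimental/*", "low"),
]
DEFAULT_TIER = "standard"

# Hypothetical requirements per tier.
REQUIREMENTS = {
    "critical": {"engineer_authored_tests": True, "failure_modes_documented": True},
    "standard": {"engineer_authored_tests": False, "failure_modes_documented": True},
    "low": {"engineer_authored_tests": False, "failure_modes_documented": False},
}


def tier_for(path: str) -> str:
    """Return the verification tier for a changed file."""
    for pattern, tier in TIER_RULES:
        # fnmatchcase's "*" also matches "/" so nested paths are covered.
        if fnmatchcase(path, pattern):
            return tier
    return DEFAULT_TIER


def requirements_for(path: str) -> dict:
    """Return what the process demands before this file's change can merge."""
    return REQUIREMENTS[tier_for(path)]


print(tier_for("payments/stripe/charge.py"))   # critical
print(tier_for("ui/experimental/tooltip.tsx")) # low
print(tier_for("reports/export.py"))           # standard
```

The table itself matters less than the fact that it exists. Once the tiers are written down, a checkout flow and a tooltip stop getting the same treatment by default.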

The Fix Isn't More Tests

The instinct is to generate more tests, push the coverage number higher. This makes the problem worse. More ceremonial tests create more false confidence without adding protection.

The fix is different verification entirely. Three principles:

1. Specs as source of truth. When the specification defines expected behavior explicitly, tests can be validated against something other than the code they're testing. A test that merely mirrors the implementation proves nothing. A test that validates against an independent spec proves the implementation matches intent.
2. Deterministic gates the agent can't weaken. If an AI agent can delete a test to make CI pass, the gate is too soft. Contract tests, schema validation, and integration checks that run against real dependencies create verification points that can't be gamed by modifying test files.
3. Intentional failure as a requirement. Before a test earns its place in the suite, someone should be able to explain what failure mode it catches. If no one can articulate what would break if this test were removed, the test is ceremony.
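To make the first principle concrete, here is a small sketch of a spec held separately from the implementation. The shipping-fee rule, the threshold, and every name here are hypothetical. What matters is that the expected values come from the spec table, not from reading the code.

```python
# Hypothetical spec, written from the requirement rather than from the code:
# (order_total, expected_fee, rule being specified)
SHIPPING_SPEC = [
    (0.00, 5.00, "orders under 50 pay a flat fee"),
    (49.99, 5.00, "just below the threshold still pays"),
    (50.00, 0.00, "the threshold itself ships free"),
    (120.00, 0.00, "orders above the threshold ship free"),
]


def shipping_fee(order_total: float) -> float:
    """Implementation under test."""
    return 0.0 if order_total >= 50 else 5.0


def shipping_fee_buggy(order_total: float) -> float:
    """A plausible wrong implementation: strict comparison at the boundary."""
    return 0.0 if order_total > 50 else 5.0


def check_against_spec(fn, spec) -> list:
    """Return one message per spec row the implementation violates."""
    failures = []
    for order_total, expected, rule in spec:
        actual = fn(order_total)
        if actual != expected:
            failures.append(f"{rule}: expected {expected}, got {actual}")
    return failures


print(check_against_spec(shipping_fee, SHIPPING_SPEC))
# prints: []
print(check_against_spec(shipping_fee_buggy, SHIPPING_SPEC))
# prints: ['the threshold itself ships free: expected 0.0, got 5.0']
```

A test generated by reading `shipping_fee_buggy` would happily assert that an order of exactly 50 costs 5.00 to ship. The spec table is what says otherwise.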

The Intent Gap in Testing

This connects to a broader pattern. The gap between what AI tools produce and what the work actually requires is an intent gap. AI can generate the artifact (the test) without understanding the intent (what the test should prove). Coverage tools can't tell the difference. CI pipelines can't tell the difference. Only engineers who understand the system's failure modes can tell the difference.

Understanding where AI tools sit on your adoption ladder matters here. The question for engineering leaders isn't whether your team uses AI to write tests. They almost certainly do. The question is whether anyone is verifying that those tests carry epistemological weight, that they prove something about the system's correctness rather than just filling coverage gaps.

Open your test suite. Pick the five most recently added tests. For each one, answer: what breaks if this test is removed? If you can't answer that for most of them, your safety net is theater. The coverage number is a lie.
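If "what breaks if this test is removed?" is hard to answer by inspection, you can ask the test directly by running it against deliberately broken versions of the code. This is a hand-rolled sketch of the idea behind mutation testing, not a real mutation-testing tool. The function `is_adult`, the mutants, and both tests are invented for the example.

```python
def is_adult(age: int) -> bool:
    """Hypothetical function under test."""
    return age >= 18


# Hand-written broken variants, each representing a bug someone could ship.
MUTANTS = {
    "off-by-one boundary": lambda age: age > 18,
    "inverted comparison": lambda age: age < 18,
    "always true": lambda age: True,
}


def ceremonial_test(fn) -> bool:
    """Returns True when the test passes. Only checks that something came back."""
    return fn(30) is not None


def boundary_test(fn) -> bool:
    """Returns True when the test passes. Checks both sides of the boundary."""
    return fn(18) is True and fn(17) is False


def mutants_caught(test, mutants) -> list:
    """Names of the broken variants that make the test fail."""
    return [name for name, mutant in mutants.items() if not test(mutant)]


# Both tests pass against the real implementation.
assert ceremonial_test(is_adult)
assert boundary_test(is_adult)

print(mutants_caught(ceremonial_test, MUTANTS))
# prints: []
print(mutants_caught(boundary_test, MUTANTS))
# prints: ['off-by-one boundary', 'inverted comparison', 'always true']
```

A test that catches no mutants has a clear answer to the question: nothing breaks if you remove it.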

Stop writing more tests. Start writing tests that mean something.

Tags: ai, tests, sdlc

Brian Conn https://connsulting.io
