Alert Fatigue is Better Than Radio Silence (And That's a Problem)
Having too many alerts that drive everyone insane is still better than having no alerts at all. I've complained about alert fatigue plenty of times before, but here's the uncomfortable truth: even at its worst, it still beats radio silence. The problem is how low that bar is.
The Life Tax of Broken Alerts
When you have constant false alarms, operators stop looking at them. They just hit acknowledge and move on. They're trained to assume it's noise. But they still get woken up. They still get interrupted. They still can't make weekend plans without dragging their laptop along.
Sure, being on call means you shouldn't book a skydiving trip. But there's a difference between "be available if something breaks" and "sit in front of your computer all day because you'll get pinged every 20 minutes."
Best case? The alerts fire during working hours and destroy your focus. Research shows it takes over 23 minutes to fully regain focus after an interruption, meaning a single false alert can cost you half an hour of productive work. Worst case? They wake you up at 3 AM for the third time this week. On-call sleep disruption measurably degrades next-day performance across all sorts of professions, with engineers reporting being too tired for both work tasks and personal activities.
How Trust Breaks Down
Once engineers learn that alerts are mostly garbage, you've lost them. In a 2022 survey of over 800 IT professionals, 43% reported that more than 40% of their cloud security alerts were false positives. When nearly half your alerts are noise, teams stop trusting them. The consequences are predictable: 55% of teams in that same survey missed critical alerts due to ineffective prioritization. You build a system that cries wolf, and then you're shocked when real wolves get ignored.
It's nearly impossible to rebuild that trust. You basically have to clear out your entire alerting setup and start over. Engineers have long memories when it comes to broken tools that interrupt their sleep.
The Classic Mistake: Alerting on Debugging Metrics
Let's talk about everyone's favorite bad alert: high database CPU.
Some operator gets burned once and slaps an alert on it: "CPU over 90% for 10 minutes? Page someone!" But what is the person who got paged supposed to do with that information? Does it have customer impact? Maybe. Maybe not. It's correlated at best.
The alert probably has no description, or it says something useless like "check the database." Okay, you check the database. CPU is high (obviously, that's why the alert fired). And then what? You're just guessing at root cause without knowing if customers are even affected.
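To make that concrete, here's a minimal Python sketch of what this kind of cause-based rule boils down to. The 90%/10-minute numbers mirror the example above; the data shape and the page function are hypothetical stand-ins:

```python
# A sketch of the classic cause-based alert: page whenever an internal
# metric crosses a threshold, with no check for customer impact.
# The data shape and page_oncall() are hypothetical.

def db_cpu_alert_fires(cpu_samples: list[float]) -> bool:
    """Fire if database CPU stayed above 90% for the last 10 one-minute samples."""
    WINDOW = 10
    THRESHOLD = 0.90
    recent = cpu_samples[-WINDOW:]
    return len(recent) == WINDOW and all(s > THRESHOLD for s in recent)

def page_oncall(message: str) -> None:
    """Stand-in for whatever actually wakes someone up."""
    print(f"PAGE: {message}")

if db_cpu_alert_fires([0.95] * 10):
    # Nothing here asks whether requests are slow or failing. The pager
    # goes off on a number that is, at best, correlated with customer pain.
    page_oncall("database CPU over 90% for 10 minutes -- check the database")
```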
This gets things backwards. Mitigation and impact assessment come before root cause analysis. That's basic incident response. But alerts like this force you to start debugging before you even know if there's a problem. Proper incident response training helps teams learn to respond systematically instead of guessing reactively.
How Alert Sprawl Happens
Most companies end up with alert fatigue the same way: something breaks once, you create an alert for it. Something else breaks, another alert. Repeat 10 times. Now you're drowning.
Best case? These alerts correlate with actual issues but lack useful descriptions. Worst case? The thing you're alerting on happened to be high when something bad happened, but you have no idea if it fires when things are fine because you weren't looking at it then.
Creating a good alert means looking back historically and finding all the times that metric would have fired when there wasn't an issue. Of course, if you're constantly fighting production fires, that's usually a symptom of deeper upstream problems. Before you invest heavily in alerts, consider whether the real issue is in your requirements, architecture, or development process.
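On that first point, checking history before an alert ever goes live, here's a rough Python sketch. It isn't tied to any monitoring product and the data shapes are hypothetical; all it does is replay a candidate threshold rule over past samples and count how often it would have fired outside a known incident window:

```python
# Backtest a candidate alert rule against history: how many evaluation
# windows would have fired when there was no real incident?
# samples: list of (timestamp, value); incidents: list of (start, end) windows.

def would_fire(values: list[float], threshold: float) -> bool:
    """Candidate rule: every sample in the window is above the threshold."""
    return bool(values) and all(v > threshold for v in values)

def count_false_fires(samples, incidents, threshold, window_size=10):
    false_fires = 0
    for i in range(window_size, len(samples) + 1):
        window = samples[i - window_size:i]
        fired_at = window[-1][0]
        if would_fire([v for _, v in window], threshold):
            if not any(start <= fired_at <= end for start, end in incidents):
                false_fires += 1
    return false_fires
```

If that count isn't close to zero, the rule would have been paging people for nothing, and it isn't ready to go into the rotation.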
Moving to a Systematic Approach
Eventually, you need to grow up and take a systematic approach to alerting.
Stop alerting on internals like database CPU. Start alerting on things customers actually feel: request latency and error rates.
This isn't just my opinion. It's how Google's Site Reliability Engineering teams operate at scale. As their monitoring documentation states: "it's better to spend much more effort on catching symptoms than causes; when it comes to causes, only worry about very definite, very imminent causes." Users don't care if MySQL is down. They care if their queries are failing.
If your web application latency is high because database CPU is high, alert on the high latency. You can still have a warning (not a page-someone alert) that shows database CPU as a potential contributing factor. Don't get trapped in the build versus buy dilemma for monitoring tools; instead focus on getting the alerting philosophy right first.
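Here's a minimal sketch of that split in Python, with hypothetical metric names and thresholds: the symptom (latency) pages, while the internal metric only produces a non-paging warning for whoever ends up debugging:

```python
# Page on the symptom customers feel; surface the internal metric only as
# a non-paging warning. Thresholds and metric names are hypothetical.

def evaluate(p99_latency_ms: float, db_cpu: float) -> list[tuple[str, str]]:
    findings = []
    if p99_latency_ms > 500:
        # Customers are feeling this -- wake someone up.
        findings.append(("page", "p99 request latency above 500 ms"))
    if db_cpu > 0.90:
        # Possible contributing factor -- useful context, not a page.
        findings.append(("warn", "database CPU above 90%"))
    return findings

# High latency AND high CPU: one page, with a debugging hint attached.
print(evaluate(p99_latency_ms=820, db_cpu=0.97))
# High CPU alone: a warning, but nobody gets woken up.
print(evaluate(p99_latency_ms=120, db_cpu=0.97))
```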
This is the difference between alerting metrics and debugging metrics:
- Alerting metrics: directly tied to customer impact, trigger pages
- Debugging metrics: help you understand what's broken, don't trigger pages
When database CPU is high but there's no customer impact, you shouldn't get paged.
The Cleanup Work Nobody Prioritizes
You have to go through periodically and consolidate your hodgepodge of alerts into standardized ones, ideally based on metrics that are consistent across applications and understood by all engineers, not just the team that built that specific service. When you do this cleanup work, remember that the best monitoring tools are the ones your team barely has to use. Focus on what your operators actually need to do in the moment rather than building complex dashboards that nobody trusts.
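One way to approach that consolidation, sketched in Python with a hypothetical catalog and thresholds: define a small standard set of customer-impact alerts once, and stamp it onto every service instead of letting each team hand-roll its own rules:

```python
# A standardized alert catalog: the same small set of rules applied to
# every service. Names, metrics, and thresholds are hypothetical.

STANDARD_ALERTS = [
    {"name": "HighErrorRate", "metric": "http_error_rate",
     "threshold": 0.01, "for_minutes": 5, "pages": True},
    {"name": "HighLatencyP99", "metric": "p99_latency_ms",
     "threshold": 500, "for_minutes": 10, "pages": True},
    # Debugging metric: kept visible, but it never pages anyone.
    {"name": "HighDatabaseCPU", "metric": "db_cpu_percent",
     "threshold": 90, "for_minutes": 15, "pages": False},
]

def alerts_for(service: str) -> list[dict]:
    """Every service gets the same catalog; only the service label differs."""
    return [{**alert, "service": service} for alert in STANDARD_ALERTS]

for rule in alerts_for("checkout"):
    print(rule["name"], "pages" if rule["pages"] else "does not page")
```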
What does this get you? A drastic reduction in false positive rate.
False positives happen when an alert fires but there's no customer impact. If you only alert on things directly tied to customer pain, false positives mostly disappear by definition.
You might have some false negatives (things break but you don't get alerted), but long term, that's better. You'll miss fewer real alerts because your engineers aren't completely exhausted and trained to ignore everything. This exhaustion has real business consequences: constantly firefighting destroys team focus and creates the kind of reactive chaos that tanks productivity. When you finally get alert fatigue under control, your team can shift from crisis mode back to building.
The Bottom Line
Alert fatigue is better than no alerts because at least people get woken up when something breaks. But that's setting the bar so low it's underground. The real answer is fewer, better alerts that actually mean something. Stop alerting on debugging metrics, start alerting on customer impact, and clean up your alerts before they bury your on-call rotation.

