Alert Fatigue is Better Than Radio Silence (And That's a Problem)
Having too many alerts that drive everyone insane is still better than having no alerts at all. I've complained about alert fatigue plenty of times before, but here's the uncomfortable truth: even at its worst, it still beats radio silence. The problem is how low that bar is.
The Life Tax of Broken Alerts
When you have constant false alarms, operators stop looking at them. They just hit acknowledge and move on. They're trained to assume it's noise. But they still get woken up. They still get interrupted. They still can't make weekend plans without dragging their laptop along.
Sure, being on call means you shouldn't book a skydiving trip. But there's a difference between "be available if something breaks" and "sit in front of your computer all day because you'll get pinged every 20 minutes."
Best case? The alerts fire during working hours and destroy your focus. Research shows it takes over 23 minutes to fully regain focus after an interruption, meaning a single false alert can cost you half an hour of productive work. Worst case? They wake you up at 3 AM for the third time this week. On-call sleep disruption measurably degrades next-day performance across all sorts of professions, with engineers reporting being too tired for both work tasks and personal activities.
How Trust Breaks Down
Once engineers learn that alerts are mostly garbage, you've lost them. In a 2022 survey of over 800 IT professionals, 43% reported that more than 40% of their cloud security alerts were false positives. When nearly half your alerts are noise, teams stop trusting them. The consequences are predictable: 55% of teams in that same survey missed critical alerts due to ineffective prioritization. You build a system that cries wolf, and then you're shocked when real wolves get ignored.
It's nearly impossible to rebuild that trust. You basically have to clear out your entire alerting setup and start over. Engineers have long memories when it comes to broken tools that interrupt their sleep.
The Classic Mistake: Alerting on Debugging Metrics
Let's talk about everyone's favorite bad alert: high database CPU.
Some operator gets burned once and slaps an alert on it: "CPU over 90% for 10 minutes? Page someone!" But what is the person who got paged supposed to do with that information? Does it have customer impact? Maybe. Maybe not. It's correlated at best.
The alert probably has no description, or it says something useless like "check the database." Okay, you check the database. CPU is high (obviously, that's why the alert fired). And then what? You're just guessing at root cause without knowing if customers are even affected.
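To make that concrete, here's a minimal Python sketch of what this kind of cause-based rule boils down to. The 90%/10-minute numbers mirror the example above; the data shape and the page function are hypothetical stand-ins:

```python
# A sketch of the classic cause-based alert: page whenever an internal
# metric crosses a threshold, with no check for customer impact.
# The data shape and page_oncall() are hypothetical.

def db_cpu_alert_fires(cpu_samples: list[float]) -> bool:
    """Fire if database CPU stayed above 90% for the last 10 one-minute samples."""
    WINDOW = 10
    THRESHOLD = 0.90
    recent = cpu_samples[-WINDOW:]
    return len(recent) == WINDOW and all(s > THRESHOLD for s in recent)

def page_oncall(message: str) -> None:
    """Stand-in for whatever actually wakes someone up."""
    print(f"PAGE: {message}")

if db_cpu_alert_fires([0.95] * 10):
    # Nothing here asks whether requests are slow or failing. The pager
    # goes off on a number that is, at best, correlated with customer pain.
    page_oncall("database CPU over 90% for 10 minutes -- check the database")
```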
This gets things backwards. Mitigation and impact assessment come before root cause analysis. That's basic incident response. But alerts like this force you to start debugging before you even know if there's a problem. Proper incident response training helps teams learn to respond systematically instead of guessing reactively.
How Alert Sprawl Happens
Most companies end up with alert fatigue the same way: something breaks once, you create an alert for it. Something else breaks, another alert. Repeat 10 times. Now you're drowning.
Best case? These alerts correlate with actual issues but lack useful descriptions. Worst case? The thing you're alerting on happened to be high when something bad happened, but you have no idea if it fires when things are fine because you weren't looking at it then.
Creating a good alert means looking back historically and finding all the times that metric would have fired when there wasn't an issue. Of course, if you're constantly fighting production fires, that's usually a symptom of deeper upstream problems. Before you invest heavily in alerts, consider whether the real issue is in your requirements, architecture, or development process.
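On that first point, checking history before an alert ever goes live, here's a rough Python sketch. It isn't tied to any monitoring product and the data shapes are hypothetical; all it does is replay a candidate threshold rule over past samples and count how often it would have fired outside a known incident window:

```python
# Backtest a candidate alert rule against history: how many evaluation
# windows would have fired when there was no real incident?
# samples: list of (timestamp, value); incidents: list of (start, end) windows.

def would_fire(values: list[float], threshold: float) -> bool:
    """Candidate rule: every sample in the window is above the threshold."""
    return bool(values) and all(v > threshold for v in values)

def count_false_fires(samples, incidents, threshold, window_size=10):
    false_fires = 0
    for i in range(window_size, len(samples) + 1):
        window = samples[i - window_size:i]
        fired_at = window[-1][0]
        if would_fire([v for _, v in window], threshold):
            if not any(start <= fired_at <= end for start, end in incidents):
                false_fires += 1
    return false_fires
```

If that count isn't close to zero, the rule would have been paging people for nothing, and it isn't ready to go into the rotation.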
Moving to a Systematic Approach
Eventually, you need to grow up and take a systematic approach to alerting.
Stop alerting on internals like database CPU. Start alerting on things customers actually feel: request latency and error rates.
This isn't just my opinion. It's how Google's Site Reliability Engineering teams operate at scale. As their monitoring documentation states: "it's better to spend much more effort on catching symptoms than causes; when it comes to causes, only worry about very definite, very imminent causes." Users don't care if MySQL is down. They care if their queries are failing.
If your web application latency is high because database CPU is high, alert on the high latency. You can still have a warning (not a page-someone alert) that shows database CPU as a potential contributing factor. Don't get trapped in the build versus buy dilemma for monitoring tools; instead focus on getting the alerting philosophy right first.
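Here's a minimal sketch of that split in Python, with hypothetical metric names and thresholds: the symptom (latency) pages, while the internal metric only produces a non-paging warning for whoever ends up debugging:

```python
# Page on the symptom customers feel; surface the internal metric only as
# a non-paging warning. Thresholds and metric names are hypothetical.

def evaluate(p99_latency_ms: float, db_cpu: float) -> list[tuple[str, str]]:
    findings = []
    if p99_latency_ms > 500:
        # Customers are feeling this -- wake someone up.
        findings.append(("page", "p99 request latency above 500 ms"))
    if db_cpu > 0.90:
        # Possible contributing factor -- useful context, not a page.
        findings.append(("warn", "database CPU above 90%"))
    return findings

# High latency AND high CPU: one page, with a debugging hint attached.
print(evaluate(p99_latency_ms=820, db_cpu=0.97))
# High CPU alone: a warning, but nobody gets woken up.
print(evaluate(p99_latency_ms=120, db_cpu=0.97))
```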
This is the difference between alerting metrics and debugging metrics:
- Alerting metrics: directly tied to customer impact, trigger pages
- Debugging metrics: help you understand what's broken, don't trigger pages
When database CPU is high but there's no customer impact, you shouldn't get paged.
The Cleanup Work Nobody Prioritizes
You have to go through periodically and consolidate your hodgepodge of alerts into standardized ones, ideally based on metrics that are consistent across applications and understood by all engineers, not just the team that built that specific service. When you do this cleanup work, remember that the best monitoring tools are the ones your team barely has to use. Focus on what your operators actually need to do in the moment rather than building complex dashboards that nobody trusts.
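One way to approach that consolidation, sketched in Python with a hypothetical catalog and thresholds: define a small standard set of customer-impact alerts once, and stamp it onto every service instead of letting each team hand-roll its own rules:

```python
# A standardized alert catalog: the same small set of rules applied to
# every service. Names, metrics, and thresholds are hypothetical.

STANDARD_ALERTS = [
    {"name": "HighErrorRate", "metric": "http_error_rate",
     "threshold": 0.01, "for_minutes": 5, "pages": True},
    {"name": "HighLatencyP99", "metric": "p99_latency_ms",
     "threshold": 500, "for_minutes": 10, "pages": True},
    # Debugging metric: kept visible, but it never pages anyone.
    {"name": "HighDatabaseCPU", "metric": "db_cpu_percent",
     "threshold": 90, "for_minutes": 15, "pages": False},
]

def alerts_for(service: str) -> list[dict]:
    """Every service gets the same catalog; only the service label differs."""
    return [{**alert, "service": service} for alert in STANDARD_ALERTS]

for rule in alerts_for("checkout"):
    print(rule["name"], "pages" if rule["pages"] else "does not page")
```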
What does this get you? A drastic reduction in false positive rate.
False positives happen when an alert fires but there's no customer impact. If you only alert on things directly tied to customer pain, false positives mostly disappear by definition.
You might have some false negatives (things break but you don't get alerted), but long term, that's better. You'll miss fewer real alerts because your engineers aren't completely exhausted and trained to ignore everything. This exhaustion has real business consequences: constantly firefighting destroys team focus and creates the kind of reactive chaos that tanks productivity. When you finally get alert fatigue under control, your team can shift from crisis mode back to building.
The Bottom Line
Alert fatigue is better than no alerts because at least people get woken up when something breaks. But that's setting the bar so low it's underground. The real answer is fewer, better alerts that actually mean something. Stop alerting on debugging metrics, start alerting on customer impact, and clean up your alerts before they bury your on-call rotation.

