Your Dashboards Are a Code Smell (And How to Fix It)
This is one of my saltier posts. You've been warned.
I've been on call for over a decade across production SaaS platforms. I've debugged cascading failures at 3 AM, managed 99.99%+ uptime commitments, and transformed reactive teams into proactive operational excellence cultures. Through all of that, I've learned one uncomfortable truth: if your team relies on dashboards for incident response, you have an observability problem.
Dashboards are the lowest common denominator for monitoring. Over-reliance on them (really, any reliance on them during production incident response) is a code smell for your observability strategy.
The One Dashboard Worth Having
Before you accuse me of dashboard nihilism, let me be clear: some dashboards can be useful. But "useful" and "necessary" are different things.
The one dashboard I've genuinely appreciated was a top-level system throughput visualization. Simple concept: stuff coming in on the left should roughly equal stuff being processed and going out on the right. When input and output volumes diverge, you have a problem.
This dashboard was valuable for a specific reason: it provided a quick visual indicator during triage. When Kafka lag increased or message queue processing times spiked, a glance at this dashboard immediately showed whether more data was flowing in than could be processed out. That visual confirmation was useful.
But here's the critical distinction: the dashboard didn't alert us to the problem. Good alerting metrics did. The dashboard was a debugging tool, not a detection tool. We looked at it after we already knew something was wrong, not to discover that something was wrong in the first place.
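If you want to see how little is actually going on in that picture, here's a rough sketch of the comparison. The metric names, rates, and the 10% tolerance are illustrative assumptions, not values from the real system.

```python
# Rough sketch of the input-vs-output comparison behind that dashboard.
# Rates and the 10% tolerance are illustrative, not from the real system.

def throughput_divergence(ingress_per_sec: float, egress_per_sec: float,
                          tolerance: float = 0.10) -> float | None:
    """Return the fractional gap when input outpaces output, else None."""
    if ingress_per_sec <= 0:
        return None
    gap = (ingress_per_sec - egress_per_sec) / ingress_per_sec
    return gap if gap > tolerance else None

# During triage: "are we falling behind, and by how much?"
gap = throughput_divergence(ingress_per_sec=12_000, egress_per_sec=9_500)
if gap is not None:
    print(f"Processing is behind input by {gap:.0%}")
```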
Alerting Metrics vs. Debugging Metrics
This distinction matters more than most teams realize. Good alerting metrics tell you that there is a problem. Debugging metrics help you understand what the problem is. These are fundamentally different purposes, and trying to build metrics that do both effectively usually results in metric explosion.
When Kafka lag increased, our alerts told us: "Processing is falling behind input; customers no longer have up-to-date alerts." That's an alerting metric. It detects the problem and signals business impact.
The throughput dashboard showing input/output divergence is a debugging metric. It helps explain why processing fell behind. Was input volume spiking? Was processing throughput dropping? Both? The visual made this clear.
Most teams conflate these purposes. They build dashboards hoping they'll both detect problems and explain them. This rarely works. Detection requires focused, high-signal metrics with clear thresholds. Debugging requires context-rich, multi-dimensional data that changes based on the specific issue. This is why designing monitoring tools for the job to be done matters: different users need different interfaces optimized for their specific goals.
Choose one purpose per metric. Build alerts for detection. Use debugging metrics for investigation. Stop expecting dashboards to do both.
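To make "one purpose per metric" concrete, here's a minimal sketch using the Kafka lag scenario above. The 300-second threshold and every name in it are made up for illustration.

```python
# One purpose per metric, sketched with the Kafka-lag scenario from above.
# The 300-second threshold and all names are illustrative assumptions.
MAX_ALERT_DELAY_SECONDS = 300

def check_alert_freshness(consumer_lag_seconds: float) -> str | None:
    """Alerting metric: one focused signal, one threshold, phrased as business impact."""
    if consumer_lag_seconds > MAX_ALERT_DELAY_SECONDS:
        return "Processing is falling behind input; customer alerts are stale."
    return None

def lag_breakdown(per_topic_lag_seconds: dict[str, float]) -> list[tuple[str, float]]:
    """Debugging metric: context-rich and multi-dimensional, consulted after detection."""
    return sorted(per_topic_lag_seconds.items(), key=lambda kv: kv[1], reverse=True)
```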
The Pet Dashboard Problem
Here's what actually happens in most engineering organizations:
An incident occurs. An engineer finds signal in some obscure metric while debugging and creates a dashboard. Now I have my dashboard. You have your dashboard. They have their dashboard. Everyone has their pet dashboards showing the metrics they personally find meaningful.
And lo and behold, dashboards are rarely curated. They just pile up like technical debt you pretend doesn't exist.
This proliferation creates three problems:
First, dashboards become institutional knowledge traps. The metrics I gravitate toward reflect my understanding of the system. New team members lack that context, see unclear dashboards, and build their own. The cycle continues.
Second, pet dashboards rarely get maintained. The engineer who created the dashboard moves to a different team. The service architecture evolves. The metrics that were relevant six months ago may not be relevant today. But the dashboard persists, cluttering your observability tools with stale, misleading data.
Third, dashboard proliferation trains teams to hunt and peck for signal instead of having signal come to them. When your primary debugging strategy is "check these 47 dashboards someone created at some point," you've built an observability system that scales poorly and burns out new engineers who can't decode the institutional knowledge embedded in dashboard choices. This problem compounds quickly: 85% of DevOps teams rely on multiple tools for observability, creating complexity that makes it harder to distinguish signal from noise.
The Reactive Dashboard Trap
Creating dashboards after incidents is inherently reactive. It assumes you'll debug the same issue again.
Here's the thing: if you're debugging the same issue twice, you failed to properly resolve it the first time. And while you're creating those dashboards, you're not addressing the real problem. According to Google's DORA research, elite performing teams recover from failures in less than one hour. They're not achieving that by maintaining extensive dashboard libraries.
Postmortems should identify not just the immediate fix but the systemic gaps that allowed the issue to occur. If a dashboard would have helped, the real question is: why didn't your alerts catch this? Why didn't your existing debugging metrics surface this? What gaps in your observability strategy created this blind spot? As I've discussed in The Upstream Root Cause Problem, your operational pain is usually a symptom, not the disease.
Creating a dashboard after an incident is saying, "Next time this exact thing happens, we'll be slightly better prepared." That's not operational excellence. That's building a museum of past failures.
The better question: what alerting metric would have detected this class of problems before customer impact? What standardized debugging metrics would have made the investigation faster regardless of the specific failure mode?
The Evergreen Exception
There is one scenario where dashboards remain useful: standardized metrics across your entire platform.
If you have consistent alerting metrics for every service (HTTP response latency percentiles, error rates, throughput, async processing times) and dashboards using those standardized metrics to show all services in one view, that can be valuable.
The key is "evergreen." These dashboards stay relevant because they're built on standardized metrics that don't change as individual services evolve. New services automatically appear in these dashboards because they implement the same metric standards. Engineers understand these dashboards because the metrics are consistent across the platform.
But notice: these dashboards only work because of the underlying investment in metric standardization. The dashboard is not the solution. Consistent, well-designed metrics are the solution. The dashboard is just one possible interface to those metrics.
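Here's a rough sketch of why "evergreen" holds: the dashboard is derived from the metric standard rather than hand-assembled per service. The metric names are my assumptions about what such a standard might look like.

```python
# Why "evergreen" works: panels come from the metric standard, not from
# hand-picking. Metric names are assumptions about what a standard might be.
STANDARD_METRICS = [
    "http_request_duration_seconds_p95",
    "http_requests_errors_total",
    "http_requests_total",
    "async_processing_duration_seconds_p95",
]

def evergreen_panels(services: list[str]) -> list[dict[str, str]]:
    """One row of standard panels per service; new services show up automatically."""
    return [
        {"service": svc, "metric": metric}
        for svc in services
        for metric in STANDARD_METRICS
    ]
```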
What To Do Instead
If dashboards aren't the answer, what is?
If you're in a truly bad place right now where you have terrible alerts and no consistent metrics, you know what? Poking through random dashboards might actually be your best approach at that point. That's a rough place to be. (And if you find yourself choosing between building your own monitoring infrastructure or using managed solutions, read The Monitoring Trap to understand the hidden costs of that middle ground.)
But there are many ways to avoid that situation in the first place:
Invest in consistent alerting metrics across all services. Every service should expose the same core metrics: request rates, error rates, latency percentiles, resource utilization. When these are standardized, any engineer can investigate any service using the same mental model. This foundation also makes incident response training more effective (see Two-Phase War Games for how to scale this across teams).
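One way to get there is a shared instrumentation module that every service imports. Here's a sketch using prometheus_client; the metric names and label sets are my assumptions, not a prescribed standard.

```python
# A shared instrumentation module every service imports, sketched with
# prometheus_client. Metric names and label sets are assumptions, not a
# prescribed standard.
from prometheus_client import Counter, Histogram

REQUESTS = Counter(
    "service_requests_total",
    "Total requests handled",
    ["service", "endpoint", "status"],
)
REQUEST_LATENCY = Histogram(
    "service_request_duration_seconds",
    "Request latency in seconds",
    ["service", "endpoint"],
)

def observe_request(service: str, endpoint: str, status: int, duration_s: float) -> None:
    """Record a request the same way in every service, so every engineer reads it the same way."""
    REQUESTS.labels(service=service, endpoint=endpoint, status=str(status)).inc()
    REQUEST_LATENCY.labels(service=service, endpoint=endpoint).observe(duration_s)
```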
Build alerts that tell you where to look and what to check. A good alert doesn't just say "something is wrong." It says: "API latency P95 exceeds 2 seconds, likely database query slowdown, check query performance dashboard or database slow query log."
Provide debugging steps in the alert itself. Link directly to relevant debugging metrics, not to a wall of dashboards. Point to the specific query tool, log search, or metric most likely to explain this class of failure.
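Here's a sketch of what that looks like as data rather than prose. The structure is illustrative; the point is that the alert, not a dashboard hunt, says where to look next.

```python
# An alert that carries its own debugging path. The structure is illustrative;
# the steps are the ones from the example alert above.
from dataclasses import dataclass, field

@dataclass
class Alert:
    summary: str
    likely_cause: str
    first_steps: list[str] = field(default_factory=list)

api_latency_alert = Alert(
    summary="API latency P95 exceeds 2 seconds",
    likely_cause="Database query slowdown",
    first_steps=[
        "Check the query performance debugging metrics",
        "Check the database slow query log",
    ],
)
```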
Curate, don't proliferate. If you must have dashboards, maintain them actively. Archive dashboards that haven't been viewed in 90 days. Assign ownership. Review quarterly. Treat dashboards as technical debt that requires active management.
Measure dashboard utility honestly. Track which dashboards get used during incidents. If a dashboard hasn't been accessed during the last 10 incidents, it's probably not useful. Delete it.
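If you want the 90-day rule to be more than a good intention, a small curation pass is enough. This sketch assumes you can export a mapping of dashboard title to last-viewed timestamp from your tooling; the export itself is left hypothetical.

```python
# A curation pass for the 90-day rule. Assumes an exported mapping of
# dashboard title -> last-viewed timestamp (the export is hypothetical);
# timestamps are assumed to be timezone-aware UTC.
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=90)

def archive_candidates(last_viewed: dict[str, datetime]) -> list[str]:
    """Dashboards nobody has opened in 90 days are candidates for archiving."""
    cutoff = datetime.now(timezone.utc) - STALE_AFTER
    return sorted(title for title, seen in last_viewed.items() if seen < cutoff)
```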
The Uncomfortable Truth
Dashboards feel productive. Building them feels like improving observability. Looking at them during incidents feels like doing something.
But hunting through dashboards during an incident means your alerts didn't tell you what you needed to know. It means your debugging metrics aren't organized around actual investigation workflows. It means you're relying on institutional knowledge embedded in ad-hoc visualizations instead of systematic observability design.
If your incident response starts with "check the dashboards," you have an alerting problem, not a dashboard shortage.
Stop building more dashboards. Fix your alerts. Standardize your metrics. Provide debugging paths directly from alerts. Then, if you still need dashboards, build the minimum set that adds value.
Your 3 AM self will thank you for clear alerts and standardized metrics, not for access to 47 dashboards that may or may not be relevant to the current incident.
Takeaway
Dashboards are not inherently bad. Over-reliance on them for incident response is a symptom of deeper observability gaps. Before you build another dashboard, ask:
- Why don't our alerts detect this?
- Why aren't our standardized debugging metrics sufficient for this investigation?
- Are we building a reusable observability pattern or a one-time workaround?
Answer those questions first. Then decide if you actually need that dashboard.

