Your Dashboards Are a Code Smell (And How to Fix It)
This is one of my saltier posts. You've been warned.
I've been on call for over a decade across production SaaS platforms. I've debugged cascading failures at 3 AM, managed 99.99%+ uptime commitments, and transformed reactive teams into proactive operational excellence cultures. Through all of that, I've learned one uncomfortable truth: if your team relies on dashboards for incident response, you have an observability problem.
Dashboards are the lowest common denominator for monitoring. Over-reliance on them (really, any reliance on them during production incident response) is a code smell for your observability strategy.
The One Dashboard Worth Having
Before you accuse me of dashboard nihilism, let me be clear: some dashboards can be useful. But "useful" and "necessary" are different things.
The one dashboard I've genuinely appreciated was a top-level system throughput visualization. Simple concept: stuff coming in on the left should roughly equal stuff being processed and going out on the right. When input and output volumes diverge, you have a problem.
This dashboard was valuable for a specific reason: it provided a quick visual indicator during triage. When Kafka lag increased or message queue processing times spiked, a glance at this dashboard immediately showed whether more data was flowing in than could be processed out. That visual confirmation was useful.
But here's the critical distinction: the dashboard didn't alert us to the problem. Good alerting metrics did. The dashboard was a debugging tool, not a detection tool. We looked at it after we already knew something was wrong, not to discover that something was wrong in the first place.
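If you want to see how little is actually going on in that picture, here's a rough sketch of the comparison. The metric names, rates, and the 10% tolerance are illustrative assumptions, not values from the real system.

```python
# Rough sketch of the input-vs-output comparison behind that dashboard.
# Rates and the 10% tolerance are illustrative, not from the real system.

def throughput_divergence(ingress_per_sec: float, egress_per_sec: float,
                          tolerance: float = 0.10) -> float | None:
    """Return the fractional gap when input outpaces output, else None."""
    if ingress_per_sec <= 0:
        return None
    gap = (ingress_per_sec - egress_per_sec) / ingress_per_sec
    return gap if gap > tolerance else None

# During triage: "are we falling behind, and by how much?"
gap = throughput_divergence(ingress_per_sec=12_000, egress_per_sec=9_500)
if gap is not None:
    print(f"Processing is behind input by {gap:.0%}")
```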
Alerting Metrics vs. Debugging Metrics
This distinction matters more than most teams realize. Good alerting metrics tell you that there is a problem. Debugging metrics help you understand what the problem is. These are fundamentally different purposes, and trying to build metrics that do both effectively usually results in metric explosion.
When Kafka lag increased, our alerts told us: "Processing is falling behind input; customers no longer have up-to-date alerts." That's an alerting metric. It detects the problem and signals business impact.
The throughput dashboard showing input/output divergence is a debugging metric. It helps explain why processing fell behind. Was input volume spiking? Was processing throughput dropping? Both? The visual made this clear.
Most teams conflate these purposes. They build dashboards hoping they'll both detect problems and explain them. This rarely works. Detection requires focused, high-signal metrics with clear thresholds. Debugging requires context-rich, multi-dimensional data that changes based on the specific issue. This is why designing monitoring tools for the job to be done matters: different users need different interfaces optimized for their specific goals.
Choose one purpose per metric. Build alerts for detection. Use debugging metrics for investigation. Stop expecting dashboards to do both.
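To make "one purpose per metric" concrete, here's a minimal sketch using the Kafka lag scenario above. The 300-second threshold and every name in it are made up for illustration.

```python
# One purpose per metric, sketched with the Kafka-lag scenario from above.
# The 300-second threshold and all names are illustrative assumptions.
MAX_ALERT_DELAY_SECONDS = 300

def check_alert_freshness(consumer_lag_seconds: float) -> str | None:
    """Alerting metric: one focused signal, one threshold, phrased as business impact."""
    if consumer_lag_seconds > MAX_ALERT_DELAY_SECONDS:
        return "Processing is falling behind input; customer alerts are stale."
    return None

def lag_breakdown(per_topic_lag_seconds: dict[str, float]) -> list[tuple[str, float]]:
    """Debugging metric: context-rich and multi-dimensional, consulted after detection."""
    return sorted(per_topic_lag_seconds.items(), key=lambda kv: kv[1], reverse=True)
```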
The Pet Dashboard Problem
Here's what actually happens in most engineering organizations:
An incident occurs. An engineer finds signal in some obscure metric while debugging and creates a dashboard. Now I have my dashboard. You have your dashboard. They have their dashboard. Everyone has their pet dashboards showing the metrics they personally find meaningful.
And lo and behold, dashboards are rarely curated. They just pile up like technical debt you pretend doesn't exist.
This proliferation creates three problems:
First, dashboards become institutional knowledge traps. The metrics I gravitate toward reflect my understanding of the system. New team members lack that context, see unclear dashboards, and build their own. The cycle continues.
Second, pet dashboards rarely get maintained. The engineer who created the dashboard moves to a different team. The service architecture evolves. The metrics that were relevant six months ago may not be relevant today. But the dashboard persists, cluttering your observability tools with stale, misleading data.
Third, dashboard proliferation trains teams to hunt and peck for signal instead of having signal come to them. When your primary debugging strategy is "check these 47 dashboards someone created at some point," you've built an observability system that scales poorly and burns out new engineers who can't decode the institutional knowledge embedded in dashboard choices. This problem compounds quickly: 85% of DevOps teams rely on multiple tools for observability, creating complexity that makes it harder to distinguish signal from noise.
The Reactive Dashboard Trap
Creating dashboards after incidents is inherently reactive. It assumes you'll debug the same issue again.
Here's the thing: if you're debugging the same issue twice, you failed to properly resolve it the first time. And while you're creating those dashboards, you're not addressing the real problem. According to Google's DORA research, elite performing teams recover from failures in less than one hour. They're not achieving that by maintaining extensive dashboard libraries.
Postmortems should identify not just the immediate fix but the systemic gaps that allowed the issue to occur. If a dashboard would have helped, the real question is: why didn't your alerts catch this? Why didn't your existing debugging metrics surface this? What gaps in your observability strategy created this blind spot? As I've discussed in The Upstream Root Cause Problem, your operational pain is usually a symptom, not the disease.
Creating a dashboard after an incident is saying, "Next time this exact thing happens, we'll be slightly better prepared." That's not operational excellence. That's building a museum of past failures.
The better question: what alerting metric would have detected this class of problems before customer impact? What standardized debugging metrics would have made the investigation faster regardless of the specific failure mode?
The Evergreen Exception
There is one scenario where dashboards remain useful: standardized metrics across your entire platform.
If you have consistent alerting metrics for every service (HTTP response latency percentiles, error rates, throughput, async processing times) and dashboards using those standardized metrics to show all services in one view, that can be valuable.
The key is "evergreen." These dashboards stay relevant because they're built on standardized metrics that don't change as individual services evolve. New services automatically appear in these dashboards because they implement the same metric standards. Engineers understand these dashboards because the metrics are consistent across the platform.
But notice: these dashboards only work because of the underlying investment in metric standardization. The dashboard is not the solution. Consistent, well-designed metrics are the solution. The dashboard is just one possible interface to those metrics.
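Here's a rough sketch of why "evergreen" holds: the dashboard is derived from the metric standard rather than hand-assembled per service. The metric names are my assumptions about what such a standard might look like.

```python
# Why "evergreen" works: panels come from the metric standard, not from
# hand-picking. Metric names are assumptions about what a standard might be.
STANDARD_METRICS = [
    "http_request_duration_seconds_p95",
    "http_requests_errors_total",
    "http_requests_total",
    "async_processing_duration_seconds_p95",
]

def evergreen_panels(services: list[str]) -> list[dict[str, str]]:
    """One row of standard panels per service; new services show up automatically."""
    return [
        {"service": svc, "metric": metric}
        for svc in services
        for metric in STANDARD_METRICS
    ]
```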
What To Do Instead
If dashboards aren't the answer, what is?
If you're in a truly bad place right now where you have terrible alerts and no consistent metrics, you know what? Poking through random dashboards might actually be your best approach at that point. That's a rough place to be. (And if you find yourself choosing between building your own monitoring infrastructure or using managed solutions, read The Monitoring Trap to understand the hidden costs of that middle ground.)
But there are many ways to avoid that situation in the first place:
Invest in consistent alerting metrics across all services. Every service should expose the same core metrics: request rates, error rates, latency percentiles, resource utilization. When these are standardized, any engineer can investigate any service using the same mental model. This foundation also makes incident response training more effective (see Two-Phase War Games for how to scale this across teams).
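One way to get there is a shared instrumentation module that every service imports. Here's a sketch using prometheus_client; the metric names and label sets are my assumptions, not a prescribed standard.

```python
# A shared instrumentation module every service imports, sketched with
# prometheus_client. Metric names and label sets are assumptions, not a
# prescribed standard.
from prometheus_client import Counter, Histogram

REQUESTS = Counter(
    "service_requests_total",
    "Total requests handled",
    ["service", "endpoint", "status"],
)
REQUEST_LATENCY = Histogram(
    "service_request_duration_seconds",
    "Request latency in seconds",
    ["service", "endpoint"],
)

def observe_request(service: str, endpoint: str, status: int, duration_s: float) -> None:
    """Record a request the same way in every service, so every engineer reads it the same way."""
    REQUESTS.labels(service=service, endpoint=endpoint, status=str(status)).inc()
    REQUEST_LATENCY.labels(service=service, endpoint=endpoint).observe(duration_s)
```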
Build alerts that tell you where to look and what to check. A good alert doesn't just say "something is wrong." It says: "API latency P95 exceeds 2 seconds, likely database query slowdown, check query performance dashboard or database slow query log."
Provide debugging steps in the alert itself. Link directly to relevant debugging metrics, not to a wall of dashboards. Point to the specific query tool, log search, or metric most likely to explain this class of failure.
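Here's a sketch of what that looks like as data rather than prose. The structure is illustrative; the point is that the alert, not a dashboard hunt, says where to look next.

```python
# An alert that carries its own debugging path. The structure is illustrative;
# the steps are the ones from the example alert above.
from dataclasses import dataclass, field

@dataclass
class Alert:
    summary: str
    likely_cause: str
    first_steps: list[str] = field(default_factory=list)

api_latency_alert = Alert(
    summary="API latency P95 exceeds 2 seconds",
    likely_cause="Database query slowdown",
    first_steps=[
        "Check the query performance debugging metrics",
        "Check the database slow query log",
    ],
)
```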
Curate, don't proliferate. If you must have dashboards, maintain them actively. Archive dashboards that haven't been viewed in 90 days. Assign ownership. Review quarterly. Treat dashboards as technical debt that requires active management.
Measure dashboard utility honestly. Track which dashboards get used during incidents. If a dashboard hasn't been accessed during the last 10 incidents, it's probably not useful. Delete it.
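If you want the 90-day rule to be more than a good intention, a small curation pass is enough. This sketch assumes you can export a mapping of dashboard title to last-viewed timestamp from your tooling; the export itself is left hypothetical.

```python
# A curation pass for the 90-day rule. Assumes an exported mapping of
# dashboard title -> last-viewed timestamp (the export is hypothetical);
# timestamps are assumed to be timezone-aware UTC.
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=90)

def archive_candidates(last_viewed: dict[str, datetime]) -> list[str]:
    """Dashboards nobody has opened in 90 days are candidates for archiving."""
    cutoff = datetime.now(timezone.utc) - STALE_AFTER
    return sorted(title for title, seen in last_viewed.items() if seen < cutoff)
```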
The Uncomfortable Truth
Dashboards feel productive. Building them feels like improving observability. Looking at them during incidents feels like doing something.
But hunting through dashboards during an incident means your alerts didn't tell you what you needed to know. It means your debugging metrics aren't organized around actual investigation workflows. It means you're relying on institutional knowledge embedded in ad-hoc visualizations instead of systematic observability design.
If your incident response starts with "check the dashboards," you have an alerting problem, not a dashboard shortage.
Stop building more dashboards. Fix your alerts. Standardize your metrics. Provide debugging paths directly from alerts. Then, if you still need dashboards, build the minimum set that adds value.
Your 3 AM self will thank you for clear alerts and standardized metrics, not for access to 47 dashboards that may or may not be relevant to the current incident.
Takeaway
Dashboards are not inherently bad. Over-reliance on them for incident response is a symptom of deeper observability gaps. Before you build another dashboard, ask:
- Why don't our alerts detect this?
- Why aren't our standardized debugging metrics sufficient for this investigation?
- Are we building a reusable observability pattern or a one-time workaround?
Answer those questions first. Then decide if you actually need that dashboard.

