The Upstream Root Cause Problem - Why Your Production Fires Start in Product Requirements
Most teams focus on faster incident response. The real solution is preventing incidents from happening in the first place.
After 10+ years of being continuously on-call across multiple SaaS platforms, I've debugged production incidents, database failures, authentication service outages, and scaling crises. Each time, the immediate focus is the same: restore service, minimize customer impact, conduct a post-mortem. Most teams follow a structured incident response process, which is absolutely necessary for operational stability.
But here's what I've learned that most incident response frameworks miss: your operational pain is usually a symptom, not the disease.
Two-Phase War Games - Scaling Incident Response Training Across Multiple Teams
Traditional war games fall apart when multiple teams try to learn incident response and team dynamics simultaneously. This two-phase approach separates the challenges: homogeneous sessions build incident response skills within existing teams, while heterogeneous sessions focus on cross-team coordination with a shared foundation.
The Three Levels of Time Management
Time management sits alongside prioritization and communication as a foundational skill for being an effective team leader. Team leads (like everyone else in most organizations) have more work than they can handle, so what work should they do and when? The key to these decisions is understanding the three levels of time management.
How to Read a Burndown Chart
How long do you spend talking about your burndown chart during your sprint retrospectives? Burndown charts tell you much more than whether you finished all the work in the sprint. From these charts, you can learn:
How well your team is estimating story points
How many injections occur during the sprint
How well your team is breaking down tickets
Bottlenecks in your team’s software development pipeline
We’ll walk through a few examples of burndown charts in this article and discuss what we can learn.
Well Qualified vs Uniquely Qualified
When I was a backend team lead, I would sometimes jump in and help during sprints by writing code or diving into operations. Occasionally I would even be the best person for the job because I had domain knowledge of that service or sub-system.
So why do I always prioritize dev work dead last on my list of to-dos?
Everything is a Communication Problem
I like to joke that every problem, at its core, is a communication problem.
It’s true more often than not, especially when the problem involves well-meaning individuals.
SaaS Developer Priorities
Production SaaS platforms require operations maintenance, support, tech debt payments, bug fixes, and more. This isn’t even counting the feature work customers, sales, and PM are asking for.
So with all this work to do, how do we decide what to do and when? How can we as a team agree on our shared day-to-day priorities? This is a critical challenge to solve, especially now that remote work is so prevalent.
SaaS War Games - Part 3: Running a War Game
Planning is critical to running a successful War Game. One of the core goals is for the incident to feel real, so expect to spend 3-4x as much time planning the War Game as you spend running it.
SaaS War Games - Part 2: War Game Basics
In the first article of this series, we identified a few challenges of production incidents. They’re fast, filled with pressure, and (hopefully) brand new failures. If the best-case scenario is a new failure (remember: repeated failures mean we never solved the problem the first time), how can we practice?
SaaS War Games - Part 1: Getting Comfortable with Being Uncomfortable
It’s 3 AM and your phone is ringing. There’s only one number you let ring through your Do Not Disturb settings. You open one eye and look at the first of 12 on-call notifications.
“Database down. Need help.”
It’s gonna be a long night.