Brian Conn

The Upstream Root Cause Problem - Why Your Production Fires Start in Product Requirements

Most teams focus on faster incident response. The real solution is preventing incidents from happening in the first place.

After 10+ years of being continuously on-call across multiple SaaS platforms, I've debugged production incidents, database failures, authentication service outages, and scaling crises. Each time, the immediate focus is the same: restore service, minimize customer impact, conduct a post-mortem. Most teams follow a structured incident response process, which is absolutely necessary for operational stability.

But here's what I've learned that most incident response frameworks miss: your operational pain is usually a symptom, not the disease.

Brian Conn

Two-Phase War Games - Scaling Incident Response Training Across Multiple Teams

Traditional war games fall apart when multiple teams try to learn incident response and team dynamics simultaneously. This two-phase approach separates the challenges: homogeneous sessions build incident response skills within existing teams, while heterogeneous sessions focus on cross-team coordination with a shared foundation.

operations, production Brian Conn

The 5 Stages of a Production Incident

Here’s a bit of a paradox: the better you are at solving SaaS production incidents, the harder each incident is to solve.

At first glance, this doesn’t make a lot of sense. Wouldn’t being better make solving production incidents easier? No. The trick is that once you get good at production incidents, you don’t get hit with the easy ones anymore: you solve them for good. That leaves only the new and challenging problems for you to solve. The average incident is more complex, but your reward is that the frequency of incidents goes way down.

I’d take that trade any day.

operations, culture Brian Conn

Why Exciting Operations are Bad

A little excitement in your job is usually a good thing. It could be learning a new programming language, preparing to release a new feature, or taking on new responsibilities as part of a promotion. That’s great for most jobs, but not operations. Let me tell you why.
