Two-Phase War Games - Scaling Incident Response Training Across Multiple Teams
Traditional war games work great for single teams. But what happens when you have three subsystem teams, a dedicated SRE group, and multiple stakeholders who all need to respond to incidents together?
The stakes are higher than most organizations realize. Downtime now costs companies an average of $14,056 per minute, with some outages lasting between 30 minutes and 2 hours. At that rate, a 30-minute outage costs roughly $420,000 and a 2-hour outage roughly $1.7 million. Yet most war games (if you're even running them) fall apart at scale because we're trying to teach team dynamics and incident response simultaneously.
The Multi-Team War Games Problem
This isn't just a training problem. It's a business risk. Most organizations struggle with coordinated incident response across multiple teams, leading to longer resolution times and increased business impact during critical outages.
Here's what goes wrong: Engineers who barely know each other are trying to learn incident commander roles, communication protocols, and cross-team coordination all at once. Instead of focusing on systematic debugging and clear communication, they spend most of the session figuring out how to form a team.
The problem isn't the war games concept itself. It's that we're asking teams to practice two completely different skills at the same time (and as we know, most problems are fundamentally communication problems):
- Learning incident response roles and communication patterns
- Building ad hoc working relationships under pressure
The Two-Phase Solution: Homogeneous Then Heterogeneous
The solution lies in separating these challenges into two distinct phases, each with different goals and team compositions.
Phase 1: Homogeneous War Games
Team Composition: Single subsystem team (3-5 people who work together daily)
Duration: 90-120 minutes per session
Focus: Learning incident response roles and communication patterns
In homogeneous war games, we work with teams who already have established working relationships. If you have subsystem teams A, B, and C, run separate sessions with each team.
Key roles to practice:
- Incident Commander: Coordinates response and makes decisions
- Internal Liaison: Communicates with other engineering teams
- External Liaison: Handles customer and stakeholder communication
- Scribe: Documents timeline and decisions for post-mortem
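The four roles above can be written down as a small data structure so that each session starts with an explicit assignment. This is only a minimal sketch in Python: the `assign_roles` helper and the idea of rotating roles by session number are assumptions made for illustration, not part of the war games format itself.

```python
from enum import Enum


class Role(Enum):
    """The four roles practiced in a homogeneous war game."""
    INCIDENT_COMMANDER = "Coordinates response and makes decisions"
    INTERNAL_LIAISON = "Communicates with other engineering teams"
    EXTERNAL_LIAISON = "Handles customer and stakeholder communication"
    SCRIBE = "Documents timeline and decisions for post-mortem"


def assign_roles(members, session=0):
    """Map every role to a team member, shifting by one person per session.

    Rotation is an assumption of this sketch, so that people practice
    different roles across sessions. With fewer than four members one
    person holds two roles; with five, one person observes.
    """
    if not members:
        raise ValueError("need at least one team member")
    return {
        role: members[(index + session) % len(members)]
        for index, role in enumerate(Role)
    }


if __name__ == "__main__":
    # Hypothetical team of four, second session (session=1).
    for role, person in assign_roles(["Ana", "Ben", "Cy", "Dee"], session=1).items():
        print(f"{role.name}: {person} ({role.value})")
```

Writing the assignment down before the session starts keeps the first minutes of the exercise on the incident rather than on deciding who does what.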
What I do as game master: I simulate all the other teams. When team A needs information from subsystem B, I respond as team B. When they need to escalate to executives, I play the executive role. This allows teams to practice the systematic approach to incident response without the complexity of real cross-team dynamics.
Why this works: Teams can focus entirely on learning systematic debugging approaches and communication protocols without the added complexity of building relationships with strangers under pressure. Since communication problems are at the root of most team challenges, mastering these protocols within trusted relationships first is crucial.
Tooling Integration: During homogeneous sessions, teams practice using actual incident management tools like PagerDuty, your monitoring dashboards, and communication platforms. This isn't just role-playing. Teams should be triggering real alerts, using actual runbooks, and practicing the exact workflows they'll use during production incidents. The game master can simulate external dependencies while teams use production tooling.
Phase 2: Heterogeneous War Games
Team Composition: One representative from each subsystem team plus SRE
Duration: 90-150 minutes
Focus: Ad hoc team formation and cross-team incident coordination
Only after all teams have completed homogeneous training do we run heterogeneous war games. At this point, everyone speaks the common language of incident response roles.
What changes: Instead of simulating other teams, I pull real representatives from each subsystem. The team might include one person from the API team, one from the frontend team, one from the database team, and one from SRE.
The real learning: This is where teams experience the full forming, storming, norming, and performing cycle under incident pressure. They have to quickly establish working relationships, coordinate across unfamiliar systems, and resolve complex issues involving multiple subsystems.
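The two rules for this phase (no heterogeneous session until every team has finished homogeneous training, and one representative per subsystem team plus SRE) can be sketched as a simple roster check. The function name, the team and member names, and the choice to take the first listed member as the representative are all hypothetical; in practice the game master picks who attends.

```python
def build_heterogeneous_team(teams, phase1_done, sre_rep):
    """Return a roster with one representative per subsystem team plus SRE.

    teams:       dict mapping team name -> list of member names
    phase1_done: set of team names that completed homogeneous training
    sre_rep:     name of the SRE participant
    """
    # Gate: every subsystem team must have finished Phase 1 first.
    not_ready = sorted(name for name in teams if name not in phase1_done)
    if not_ready:
        raise ValueError(
            "homogeneous training incomplete for: " + ", ".join(not_ready)
        )

    roster = {}
    for name, members in teams.items():
        if not members:
            raise ValueError(f"team {name!r} has no members to send")
        # Assumption: the first listed member represents the team.
        roster[name] = members[0]
    roster["SRE"] = sre_rep
    return roster


if __name__ == "__main__":
    # Hypothetical teams matching the API / frontend / database example.
    teams = {"api": ["Ana", "Ben"], "frontend": ["Cy"], "database": ["Dee"]}
    print(build_heterogeneous_team(teams, {"api", "frontend", "database"}, "Eli"))
```

Failing loudly when a team has not finished Phase 1 mirrors the point of the sequencing: the heterogeneous session only works once everyone shares the same vocabulary of roles.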
Why This Two-Phase Approach Works
This approach works because it respects the natural team formation process. In a real incident, a newly assembled team often has to move through four stages:
- Forming: Who's on this incident response team?
- Storming: How do we coordinate across these different systems?
- Norming: What's our process for sharing information and making decisions?
- Performing: Actually resolving the incident
Trying to do all four stages while also learning incident response roles is overwhelming. The cost of getting this wrong is substantial. Global 2000 companies lose $400 billion annually to unplanned downtime, representing 9% of their total profits. Homogeneous war games let teams master the norming and performing stages with people they trust. Heterogeneous war games then focus on the forming and storming challenges with a shared foundation.
Remember, this is practice. Don't try to work through these stages for the first time during a real incident.
Getting Started This Week
If you have a single team: Start with homogeneous war games. Even if you don't have multiple subsystem teams yet, practicing roles and communication patterns within your current team builds the foundation for future scaling.
If you have multiple teams: Begin with one homogeneous session per team over the next month. Don't rush to heterogeneous war games until each team feels comfortable with incident response roles.
The goal isn't perfect incident response on day one. It's building the communication patterns and team dynamics that make real incidents manageable instead of chaotic.
Need Help?
Need help running war games within your organization? Contact me at brian@connsulting.io.