Modern systems don't fail quietly, and neither do the organisations that run them.
When critical technology breaks, the real challenge is rarely the code. It's decision-making under pressure, communication under uncertainty, and leadership when the cost of mistakes is measured in trust, revenue, and real-world impact.
Drawing on years of experience managing incidents in high-stakes financial and banking environments, this book reframes incident management as what it truly is: a leadership discipline, not a debugging exercise.
Using clear analogies from emergency response (fires, floods, earthquakes, and tsunamis), Incident Management provides a practical, human-centred guide to staying effective when everything is on the line.
✔ What an incident really is, and why impact matters more than urgency
✔ How to stabilise situations in the first five critical minutes
✔ Why clearly defined roles outperform heroics under pressure
✔ How communication becomes infrastructure during outages
✔ Techniques for decision-making when information is incomplete
✔ How to manage cascading failures, dependency shocks, and systemic events
✔ Practical guidance on observability, triage, and recovery
✔ How to run blameless post-incident reviews with real accountability
✔ How to build resilient systems and resilient people
This book covers the full lifecycle of incident management, including:
Incident command and team structure
Crisis communication for technical and non-technical stakeholders
Flood control patterns such as rate limiting and backpressure
Dependency failures and staged recovery
Customer trust during outages
Security and compliance incidents
Automation, drills, and simulation
Metrics that matter, and those that mislead
The psychology of stress, fatigue, and group dynamics
Each chapter combines practical frameworks with timeless insight, supported by reflections from philosophy, leadership, and emergency management.
This book is written for:
Technology leaders and engineering managers
Incident commanders and on-call engineers
SRE, platform, and reliability teams
Executives responsible for critical systems
Anyone expected to lead calmly when systems fail
No prior incident management framework is required, just the responsibility to act when things go wrong.
Most books focus on tools, dashboards, and postmortems.
This one focuses on how people think, communicate, and decide under pressure.
Because when systems fail, leadership, not technology, determines the outcome.