postmortem

What is a postmortem?

A postmortem is a written review that a team creates after a system outage, bug, security breach, or any other incident. It details what happened, why it happened, and what steps will be taken to avoid the same problem in the future. Think of it as a “lessons learned” report for technical failures.

Let's break it down

Detect the incident - Record the exact time the problem started, when it was detected, and when it was resolved.
Gather data - Collect logs, monitoring graphs, alerts, and any user reports.
Build a timeline - Put events in order to see the chain of cause and effect.
Identify the root cause - Use techniques like “5 Whys” or fishbone diagrams to find the underlying issue.
Create action items - List concrete fixes, process changes, or tooling improvements.
Share and review - Distribute the document to the team, discuss it in a meeting, and track the follow‑up tasks.
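
The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not a standard tool: the Event, ActionItem, and Postmortem classes, their fields, and the sample incident data are all hypothetical names made up for this example. It shows how recorded events can be sorted into a timeline, how the incident duration follows from the start and resolution times, and how unfinished action items can be listed for follow-up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Event:
    """One observation gathered from logs, alerts, or user reports (hypothetical model)."""
    at: datetime
    source: str
    description: str


@dataclass
class ActionItem:
    """A follow-up task produced by the postmortem (hypothetical model)."""
    title: str
    owner: str
    done: bool = False


@dataclass
class Postmortem:
    title: str
    started_at: datetime
    resolved_at: datetime
    events: list[Event] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def duration(self) -> timedelta:
        # Time between the start of the problem and its resolution.
        return self.resolved_at - self.started_at

    def timeline(self) -> list[Event]:
        # Events usually arrive out of order; sort them by timestamp.
        return sorted(self.events, key=lambda e: e.at)

    def open_action_items(self) -> list[ActionItem]:
        # Items that still need tracking after the review meeting.
        return [item for item in self.action_items if not item.done]


# Sample data, invented purely for illustration.
pm = Postmortem(
    title="Checkout API outage (hypothetical)",
    started_at=datetime(2024, 1, 15, 9, 2),
    resolved_at=datetime(2024, 1, 15, 9, 47),
    events=[
        Event(datetime(2024, 1, 15, 9, 30), "log", "Rollback of bad config started"),
        Event(datetime(2024, 1, 15, 9, 5), "alert", "Error-rate alert fired"),
        Event(datetime(2024, 1, 15, 9, 12), "user report", "Customers report failed payments"),
    ],
    action_items=[
        ActionItem("Add config validation to deploy pipeline", owner="alice"),
        ActionItem("Tune error-rate alert threshold", owner="bob", done=True),
    ],
)

for event in pm.timeline():
    print(f"{event.at:%H:%M} [{event.source}] {event.description}")
print("Duration:", pm.duration())
print("Open action items:", [item.title for item in pm.open_action_items()])
```

Keeping events and action items as structured records rather than free text makes it easier to reorder the timeline as new data arrives and to check later which follow-up tasks are still open.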

Why does it matter?

Prevents repeat incidents by fixing the real problem, not just the symptom.
Improves reliability of services, leading to happier users and lower downtime costs.
Builds a learning culture where mistakes are examined openly rather than hidden.
Provides documentation for future engineers who may face similar issues.
Supports accountability by showing who did what and where processes can be improved.

Where is it used?

Site Reliability Engineering (SRE) and DevOps teams after outages.
Software development groups after production bugs.
Security teams after breaches or vulnerability exploits.
IT operations when hardware fails or network incidents occur.
Any organization that runs critical systems and wants to continuously improve.

Good things about it

Turns painful incidents into actionable knowledge.
Encourages cross‑team collaboration and shared responsibility.
Creates a historical record that speeds up future troubleshooting.
Helps prioritize engineering work based on real impact.
Fosters transparency with stakeholders and customers when shared appropriately.

Not-so-good things

Can be time‑consuming, especially if data is scattered or incomplete.
If handled poorly, it may turn into a blame game rather than a learning exercise.
Some teams treat postmortems as a checkbox, resulting in low‑quality reports.
Over‑documenting minor incidents can waste resources.
Repeatedly ignoring action items defeats the purpose and erodes trust.