I've read hundreds of postmortems across five different companies. The pattern is consistent. The good ones change how the team works the next month. The bad ones get filed in a wiki, written mostly so the author can find them the next time they need to defend themselves, and produce zero follow-up.
The difference isn't the tooling, the template, or the rigor of the timeline. It's whether the postmortem treats the incident as a system failure or as a person failure. Treat it as a person failure and the document becomes political. Treat it as a system failure and the document becomes useful.
This is the working definition of blameless postmortems that actually hold up. Not "we don't say names." That's the easy part. The harder discipline is structural: every section should be asking what allowed the failure, not who caused it.
The framing question that decides which kind you're writing
Before anyone writes a single sentence, the question I want answered is:
What about our system made this incident possible at 3 AM with a tired on-call engineer?
If the answer is "that engineer pushed a bad deploy," the postmortem is going to be bad. If the answer involves review processes, alerting, deployment automation, on-call training, or system design, the postmortem has somewhere to go.
The reason this question matters: incidents are caused by chains of small decisions and small gaps, not by single bad choices. The on-call engineer who deployed the bug was the last link in a chain that includes code review, CI checks, deployment automation, monitoring, alerting, runbooks, on-call training, and the team's deployment frequency. Most of those links are the team's responsibility, not that engineer's.
The five sections every useful postmortem has
Section names vary by company. The substance doesn't. Every postmortem that produces follow-up has these:
Summary. Two to four sentences. What broke, who was affected, what the impact was in measurable terms. No timeline. No root cause. Just the impact.
Timeline. A minute-by-minute record of what happened, when, and what the responders did. Each entry should be a fact, not an opinion. "3:14 AM — alerted on elevated 5xx rate" beats "3:14 AM — engineer missed the issue." The timeline is the only section that's purely factual.
Contributing factors. This is where the actual investigation lives. List every gap, every missing check, every slow loop that contributed to either the incident or the slow response. Don't rank them yet. Just enumerate.
What we learned. The systemic takeaways. Not "be more careful." Things like "our alerting thresholds for X depend on a constant that's stale" or "the runbook for X assumes the on-call has prod DB access, but most don't."
Action items. Time-bound, owner-assigned, with a deadline. Each item should be small enough that progress is visible in a week. Big items ("rebuild the deploy system") break into smaller ones.
A blank incident review template tells you what to put in each section.
Why timelines are the spine of the document
The timeline is the only purely factual section, and that is exactly why it's the one section where everyone can agree on what happened. Without it, the postmortem becomes a debate about whose memory is right. With it, the debate moves to "given these facts, what should we change."
A good timeline has:
- Timestamps in a single timezone, ideally UTC.
- Each entry a fact observable in logs, dashboards, or chat.
- Both human actions and automated system actions interleaved.
- Sufficient detail that someone who wasn't there can read the timeline and understand what was happening.
The "what was happening" framing matters. "Engineer X pushed the wrong button" is an interpretation. "Engineer X clicked the rollback button in the deploy UI" is a fact. Stick to facts in the timeline and save interpretation for the contributing factors section.
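One way to hold a timeline to those rules is to treat each entry as a small record and let code do the timezone conversion and the interleaving. This is a minimal sketch, not a prescribed tool: the TimelineEntry fields, the "human"/"system" labels, and the sample incident data are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime   # timezone-aware timestamp
    source: str    # "human" or "system" (hypothetical labels)
    fact: str      # something observable in logs, dashboards, or chat
    evidence: str  # where a reader can verify it


def build_timeline(entries):
    """Convert every timestamp to UTC and interleave entries chronologically."""
    normalized = []
    for e in entries:
        if e.at.tzinfo is None:
            # A naive timestamp can't be placed on a shared clock, so reject it.
            raise ValueError(f"naive timestamp, timezone unknown: {e.fact!r}")
        normalized.append(
            TimelineEntry(e.at.astimezone(timezone.utc), e.source, e.fact, e.evidence)
        )
    return sorted(normalized, key=lambda e: e.at)


def render(timeline):
    """Format entries as one line each, with the timezone spelled out."""
    return [f"{e.at:%Y-%m-%d %H:%M} UTC [{e.source}] {e.fact}" for e in timeline]


# Invented sample data: responders wrote local time, the pipeline logged UTC.
local = timezone(timedelta(hours=-7))
sample = [
    TimelineEntry(datetime(2026, 3, 2, 3, 21, tzinfo=local), "human",
                  "on-call clicked the rollback button in the deploy UI",
                  "deploy UI audit log"),
    TimelineEntry(datetime(2026, 3, 2, 10, 9, tzinfo=timezone.utc), "system",
                  "deploy pipeline finished rollout", "deploy log"),
    TimelineEntry(datetime(2026, 3, 2, 3, 14, tzinfo=local), "system",
                  "alert fired on elevated 5xx rate", "alerting log"),
]

for line in render(build_timeline(sample)):
    print(line)
```

The evidence field is the part I'd keep even if everything else changes: an entry with nowhere to verify it is usually an interpretation in disguise.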
The contributing factors section is where 90% of the value is
If you skip everything else, do this section well. The goal is to surface every system gap that allowed the incident or made it worse. Common categories to look for:
Process gaps. Reviews that didn't catch what they should have. Deployments that didn't have the safety check your team thought they had. Manual steps that should have been automated. Documentation that was outdated.
Tooling gaps. Alerts that fired too late or didn't fire. Dashboards that didn't have the right metric. Logs missing the trace you needed. Runbooks referring to a UI that had been redesigned.
Knowledge gaps. On-call training that didn't cover this kind of failure. Tribal knowledge about how the system actually behaves, or how the tooling actually works, that was never written down.
Design gaps. Architecture decisions that made the failure mode worse. Coupling that turned one component's failure into a cascade. Lack of circuit breakers or rate limits where they would have helped.
Communication gaps. Who knew what, when, and who didn't know they needed to know. Handoff between timezones. Status page updates that didn't reach users in time.
Each is a different angle to consider. The mistake is to pick one ("the engineer missed the alert") and stop.
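A cheap guard against stopping at one factor is to tag each factor with its category and see which categories nobody has looked at yet. This is a rough sketch under my own assumptions: the category keys mirror the list above, and the group_factors helper and sample factors are hypothetical.

```python
from collections import defaultdict

# Category keys mirror the list above.
CATEGORIES = ("process", "tooling", "knowledge", "design", "communication")


def group_factors(factors):
    """Group (category, description) pairs and list categories with no entries."""
    grouped = defaultdict(list)
    for category, description in factors:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        grouped[category].append(description)
    # An empty category is a prompt to look again, not proof that nothing is there.
    unexamined = [c for c in CATEGORIES if c not in grouped]
    return dict(grouped), unexamined


# Invented sample factors for one incident.
factors = [
    ("tooling", "alert fired too late"),
    ("process", "deploy merged without a second reviewer"),
    ("knowledge", "runbook assumes prod DB access that most on-call engineers lack"),
]

grouped, unexamined = group_factors(factors)
print("not yet examined:", unexamined)
```

The point of the unexamined list is to prompt another pass in the review meeting, not to force an entry into every category.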
What "blameless" actually means in practice
People hear "blameless" and think it means "no one is responsible." That's not it at all. Engineers are responsible for their decisions. Blameless means:
- The postmortem doesn't name a person as the root cause.
- The postmortem doesn't describe a person's decision in judgmental language.
- The postmortem looks for the conditions that made the decision possible.
In practice that means:
- "The deploy was merged without a second reviewer" beats "Engineer X merged their own PR."
- "Our deploy automation allows merges without a second approval" beats "Engineer X bypassed the review requirement."
- "Our runbook for this alert hadn't been updated since 2024" beats "Engineer X followed outdated instructions."
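If you want a first-pass check before review, a crude phrase scan can flag judgmental wording in a draft. Treat this as a heuristic sketch only: the phrase list is a hypothetical starting point, and the scan will not catch a sentence like "Engineer X merged their own PR," which blames by framing rather than by vocabulary.

```python
import re

# Hypothetical starter list; a real team would tune it to its own vocabulary.
JUDGMENTAL = (
    "missed", "failed to", "forgot", "bypassed",
    "ignored", "careless", "should have known",
)
# Word boundaries keep "missed" from matching inside "dismissed".
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in JUDGMENTAL) + r")\b",
    re.IGNORECASE,
)


def flag_blame_language(text):
    """Return (line_number, phrase) for each judgmental phrase in the draft."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            hits.append((number, match.group(0).lower()))
    return hits


draft = (
    "Engineer X bypassed the review requirement.\n"
    "Our deploy automation allows merges without a second approval."
)
print(flag_blame_language(draft))
```

A hit is a prompt to rewrite the sentence around the condition that made the decision possible, not a verdict on the author.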
When everyone in the postmortem understands that the document isn't going to read like an HR write-up, they share the actual contributing factors instead of the polished version. The polished version protects people. The honest version fixes systems.
Action items that actually ship
The most common failure mode in postmortems is action items that don't get done. They look reasonable in the meeting. They sit on a list for three months. They never make it into working memory.
Three rules make action items more likely to land:
One owner, not a team. "Platform team" isn't an owner. "Sam" is an owner. If the work isn't owned by a single person with a calendar slot, it doesn't get done.
Small and time-bound. "Fix the deployment system" is too big. "Add a CI check that fails when a deprecated API is used, by 2026-07-15" is small enough that progress is visible in two weeks.
Visible to the rest of the team. Action items live in the team's regular project tracker, not in a wiki that nobody visits. The reminder system that drives the rest of the team's work drives this work too.
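These three rules are mechanical enough to check before the postmortem is closed. The sketch below is one possible shape, with assumptions flagged: the ActionItem fields, the TEAM_NAMES set, the max_days threshold, and the "OPS-123" ticket ID are all invented, and the right horizon depends on your team.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical names that denote a group rather than a person.
TEAM_NAMES = {"platform team", "backend team", "sre", "everyone"}


@dataclass
class ActionItem:
    title: str
    owner: Optional[str]
    due: Optional[date]
    tracker_id: Optional[str]  # ticket in the team's regular project tracker


def problems(item, opened, max_days=14):
    """Return the rules this item breaks; an empty list means it passes."""
    issues = []
    if not item.owner or item.owner.strip().lower() in TEAM_NAMES:
        issues.append("needs a single named owner")
    if item.due is None:
        issues.append("needs a deadline")
    elif (item.due - opened).days > max_days:
        # max_days is an assumed threshold, not a rule from the text.
        issues.append(f"deadline is more than {max_days} days out; split the item")
    if not item.tracker_id:
        issues.append("not filed in the team tracker")
    return issues


opened = date(2026, 7, 1)  # invented date the postmortem was held
items = [
    ActionItem("Add a CI check that fails when a deprecated API is used",
               "Sam", date(2026, 7, 15), "OPS-123"),
    ActionItem("Fix the deployment system", "Platform team", None, None),
]

for item in items:
    print(item.title, "->", problems(item, opened) or "ok")
```

Running a check like this at the end of the review meeting is cheaper than discovering three months later that an item never had an owner.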
If you can't get action items to land, the postmortem is theater. Either fix the action item process or stop writing postmortems. A team that writes 50 postmortems a year and does the follow-up work on 10 of them is worse off than a team that writes 10 and follows up on all 10.
What changes about engineering when this discipline sticks
Teams that run real blameless postmortems long enough start behaving differently:
- Engineers report their own mistakes earlier, because the team's response is "let's fix the system" rather than "let's find who's at fault."
- New engineers get pulled into the postmortem culture as readers and eventually writers. The accumulated knowledge in past postmortems becomes a training surface.
- Engineering process improvements start connecting to incidents instead of being defended abstractly.
- Code review culture evolves. People ask "what would the postmortem for this look like?" before approving risky changes.
- Incident frequency drops. Not because incidents stop altogether, but because the systemic gaps behind recurring classes of incidents get closed one by one.
The biggest non-obvious benefit: trust. People on a team that runs postmortems well trust each other more. Engineers don't hide their mistakes. Juniors aren't afraid to break things on purpose to learn the system. The blast radius of any individual decision shrinks because the scaffolding catches what the individual misses.
The simplest test for whether a postmortem worked
A month after the postmortem, ask three questions:
- Did every action item land, and did the team's behavior change because of it?
- Did anyone reference the postmortem during normal work in the intervening weeks?
- Did the next similar incident either not happen or resolve much faster?
If yes to all three, the postmortem did its job. If no to any, the postmortem was paperwork and the team should change the format. The goal isn't to write good documents. The goal is to get better at shipping reliable software. The documents are just the artifact that makes the improvement traceable.



