Error Budgets for Small Engineering Teams: When to Slow Down and When to Ship

A team I know hit a rough patch earlier this year. Three incidents in two weeks, each one taking out part of the app during peak hours. After the third one, the engineering lead called a release freeze.

Two weeks. No deployments. Everyone had to stop and fix reliability problems before shipping anything new.

The freeze worked, technically. The app stabilized. But in those two weeks, the team shipped nothing. No features. No improvements. The product fell behind.

There's a better framework for this. It doesn't require freezing everything or letting the app rot. It's called an error budget, and you can implement it with a spreadsheet and a monitoring tool you probably already have.

What an Error Budget Actually Is

An error budget is the amount of unreliability you're willing to tolerate before you change how you work.

It starts with an SLO, a Service Level Objective. Your SLO is a target, not a promise. "We want 99.9% uptime" means you accept 0.1% downtime as normal. That 0.1% is your error budget.

Let's do the math for a 30-day month:

99.9% SLO: 43 minutes of allowed downtime per month
99.5% SLO: 3.6 hours of allowed downtime per month
99% SLO: 7.2 hours of allowed downtime per month
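If you want to check these figures for a different target or window, the arithmetic fits in a few lines. This is a minimal sketch; the function name and the 30-day default are illustrative choices, not part of any monitoring tool.

```python
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over a window of `days`."""
    total_minutes = days * 24 * 60
    # The error budget is whatever fraction of the window the SLO does not cover.
    return total_minutes * (100.0 - slo_percent) / 100.0


for slo in (99.9, 99.5, 99.0):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% SLO: {minutes:.1f} minutes ({minutes / 60:.1f} hours)")
```

Changing `days` to 7 or 90 gives the budget for a weekly or quarterly window instead.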

The budget isn't a punishment. It's a resource. When you have budget left, you spend it on shipping features. When you burn through it, you shift focus to reliability work.

This is the key insight: the budget gives you a concrete, data-driven reason to slow down. Not feelings. Not a gut check. You looked at the numbers and the numbers said you need to slow down.

How to Calculate Yours

You need two things: a monitoring tool and a target.

For monitoring, anything that can track request success rates works. Grafana, Datadog, New Relic, even a custom metric you're writing to Prometheus. The point is: you need to measure how often your service fails, not just whether it fails.

For the target, start loose. 99.5% is easier to hit and less demoralizing than 99.9%. You can tighten it later.

The formula:

Error budget consumed = (Actual errors / Allowed errors) x 100%

If your SLO is 99.5%, you allow 0.5% errors per month. If you're running at 0.3% actual errors, you've consumed 60% of your budget with 40% remaining.
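The same formula as a small function, for anyone who would rather script it than keep it in a spreadsheet. This is a sketch under the assumption that both the SLO and the observed error rate are expressed as percentages; the names are made up for illustration.

```python
def budget_consumed_percent(slo_percent: float, actual_error_percent: float) -> float:
    """Share of the error budget used so far, as a percentage."""
    allowed_error_percent = 100.0 - slo_percent
    if allowed_error_percent <= 0:
        # A 100% target allows no errors, so there is nothing to divide by.
        raise ValueError("a 100% SLO leaves no error budget")
    return actual_error_percent / allowed_error_percent * 100.0


consumed = budget_consumed_percent(slo_percent=99.5, actual_error_percent=0.3)
print(f"Consumed: {consumed:.0f}%, remaining: {100 - consumed:.0f}%")
```

The explicit error for a 100% target is a design choice: it makes the "no budget at all" case loud instead of silently returning infinity.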

Graph this over time. If you started the month with a clean slate and you're already at 80% budget consumed by day 15, you know what's coming. The math tells you to slow down before you're in the red.
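One rough way to put a number on "you know what's coming" is to extrapolate the current pace to the end of the window. The sketch below assumes the budget keeps burning at the same average rate, which real incidents rarely do, so treat it as a warning light rather than a forecast. The function name is hypothetical.

```python
def projected_consumption_percent(
    consumed_percent: float, day_of_window: int, window_days: int = 30
) -> float:
    """Linear extrapolation of budget consumption to the end of the window."""
    if not 1 <= day_of_window <= window_days:
        raise ValueError("day_of_window must be between 1 and window_days")
    return consumed_percent * window_days / day_of_window


# 80% consumed by day 15 of a 30-day window
print(projected_consumption_percent(80.0, 15))
```

Anything projecting above 100% means that, at the current pace, the budget runs out before the window does.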

What Slowing Down Actually Looks Like

"Stop shipping" is not the answer. The answer is: be more careful about what ships.

When the error budget is running low, a few concrete actions help:

Extend the staging period. If you normally deploy after 20 minutes of staging, wait for an hour. Run more smoke tests. Actually look at the logs before promoting.

Require sign-off on changes. Not a formal process, just the on-call engineer glancing at what changed before it goes out. One extra pair of eyes catches a surprising amount.

Hold feature flags. Big changes that aren't ready for traffic go behind a flag. Ship the code without exposing users to it. You can roll it out when the budget recovers.

Postpone risky refactors. Database migrations, infrastructure changes, dependency upgrades. These can wait. Move them off the sprint until you're stable.

None of this is dramatic. It's just treating the deployment pipeline with a bit more caution for a limited time.
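If it helps to make the rules explicit, the actions above can be written down as a tiny policy lookup. Everything here is a hypothetical sketch: the `DeployPolicy` fields, the function name, and the 80% cutoff are assumptions for illustration, and the staging times simply reuse the 20-minute and one-hour figures from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployPolicy:
    staging_minutes: int
    require_on_call_signoff: bool
    allow_risky_changes: bool  # migrations, infrastructure changes, dependency upgrades


def policy_for(consumed_percent: float, caution_threshold: float = 80.0) -> DeployPolicy:
    """Pick a deployment posture from how much of the error budget is used."""
    if consumed_percent >= caution_threshold:
        # Budget is stressed: longer staging, extra eyes, no risky changes.
        return DeployPolicy(
            staging_minutes=60, require_on_call_signoff=True, allow_risky_changes=False
        )
    # Budget is healthy: normal pace.
    return DeployPolicy(
        staging_minutes=20, require_on_call_signoff=False, allow_risky_changes=True
    )


print(policy_for(85.0))
```

Writing it down this way means nobody has to argue about what "being more careful" means once the budget gets tight.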

When You Have Budget, Spend It

This part gets ignored. Teams treat error budgets like a zero-sum game where you always need to be at 99.9%.

If you're consistently beating your target with budget to spare, you're being too conservative. You could:

Ship faster. Cut staging from an hour to 20 minutes. Enable auto-deploy. Move faster because you have the safety net.

Pay down technical debt. That migration you've been avoiding? The library upgrade you keep postponing? When reliability is healthy, you have room to absorb a bad outcome.

Reduce monitoring overhead. If you're running at 99.99% and your target is only 99.9%, you can simplify alerting. Less noise means on-call engineers actually pay attention when something real happens.

The goal isn't zero incidents. It's deliberate decisions about risk. A team that's always at 99.9% is either extremely well-resourced or underinvesting in product velocity.

Getting Started Without SRE Expertise

You don't need the full SRE playbook. Start small.

Track one metric. Pick the most important user-facing endpoint or API: the thing that, if it breaks, everyone notices. Track its success rate over time. That's your SLI, the service level indicator your SLO is measured against.

Set a target. 99.5% is fine for most internal tools and early-stage products. B2C apps with revenue on the line might want 99.9%. Pick what matches your actual business requirements.

Check weekly. Every Monday, look at how much budget you've consumed. If you're above 80%, start being more careful. If you're above 100%, you've already burned through and you know what that means.

Graph it. A simple line chart of budget consumed over time tells the story. The team can see: "we were fast in March, hit problems in April, recovered in May." Patterns emerge without anyone having to guess.
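The Monday check can be a few lines of code fed by whatever request counts your monitoring tool exports. This sketch assumes you can pull the total and failed request counts for the window so far; the function name and status labels are invented, and the cutoffs mirror the 80% and 100% marks above.

```python
def weekly_check(
    total_requests: int, failed_requests: int, slo_percent: float = 99.5
) -> tuple[float, str]:
    """Return (budget consumed in percent, status label) for the window so far."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    error_percent = failed_requests / total_requests * 100.0
    allowed_error_percent = 100.0 - slo_percent
    if allowed_error_percent <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    consumed = error_percent / allowed_error_percent * 100.0
    if consumed > 100.0:
        status = "exhausted"
    elif consumed > 80.0:
        status = "caution"
    else:
        status = "healthy"
    return consumed, status


consumed, status = weekly_check(total_requests=100_000, failed_requests=450)
print(f"{consumed:.0f}% of budget consumed: {status}")
```

Appending each week's result to a spreadsheet or CSV file gives you the data for the line chart with no extra tooling.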

Common Mistakes

Setting SLOs for everything. You don't need a budget for every metric. Pick 2-3 that matter. Tracking 40 different SLIs creates noise, not insight.

Choosing targets based on perfectionism. "We want 100% uptime" gives you a 0% error budget. Any incident consumes 100% of your budget immediately. There's no room to be deliberate.

Ignoring the budget. Checking it once a quarter defeats the purpose. The budget is useful as a leading indicator. Weekly review catches problems before they become crises.

Treating reliability work as punishment. The release freeze I described at the start happened because the team was embarrassed, not because they had data. When the budget says slow down, it's not a failure, it's information. The difference matters for team morale.

The Underrated Part

Error budgets work because they separate fact from feeling. When someone says "we're moving too fast," the answer used to be: "I disagree, I think we're fine." Now it's: "let's check the budget."

That changes the conversation. You're not debating whether the team is reckless. You're looking at actual data and making a decision together. The budget removes the moral dimension and replaces it with math.

For small teams where everyone ships everything, this matters. You don't have time for long processes. You need to make fast decisions and move on. An error budget gives you a rule of thumb that doesn't require a committee.
