Karya Semi

Error Budgets for Small Engineering Teams: When to Slow Down and When to Ship

Error budgets sound like an SRE concept for big companies. Turns out they're useful for any team that ships software and wants to make deliberate decisions about reliability vs velocity.

Dian Rijal Asyrof/July 3, 2026/4 min read

A team I know hit a rough patch earlier this year. Three incidents in two weeks, each one taking out part of the app during peak hours. After the third one, the engineering lead called a release freeze.

Two weeks. No deployments. Everyone had to stop and fix reliability problems before shipping anything new.

The freeze worked, technically. The app stabilized. But in those two weeks, the team shipped nothing. No features. No improvements. The product fell behind.

There's a better framework for this. It doesn't require freezing everything or letting the app rot. It's called an error budget, and you can implement it with a spreadsheet and a monitoring tool you probably already have.

What an Error Budget Actually Is

An error budget is the amount of unreliability you're willing to tolerate before you change how you work.

It starts with an SLO, a Service Level Objective. Your SLO is a target, not a promise. "We want 99.9% uptime" means you accept 0.1% downtime as normal. That 0.1% is your error budget.

Let's do the math for a 30-day month:

  • 99.9% SLO: about 43 minutes of allowed downtime per month
  • 99.5% SLO: 3.6 hours of allowed downtime per month
  • 99% SLO: 7.2 hours of allowed downtime per month
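The arithmetic behind these figures is easy to script. Here is a minimal sketch, assuming a 30-day window; the function name `allowed_downtime_minutes` is made up for illustration and does not come from any monitoring tool.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) that an SLO permits over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    # The error budget is the share of the window the SLO does not cover.
    return window_minutes * (100 - slo_percent) / 100


for slo in (99.9, 99.5, 99.0):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% SLO: {minutes:.1f} minutes ({minutes / 60:.1f} hours)")
```

Changing `window_days` lets you run the same calculation for a week or a quarter.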

The budget isn't a punishment. It's a resource. When you have budget left, you spend it on shipping features. When you burn through it, you shift focus to reliability work.

This is the key insight: the budget gives you a concrete, data-driven reason to slow down. Not feelings. Not a gut check. You looked at the numbers and the numbers said you need to stop.

How to Calculate Yours

You need two things: a monitoring tool and a target.

For monitoring, anything that can track request success rates works. Grafana, Datadog, New Relic, even a custom metric you're writing to Prometheus. The point is: you need to measure how often your service fails, not just whether it fails.

For the target, start loose. 99.5% is easier to hit and less demoralizing than 99.9%. You can tighten it later.

The formula:

Error budget consumed = (Actual errors / Allowed errors) x 100%

If your SLO is 99.5%, you allow 0.5% errors per month. If you're running at 0.3% actual errors, you've consumed 60% of your budget with 40% remaining.
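The formula translates directly into a few lines of code. This is a sketch rather than a reference implementation: the function name and its parameters are illustrative, and it assumes you already have the observed error rate as a percentage.

```python
def budget_consumed_percent(slo_percent: float, actual_error_percent: float) -> float:
    """Percentage of the error budget used, given the observed error rate."""
    allowed_error_percent = 100 - slo_percent
    if allowed_error_percent <= 0:
        # A 100% SLO leaves no budget, so there is nothing to divide by.
        raise ValueError("an SLO of 100% leaves no error budget to measure against")
    return actual_error_percent / allowed_error_percent * 100


consumed = budget_consumed_percent(slo_percent=99.5, actual_error_percent=0.3)
print(f"consumed: {consumed:.0f}%, remaining: {100 - consumed:.0f}%")
```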

Graph this over time. If you started the month with a clean slate and you're already at 80% budget consumed by day 15, you know what's coming. The math tells you to slow down before you're in the red.
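One way to turn "you know what's coming" into a number is to extrapolate the burn so far. The sketch below assumes the budget keeps burning at the same average daily rate, which is a simplification, since real incidents arrive in bursts. The function name is hypothetical.

```python
from typing import Optional


def projected_exhaustion_day(consumed_percent: float, day_of_window: int) -> Optional[float]:
    """Estimate the day the budget reaches 100%, assuming the burn rate so far holds.

    Returns None when nothing has been consumed yet.
    """
    if day_of_window <= 0:
        raise ValueError("day_of_window must be positive")
    if consumed_percent <= 0:
        return None
    burn_per_day = consumed_percent / day_of_window
    return 100 / burn_per_day


day = projected_exhaustion_day(consumed_percent=80, day_of_window=15)
print(f"budget runs out around day {day:.1f} of the window")
```

If the projected day falls inside the current window, that is the signal to start slowing down.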

What Slowing Down Actually Looks Like

"Stop shipping" is not the answer. The answer is: be more careful about what ships.

When the error budget is under stress, a few concrete actions help:

Extend the staging period. If you normally deploy after 20 minutes of staging, wait for an hour. Run more smoke tests. Actually look at the logs before promoting.

Require sign-off on changes. Not a formal process, just the on-call engineer glancing at what changed before it goes out. One extra pair of eyes catches a surprising amount.

Hold changes behind feature flags. Big changes that aren't ready for traffic go behind a flag. Ship the code without exposing users to it. You can roll it out when the budget recovers.

Postpone risky refactors. Database migrations, infrastructure changes, dependency upgrades. These can wait. Move them off the sprint until you're stable.

None of this is dramatic. It's just treating the deployment pipeline with a bit more caution for a limited time.
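The actions above can be written down as an explicit policy so nobody has to argue about them mid-incident. This is a sketch under stated assumptions: the `ReleasePolicy` fields, the two presets, and the 80% switch-over point are illustrative choices, not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleasePolicy:
    staging_minutes: int
    require_on_call_signoff: bool
    flag_new_features: bool
    allow_risky_changes: bool  # migrations, infra changes, dependency upgrades


# Hypothetical presets mirroring the staging times mentioned in the article.
NORMAL = ReleasePolicy(staging_minutes=20, require_on_call_signoff=False,
                       flag_new_features=False, allow_risky_changes=True)
CAUTIOUS = ReleasePolicy(staging_minutes=60, require_on_call_signoff=True,
                         flag_new_features=True, allow_risky_changes=False)


def policy_for(consumed_percent: float, caution_threshold: float = 80.0) -> ReleasePolicy:
    """Pick the release policy based on how much budget has been consumed."""
    return CAUTIOUS if consumed_percent >= caution_threshold else NORMAL


print(policy_for(85))
```

Keeping the policy as data means the team can tune the threshold or the staging time without changing the decision logic.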

When You Have Budget, Spend It

This part gets ignored. Teams treat the SLO like a score to maximize, as if they always need to stay above 99.9%.

If you're consistently beating your target with budget to spare, you're being too conservative. You could:

Ship faster. Cut staging from an hour to 20 minutes. Enable auto-deploy. Move faster because you have the safety net.

Tackle technical debt. That migration you've been avoiding? The library upgrade you keep postponing? When reliability is healthy, you have room to absorb a bad outcome.

Reduce monitoring overhead. If you're running at 99.99% and your target is only 99.9%, you can simplify alerting. Less noise means on-call engineers actually pay attention when something real happens.

The goal isn't zero incidents. It's deliberate decisions about risk. A team that's always at 99.9% is either extremely well-resourced or underinvesting in product velocity.

Getting Started Without SRE Expertise

You don't need the full SRE playbook. Start small.

Track one metric. Pick the most important user-facing endpoint or API: the thing that, if it breaks, everyone notices. Track its success rate over time. That's the indicator your SLO is built on.

Set a target. 99.5% is fine for most internal tools and early-stage products. B2C apps with revenue on the line might want 99.9%. Pick what matches your actual business requirements.

Check weekly. Every Monday, look at how much budget you've consumed. If you're above 80%, start being more careful. If you're above 100%, you've already burned through it, and you know what that means.

Graph it. A simple line chart of budget consumed over time tells the story. The team can see: "we were fast in March, hit problems in April, recovered in May." Patterns emerge without anyone having to guess.
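The Monday check can be a single function fed with request counts from whatever monitoring tool you use. A rough sketch: `weekly_status` and its labels are invented for illustration, and it compares failures against the requests seen so far rather than against a forecast of the full month's traffic.

```python
def weekly_status(total_requests: int, failed_requests: int,
                  slo_percent: float = 99.5) -> str:
    """Classify budget health from raw request counts for the window so far."""
    if total_requests <= 0:
        return "no data"
    allowed_failures = total_requests * (100 - slo_percent) / 100
    if allowed_failures <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    consumed = failed_requests / allowed_failures * 100
    if consumed > 100:
        label = "exhausted"
    elif consumed > 80:
        label = "be careful"
    else:
        label = "healthy"
    return f"{consumed:.0f}% consumed: {label}"


print(weekly_status(total_requests=1_000_000, failed_requests=4_500))
```

Appending each week's result to a spreadsheet gives you the data for the line chart.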

Common Mistakes

Setting SLOs for everything. You don't need a budget for every metric. Pick 2-3 that matter. Tracking 40 different SLIs creates noise, not insight.

Choosing targets based on perfectionism. "We want 100% uptime" gives you a 0% error budget. Any incident consumes 100% of your budget immediately. There's no room to be deliberate.

Ignoring the budget. Checking it once a quarter defeats the purpose. The budget is useful as a leading indicator. Weekly review catches problems before they become crises.

Treating reliability work as punishment. The release freeze I described at the start happened because the team was embarrassed, not because they had data. When the budget says slow down, it's not a failure; it's information. The difference matters for team morale.

The Underrated Part

Error budgets work because they separate fact from feeling. When someone says "we're moving too fast," the answer used to be: "I disagree, I think we're fine." Now it's: "let's check the budget."

That changes the conversation. You're not debating whether the team is reckless. You're looking at actual data and making a decision together. The budget removes the moral dimension and replaces it with math.

For small teams where everyone ships everything, this matters. You don't have time for long processes. You need to make fast decisions and move on. An error budget gives you a rule of thumb that doesn't require a committee.
