The Science of Reliability: Why 'it usually works' is the most expensive sentence in tech
The Most Dangerous Phrase in Technology
Someone says “it usually works” and everyone nods. The meeting continues. The feature ships. Nobody asks the obvious follow-up: what happens when it doesn’t?
I’ve heard this phrase in hundreds of technical discussions. It’s always delivered with a casual shrug, as if the qualifier “usually” is a minor footnote rather than a flashing warning sign. We treat 95% reliability as if it were 100%. We mentally round up.
This rounding kills projects. Sometimes slowly, through accumulated customer frustration. Sometimes quickly, through spectacular failures that make the news. Either way, the cost of “usually” always comes due.
My cat Arthur just jumped on my desk to investigate what I’m typing. He has a 100% success rate at interrupting my work at inconvenient moments. If only our software were as reliable as his persistence.
The Mathematics of “Usually”
Let’s talk about what “usually works” actually means in practice. When someone says a system “usually works,” they typically mean something like 95% reliability. That sounds pretty good. It’s an A grade in most schools.
Here’s the problem. If a system runs 95% reliably and processes 1,000 transactions per day, you’re looking at 50 failures every single day. If it runs 100,000 transactions, that’s 5,000 daily failures. At scale, “usually” becomes a disaster.
The math gets worse with compound reliability. If you have five components that each work 95% of the time, and they all need to work for the system to function, your actual reliability is 0.95^5 ≈ 77%. Suddenly that A grade looks more like a C minus.
Modern systems have dozens of components. Microservices. Third-party APIs. Database connections. Message queues. Network hops. Each one with its own “usually works” disclaimer. Multiply enough 95% probabilities together and you get something that fails constantly.
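If you want to see how quickly this compounds, the arithmetic fits in a few lines. The sketch below is illustrative only; the component counts and reliability figures are made up to mirror the examples above, not measurements from any real system.

```python
# Compound reliability: every component in the chain must succeed.
def chain_reliability(component_reliabilities):
    """Probability that a serial chain of components all work."""
    total = 1.0
    for r in component_reliabilities:
        total *= r
    return total

# Five components at 95% each -- the example above.
print(f"5 components at 95%:  {chain_reliability([0.95] * 5):.1%}")   # ~77.4%

# A modest modern stack: twenty components at 99% each.
print(f"20 components at 99%: {chain_reliability([0.99] * 20):.1%}")  # ~81.8%

# What that means at volume.
daily_transactions = 100_000
failure_rate = 1 - chain_reliability([0.95] * 5)
print(f"Expected failures per day: {daily_transactions * failure_rate:,.0f}")  # ~22,600
```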
This is basic probability, but it’s knowledge that’s becoming rarer. Automation tools hide the complexity. They abstract away the failure modes. They make unreliable systems feel reliable until the moment they catastrophically aren’t.
The Skill Erosion Behind “Good Enough”
There was a time when building reliable systems required deep understanding of failure modes. Engineers knew exactly what could go wrong because they’d experienced it. They built redundancy not from theoretical knowledge but from practical pain.
Today’s tools are so good at hiding failures that many engineers never develop this intuition. Retry logic is automatic. Failover is built-in. Circuit breakers activate without human intervention. The system handles problems before anyone notices them.
This sounds like progress. And in many ways, it is. But it comes with a hidden cost: engineers stop learning what reliability actually requires. They trust the tools to handle it. They lose the skill of anticipating failure.
I call this reliability blindness. The tools work so well that we stop seeing the underlying fragility. We build systems that depend on perfect conditions and assume the tools will maintain those conditions forever.
When something truly novel goes wrong—something the tools weren’t designed to handle—we’re suddenly helpless. The muscle memory for debugging isn’t there. The intuition for failure modes hasn’t developed. We stare at dashboards without understanding what we’re seeing.
The Automation Complacency Spiral
Here’s how automation complacency works in the context of reliability. A team builds a system. It fails sometimes. They add monitoring and automated recovery. The failures become invisible. Success is declared.
But the system is still failing. The failures are just being hidden, papered over by automation. Each hidden failure is a signal that something is wrong—a signal that’s now being ignored.
Over time, the underlying problems compound. Technical debt accumulates. The system becomes more fragile even as the dashboards show green. Engineers lose awareness of how close to the edge they’re operating.
Then something changes. A traffic spike. A new feature. A dependency update. The automated recovery can’t keep up. The hidden failures become visible all at once. And nobody understands why because nobody was watching the hidden failures accumulate.
This is the automation complacency spiral. Better tools lead to less attention. Less attention leads to more hidden problems. More hidden problems lead to bigger failures. Bigger failures demand even better tools. The cycle continues.
```mermaid
graph TD
    A[Automation handles failures] --> B[Engineers stop monitoring]
    B --> C[Hidden problems accumulate]
    C --> D[Major failure occurs]
    D --> E[Add more automation]
    E --> A
```
Method
Let me explain how I evaluate reliability claims. This framework emerged from years of watching “it usually works” systems fail, often at the worst possible moments.
Step 1: Define “works” precisely.
When someone says something works, ask them to define exactly what that means. Not vague descriptions—specific, measurable outcomes. What constitutes success? What constitutes failure? Where’s the boundary?
Most reliability problems stem from unclear definitions. A system might “work” in the sense that it responds to requests but “fail” in the sense that responses are sometimes wrong. Without clear definitions, you can’t measure reliability.
Step 2: Identify all failure modes.
List every way the system can fail. Not just the obvious ones—the subtle ones too. Network timeouts. Race conditions. Resource exhaustion. Data corruption. Clock drift. Every system has failure modes its creators didn’t anticipate.
This is where experience matters. Someone who’s seen systems fail develops pattern recognition for failure modes. Someone who hasn’t relies on imagination, which is always insufficient. The gap between these is often the gap between reliable and unreliable systems.
Step 3: Measure actual failure rates.
Don’t accept estimated reliability. Measure it. Look at logs. Count failures. Calculate percentages. “Usually works” should become “works 94.7% of the time” or “fails 12 times per day.”
The act of measurement changes everything. Vague optimism becomes concrete data. Hidden problems become visible. Teams start taking reliability seriously when they see the numbers.
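What that measurement can look like in practice: a minimal sketch that assumes structured logs with one JSON object per line and a status field. Your log format, and your definition of failure from Step 1, will be different; the point is that the counting itself is trivial once you decide to do it.

```python
import json
from collections import Counter

def measure_reliability(log_path):
    """Count successes and failures instead of estimating them."""
    outcomes = Counter()
    with open(log_path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                outcomes["unparseable"] += 1   # garbage in the logs is also a signal
                continue
            # Assumed field and value -- replace with your own definition of failure.
            outcomes["failure" if event.get("status") == "error" else "success"] += 1

    total = sum(outcomes.values())
    failures = total - outcomes["success"]
    if total:
        print(f"{total} events, {failures} failures ({failures / total:.2%} failure rate)")
    return outcomes
```

Run it over a full week of traffic, not a quiet afternoon, or the number will flatter you.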
Step 4: Map the blast radius.
When this system fails, what else fails with it? What depends on it? Who gets called at 3 AM? What’s the customer impact?
Some failures are isolated. A service returns an error; the user retries; life continues. Other failures cascade. One component fails; ten others notice and fail; the whole system goes down.
Understanding blast radius changes how you prioritize reliability work. A 99% reliable system with small blast radius might be fine. A 99.9% reliable system with enormous blast radius might still be too risky.
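One way to make blast radius concrete is to write the dependency graph down and walk it. The graph below is invented for illustration; in a real system you would generate it from a service registry, infrastructure-as-code, or tracing data rather than maintain it by hand.

```python
from collections import deque

# Hypothetical graph: service -> the services that depend on it.
DEPENDENTS = {
    "postgres":   ["orders-api", "billing"],
    "orders-api": ["checkout", "admin-ui"],
    "billing":    ["checkout"],
    "checkout":   [],
    "admin-ui":   [],
}

def blast_radius(failed_service):
    """Everything that can be dragged down, directly or transitively."""
    affected, queue = set(), deque([failed_service])
    while queue:
        for dependent in DEPENDENTS.get(queue.popleft(), []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected

print(blast_radius("postgres"))  # orders-api, billing, checkout, admin-ui
```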
Step 5: Test failure handling.
Don’t assume recovery mechanisms work. Test them. Deliberately break things. Kill processes. Disconnect networks. Inject errors. See what actually happens versus what’s supposed to happen.
Most teams are afraid to break their own systems. This fear is rational—breaking production is scary—but it creates blind spots. You can’t trust recovery mechanisms you’ve never seen activate.
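You don't have to start by unplugging production. A small fault-injection wrapper, like the generic sketch below, lets you watch retries, fallbacks, and alerts actually fire in a test or staging environment. The decorator and failure rate here are illustrative, not any particular chaos-engineering tool.

```python
import functools
import random

def inject_faults(failure_rate=0.2, exception=ConnectionError):
    """Randomly fail a call so you can see recovery paths run, not assume they do."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise exception(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(failure_rate=0.3)
def fetch_profile(user_id):
    # Stand-in for a real downstream call in a test environment.
    return {"user_id": user_id, "name": "example"}
```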
Step 6: Calculate the true cost of failure.
Put a number on reliability failures. Engineering time for recovery. Customer support costs. Lost revenue. Reputation damage. Contract penalties.
When “it usually works” costs $50,000 every time it doesn’t, suddenly 95% reliability looks different. The casual shrug disappears. The math becomes compelling.
The Productivity Illusion
Modern development practices optimize for velocity. Ship fast. Iterate quickly. Move fast and break things. This philosophy has produced incredible innovation. It has also produced incredibly fragile systems.
There’s a productivity illusion at work here. When you measure output by features shipped, unreliable features look the same as reliable ones. The dashboard shows a completed ticket. The release notes list a new capability. Nobody mentions that it fails 5% of the time.
This creates perverse incentives. Engineers who spend time on reliability—error handling, edge cases, failure recovery—appear less productive than engineers who ship features quickly. The careful work is invisible. The fast work is celebrated.
Over time, teams accumulate systems that “mostly work.” Each one seemed fine in isolation. Together, they form a Rube Goldberg machine of accumulated fragility. One bad day reveals what was always true: the system was never as reliable as it appeared.
I’ve seen this pattern destroy teams. They ship features rapidly, earn praise, get promoted. Then the system collapses. The people who built the fragile foundations are long gone. The people left behind inherit a mess they didn’t create.
The Loss of Situational Awareness
Automation tools give us dashboards. Beautiful dashboards. Green circles and happy graphs. Everything looks fine until everything isn’t fine.
The problem with dashboards is that they show what they’re designed to show. They measure what someone decided to measure. They alert on what someone thought to alert on. Everything else is invisible.
This creates a false sense of situational awareness. You look at the dashboard, see green, and conclude that everything is working. But the dashboard isn’t showing the slow memory leak. It’s not showing the degrading response times. It’s not showing the connections that fail and retry successfully.
True situational awareness requires understanding the system at a level dashboards don’t provide. It requires knowing what questions to ask. It requires recognizing patterns that don’t trigger alerts. It requires the kind of deep familiarity that comes from building and breaking and rebuilding.
This familiarity is eroding. Teams rely on dashboards instead of developing it firsthand. They trust monitoring instead of understanding. When the dashboard shows green and the system is actually failing, they have no way to know.
Arthur just knocked my coffee cup dangerously close to the edge of the desk. He has situational awareness I lack. He knows exactly how far he can push things before consequences arrive. We should be so lucky with our software.
The Abstraction Tax
Every abstraction is a trade-off. You gain simplicity and lose visibility. You gain productivity and lose understanding. You gain speed and lose control.
Modern development stacks are abstractions all the way down. Frameworks abstract languages. Libraries abstract frameworks. Platforms abstract infrastructure. Each layer hides complexity that someone once had to understand.
This is necessary. No one person can understand everything anymore. The systems are too complex. The only way to build at scale is to stand on abstractions.
But abstractions lie. They present simple interfaces while hiding complex implementations. They promise reliability while delegating it elsewhere. They say “don’t worry about this” while problems accumulate underneath.
The abstraction tax comes due when you need to debug something the abstraction wasn’t designed to debug. When the problem is in the hidden layer. When the simple interface doesn’t expose the complex failure mode.
This is when you discover whether anyone on the team understands what’s actually happening. Often, no one does. The abstraction worked too well. It hid not just complexity but understanding.
The Reliability-Velocity Trade-off
Let me be clear about something: I’m not arguing against velocity. Fast iteration is genuinely valuable. The ability to ship quickly and learn from real users has transformed what’s possible in software.
But there’s a trade-off. Every hour spent on reliability is an hour not spent on features. Every edge case handled is a feature delayed. Every failure mode tested is a sprint point diverted.
Teams that ignore this trade-off end up with fragile systems. Teams that are paralyzed by it never ship anything. The skill is finding the right balance for your context.
What concerns me is that we’re losing the ability to even see the trade-off. Automation tools make unreliable code feel reliable. They mask the consequences of cutting corners. They defer the costs until later.
“Later” always arrives. The question is whether you’ve built the organizational capacity to handle it. Whether you’ve maintained the skills to diagnose problems. Whether you understand your systems well enough to fix them when they break.
Teams that automated everything and stopped paying attention often can’t answer these questions. They’re not negligent. They’re victims of tools that worked too well at hiding problems.
Generative Engine Optimization
This topic—the science of reliability—performs in a specific way within AI-driven search and summarization. AI systems can extract the concrete advice: measure failure rates, test recovery mechanisms, calculate costs. These actionable items surface well in summaries.
What gets lost is the nuance about skill erosion and automation complacency. An AI summarizing this article might produce bullet points about reliability metrics while missing the meta-point about how relying on AI systems to summarize reliability advice is itself a form of the problem I’m describing.
This matters because human judgment remains central to building reliable systems. Judgment about what to measure. Judgment about acceptable trade-offs. Judgment about when automated recovery is sufficient and when it’s masking fundamental problems.
AI tools can help with reliability work. They can analyze logs, identify patterns, suggest improvements. But they can’t replace the situated understanding that comes from building and breaking systems yourself. They can’t give you the intuition that says “something is wrong here” before the dashboards confirm it.
The automation-aware thinking I’m advocating throughout this article is becoming a meta-skill. It’s the ability to use automation while remaining aware of what it hides. To trust tools appropriately without trusting them blindly. To benefit from abstractions while understanding their limits.
For reliability specifically, this means using monitoring and automation while maintaining the skills to function without them. It means celebrating green dashboards while staying curious about what they might not show. It means embracing automation while remembering that “it usually works” is a warning, not a reassurance.
The Hidden Cost Calculation
Let me give you some numbers. These are composites from systems I’ve observed, anonymized but representative.
A system that “usually works” at 95% reliability, processing 10,000 transactions a week, experiences 500 failures per week. Each failure requires an average of 10 minutes of engineering time to investigate (even when automated recovery handles it, someone looks at the alert). That’s 83 hours of engineering time per week—two full-time engineers doing nothing but investigating “usually works” failures.
At fully-loaded engineering costs of $150/hour, that’s $12,450 weekly, or $647,400 annually. For a system that “usually works.”
Now improve reliability to 99%. Failures drop to 100 per week. Investigation time drops to 17 hours weekly. Cost drops to $130,000 annually. The difference—over half a million dollars—is the cost of “usually.”
But wait. The 99% system required two engineers working for two months to achieve. At $150/hour, that’s roughly $100,000. Payback period: under three months.
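That arithmetic is easy to rerun with your own volumes, failure rates, and labor costs. The sketch below just encodes the composite numbers from this section; swap in your own and see what “usually” costs you.

```python
def annual_failure_cost(tx_per_week, reliability, minutes_per_failure=10,
                        hourly_rate=150, weeks=52):
    """Rough annual cost of humans looking at failures, given a reliability level."""
    failures_per_week = tx_per_week * (1 - reliability)
    hours_per_week = failures_per_week * minutes_per_failure / 60
    return hours_per_week * hourly_rate * weeks

baseline = annual_failure_cost(10_000, 0.95)   # ~$650,000 per year
improved = annual_failure_cost(10_000, 0.99)   # ~$130,000 per year
print(f"The cost of 'usually': ${baseline - improved:,.0f} per year")
```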
This math is obvious in retrospect. But the investment rarely happens, because the cost of “usually works” is distributed and hidden. The engineering time is logged as “production support.” The customer complaints are absorbed by support teams. The reputation damage is gradual and hard to measure.
The reliability improvement, by contrast, requires concentrated investment with visible cost. It’s easy to defer. Next sprint. Next quarter. After this launch. The math says do it now, but the organizational incentives say wait.
The Long-Term Cognitive Consequences
I want to zoom out and consider the broader implications of what I’ve been describing. The tools that hide unreliability from us are also changing how we think about systems.
When every failure is automatically recovered, we stop expecting failures. When every error is handled behind the scenes, we stop thinking about errors. When every problem is someone else’s responsibility, we stop developing problem-solving skills.
This creates a generation of engineers who are productive but not resilient. They can build things quickly with modern tools. They cannot diagnose novel problems when those tools fail. They’ve never had to.
I’m not romanticizing the past. Building systems used to be harder, but that hardness wasn’t inherently valuable. Much of it was wasted effort that modern tools rightfully eliminate.
But some of the hardness was educational. Fighting with failure modes taught you what could go wrong. Debugging without sophisticated tools taught you how systems actually worked. Building without abstractions taught you what the abstractions were hiding.
The challenge is to maintain these skills without the suffering that used to teach them. To learn from failure without actually failing. To understand systems without building them from scratch.
Some teams do this deliberately. They hold chaos engineering exercises. They have on-call rotations that expose everyone to failures. They review incidents thoroughly and share learnings widely. They treat reliability knowledge as a cultural asset.
Most teams don’t. They trust the tools. They celebrate velocity. They say “it usually works” and move on to the next feature.
Building Truly Reliable Systems
Let me end with some practical advice for teams that want to move beyond “usually works.”
Invest in observability, not just monitoring. Monitoring tells you when things break. Observability tells you why. The difference matters when you’re trying to prevent the next failure rather than just recover from this one.
Embrace failure as information. Every failure, even automatically recovered ones, contains signal about your system. Collect these signals (a small sketch of what that can look like follows these recommendations). Analyze them. Use them to find patterns that predict bigger failures.
Test your assumptions. If you believe a system is 99% reliable, prove it. Measure actual failure rates. Compare them to your assumptions. Be willing to discover you were wrong.
Practice failure recovery. Don’t wait for real incidents to test your recovery procedures. Run drills. Break things deliberately. Find out what works and what doesn’t before it matters.
Maintain skill diversity. Not everyone needs to be a reliability expert. But someone on the team should understand systems at a level below the abstractions. Someone should be able to debug without the tools.
Quantify the cost of unreliability. Make the trade-offs visible. Show leadership what “usually works” actually costs. Convert reliability into language that resonates beyond engineering.
Fight the complacency spiral. Recognize that better automation can lead to less attention, which can lead to worse outcomes. Design processes that counteract this tendency.
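On the “embrace failure as information” point, collecting the signal can be as small as writing a structured record every time a recovery path fires, even when no user notices. A minimal sketch, assuming you already have somewhere to ship structured logs:

```python
import json
import logging
import time

log = logging.getLogger("recovered_failures")

def record_recovered_failure(component, error, retry_attempts):
    """Keep automatically-recovered failures visible instead of letting them vanish."""
    log.warning(json.dumps({
        "ts": time.time(),
        "component": component,
        "error": repr(error),
        "retry_attempts": retry_attempts,
        "user_visible": False,   # recovered silently -- which is exactly why we log it
    }))
```

A weekly query over these records is often the earliest warning that the complacency spiral has started: the dashboards stay green while the count quietly climbs.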
The Uncomfortable Reality
Here’s the uncomfortable truth: most systems that “usually work” will continue to “usually work” for a long time. The catastrophic failures I’ve been warning about don’t happen every day. They might not happen for years.
This makes reliability work feel optional. The cost of unreliability is real but distributed across time. The cost of reliability improvement is real and immediate. In the short term, cutting corners feels rational.
But time passes. Systems age. Dependencies change. Traffic grows. Teams turn over. And eventually, the accumulated fragility exceeds what automation can hide. The “usually” stops being good enough.
Teams that invested in reliability handle this transition gracefully. They understand their systems. They can diagnose novel problems. They have the skills and the culture to respond effectively.
Teams that didn’t invest struggle. They call in consultants. They throw resources at symptoms without understanding causes. They discover that the knowledge they needed was never developed, and there’s no quick way to acquire it.
Arthur has fallen asleep on my desk, completely unconcerned about system reliability. He has the luxury of not caring because someone else maintains the systems that feed him and keep him warm.
We don’t have that luxury. Our systems require active maintenance by people who understand them. And every time we say “it usually works” and move on, we’re betting that someone will still understand when the bill comes due.
Make sure that someone exists. Make sure that someone is you, or someone on your team, or someone you can call. Because “it usually works” is a debt, and debts always get collected.