Infrastructure That Works vs Infrastructure You Can Actually Run

  • Feb 6
  • 3 min read

Updated: Feb 9


Most infrastructure looks fine on paper.


It deploys. It scales. It passes reviews. It follows whatever the current best practices are supposed to be. None of that means it's operable.


There is a difference between infrastructure that technically works and infrastructure that a team can reliably run over time. Most systems fail in that gap.


“It Works” Is a Low Bar 


A system that works usually means:


  • Requests complete successfully

  • Deployments finish without errors

  • Monitoring dashboards show green most of the time

This is table stakes. Cloud platforms are very good at making this easy.


What these systems rarely account for is what happens when:


  • The system behaves in a way nobody predicted

  • The person who built it is unavailable

  • An assumption made two years ago is no longer true

That's where most outages start.


Operability Is Not a Feature 


Operability is not observability tooling. It's not alerts. It's not runbooks that no one reads.


Operability is the accumulated result of dozens of small decisions:

  • How environments differ, and why

  • What's automated, and what's still manual

  • Which failures are expected, and which are treated as surprises

  • Who is allowed to change what, and under what conditions

If these decisions are implicit, the system will eventually become unstable, even if the infrastructure itself is “correct.”
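One way to make decisions like these explicit is to record them as data and check them automatically. The sketch below is purely illustrative — the setting names, fields, and structure are assumptions, not a real tool — but it shows the idea: every environment difference must carry a stated reason, and a missing reason is surfaced rather than silently tolerated.

```python
# Hypothetical sketch: record environment differences as data, each with a
# documented reason, and flag any difference whose reason is missing.
# Setting names and structure are illustrative assumptions.

ENV_DIFFERENCES = [
    {"setting": "db.pool_size", "prod": 50, "staging": 5,
     "reason": "staging shares a small database instance"},
    {"setting": "feature.batch_export", "prod": True, "staging": False,
     "reason": None},  # no documented reason -> flagged below
]

def undocumented_differences(diffs):
    """Return the settings that differ between environments without a stated reason."""
    return [d["setting"] for d in diffs if not d["reason"]]
```

Run in CI, a check like this turns "why is staging different?" from tribal knowledge into a reviewable answer.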


Ownership Is the First Failure Mode 


In many systems, nobody clearly owns:

  • Incident response

  • Capacity decisions

  • Deployment safety

  • Cost trade-offs

  • Long-term maintenance

Ownership is often distributed informally, or assumed to be shared. In practice, that means it's owned by whoever happens to be available when something breaks.
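Informal ownership can be made explicit the same way: as a manifest that names an owner for each operational area, with a check that fails when an area is unowned. The area and team names below are hypothetical, assumed for illustration only.

```python
# Hypothetical sketch: a machine-checkable ownership manifest.
# Area names and team names are illustrative assumptions.

OWNERSHIP = {
    "incident_response": "platform-oncall",
    "capacity_decisions": "platform-leads",
    "deployment_safety": "release-eng",
    "cost_tradeoffs": None,  # unowned -> surfaced by the check
    "long_term_maintenance": "platform-leads",
}

def unowned_areas(manifest):
    """Return operational areas with no named owner."""
    return sorted(area for area, owner in manifest.items() if not owner)
```

The point is not the format; it is that "who owns this?" has exactly one answer, written down, and a gap shows up before an incident does.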


This works until it doesn't.


A system with unclear ownership cannot be reliably operated, regardless of how modern the stack is.


Complexity Accumulates Faster Than Load 


Most teams worry about scaling traffic. Fewer teams measure how fast their systems are becoming harder to understand.


Complexity increases through:

  • Layered abstractions with overlapping responsibilities

  • Multiple deployment paths

  • Conditional behavior based on environment or account

  • Tooling added to compensate for earlier decisions


None of these are inherently wrong. The problem is when they are added without removing anything.


A system can handle 10x traffic and still be less reliable, because nobody fully understands its behavior anymore.


“We’ll Fix It Later” Usually Means Never 


Deferred operational work almost always becomes permanent. Things that are commonly postponed:

  • Clarifying failure modes

  • Simplifying deployment paths

  • Removing unused components

  • Documenting non-obvious decisions

These tasks are not urgent until the system is under stress. When that happens, they are much harder to do safely.


Most systems are not fragile because they are old. They are fragile because the cost of understanding them has quietly grown too high.


If You Can't Explain Recovery, You Are Guessing


A useful test is simple:


  • How does this system fail?

  • How do we know it's failing?

  • How do we recover, step by step?

  • How long does that take, realistically?
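The four questions above can even be treated as required fields of a runbook, with a check that flags the unanswered ones. This is a sketch under assumed field names, not a standard runbook format; the `payments_runbook` example is hypothetical.

```python
# Hypothetical sketch: treat the four recovery questions as required
# runbook fields and flag the ones left unanswered. Field names are
# illustrative assumptions.

REQUIRED_ANSWERS = [
    "failure_modes",           # how does this system fail?
    "detection",               # how do we know it's failing?
    "recovery_steps",          # how do we recover, step by step?
    "expected_recovery_time",  # how long does that take, realistically?
]

def unanswered(runbook):
    """Return the recovery questions this runbook does not answer."""
    return [key for key in REQUIRED_ANSWERS if not runbook.get(key)]

# Example: a runbook that names its failure modes and detection signal,
# but has empty recovery steps and no stated recovery time.
payments_runbook = {
    "failure_modes": ["queue backlog", "db failover"],
    "detection": "latency alert on p99 > 2s",
    "recovery_steps": [],
}
```

If the check can pass, the answers no longer vary depending on who you ask — they live with the system, not with whoever happens to remember them.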


If the answers vary depending on who you ask, the system is being run on institutional memory, not engineering discipline.


That is fine for a while. It does not scale.


What This Actually Requires 


Infrastructure that can be run over time usually has:


  • Fewer moving parts than the design originally allowed

  • Explicit ownership boundaries

  • Boring, repeatable operational paths

  • Decisions written down while they are still fresh

  • A bias toward removing things, not adding them


None of this is glamorous. Most of it does not show up in demos or architecture diagrams.


It's still the difference between a system that survives change and one that fails under it.


TL;DR 


If a system only works because the people who built it are still around, it's not stable. If ownership, recovery, and failure modes are not explicit, the system is being held together by memory.


Most infrastructure problems are not caused by scale or cloud platforms. They are caused by accumulated, undocumented decisions that make systems hard to run.
