Infrastructure That Works vs Infrastructure You Can Actually Run

  • Feb 6
  • 3 min read

Updated: Feb 9


Most infrastructure looks fine on paper.


It deploys. It scales. It passes reviews. It follows whatever the current best practices are supposed to be. None of that means it's operable.


There is a difference between infrastructure that technically works and infrastructure that a team can reliably run over time. Most systems fail in that gap.


“It Works” Is a Low Bar 


A system that works usually means:


  • Requests complete successfully

  • Deployments finish without errors

  • Monitoring dashboards show green most of the time

This is table stakes. Cloud platforms are very good at making this easy.


What these systems rarely account for is what happens when:


  • The system behaves in a way nobody predicted

  • The person who built it is unavailable

  • An assumption made two years ago is no longer true

That's where most outages start.


Operability Is Not a Feature 


Operability is not observability tooling. It's not alerts. It's not runbooks that no one reads.


Operability is the accumulated result of dozens of small decisions:

  • How environments differ, and why

  • What's automated, and what's still manual

  • Which failures are expected, and which are treated as surprises

  • Who is allowed to change what, and under what conditions

If these decisions are implicit, the system will eventually become unstable, even if the infrastructure itself is “correct.”
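One way to make decisions like these explicit is to record them as data and check them automatically. The sketch below is purely illustrative — the setting names, fields, and structure are assumptions, not a real tool — but it shows the idea: every environment difference must carry a stated reason, and a missing reason is surfaced rather than silently tolerated.

```python
# Hypothetical sketch: record environment differences as data, each with a
# documented reason, and flag any difference whose reason is missing.
# Setting names and structure are illustrative assumptions.

ENV_DIFFERENCES = [
    {"setting": "db.pool_size", "prod": 50, "staging": 5,
     "reason": "staging shares a small database instance"},
    {"setting": "feature.batch_export", "prod": True, "staging": False,
     "reason": None},  # no documented reason -> flagged below
]

def undocumented_differences(diffs):
    """Return the settings that differ between environments without a stated reason."""
    return [d["setting"] for d in diffs if not d["reason"]]
```

Run in CI, a check like this turns "why is staging different?" from tribal knowledge into a reviewable answer.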


Ownership Is the First Failure Mode 


In many systems, nobody clearly owns:

  • Incident response

  • Capacity decisions

  • Deployment safety

  • Cost trade-offs

  • Long-term maintenance

Ownership is often distributed informally, or assumed to be shared. In practice, that means it's owned by whoever happens to be available when something breaks.
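Informal ownership can be made explicit the same way: as a manifest that names an owner for each operational area, with a check that fails when an area is unowned. The area and team names below are hypothetical, assumed for illustration only.

```python
# Hypothetical sketch: a machine-checkable ownership manifest.
# Area names and team names are illustrative assumptions.

OWNERSHIP = {
    "incident_response": "platform-oncall",
    "capacity_decisions": "platform-leads",
    "deployment_safety": "release-eng",
    "cost_tradeoffs": None,  # unowned -> surfaced by the check
    "long_term_maintenance": "platform-leads",
}

def unowned_areas(manifest):
    """Return operational areas with no named owner."""
    return sorted(area for area, owner in manifest.items() if not owner)
```

The point is not the format; it is that "who owns this?" has exactly one answer, written down, and a gap shows up before an incident does.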


This works until it doesn't.


A system with unclear ownership cannot be reliably operated, regardless of how modern the stack is.


Complexity Accumulates Faster Than Load 


Most teams worry about scaling traffic. Fewer teams measure how fast their systems are becoming harder to understand.


Complexity increases through:

  • Layered abstractions with overlapping responsibilities

  • Multiple deployment paths

  • Conditional behavior based on environment or account

  • Tooling added to compensate for earlier decisions


None of these are inherently wrong. The problem is when they are added without removing anything.


A system can handle 10x traffic and still be less reliable, because nobody fully understands its behavior anymore.


“We’ll Fix It Later” Usually Means Never 


Deferred operational work almost always becomes permanent. Things that are commonly postponed:

  • Clarifying failure modes

  • Simplifying deployment paths

  • Removing unused components

  • Documenting non-obvious decisions

These tasks are not urgent until the system is under stress. When that happens, they are much harder to do safely.


Most systems are not fragile because they are old. They are fragile because the cost of understanding them has quietly grown too high.


If You Can't Explain Recovery, You Are Guessing


A useful test is simple:


  • How does this system fail?

  • How do we know it's failing?

  • How do we recover, step by step?

  • How long does that take, realistically?
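The four questions above can even be treated as required fields of a runbook, with a check that flags the unanswered ones. This is a sketch under assumed field names, not a standard runbook format; the `payments_runbook` example is hypothetical.

```python
# Hypothetical sketch: treat the four recovery questions as required
# runbook fields and flag the ones left unanswered. Field names are
# illustrative assumptions.

REQUIRED_ANSWERS = [
    "failure_modes",           # how does this system fail?
    "detection",               # how do we know it's failing?
    "recovery_steps",          # how do we recover, step by step?
    "expected_recovery_time",  # how long does that take, realistically?
]

def unanswered(runbook):
    """Return the recovery questions this runbook does not answer."""
    return [key for key in REQUIRED_ANSWERS if not runbook.get(key)]

# Example: a runbook that names its failure modes and detection signal,
# but has empty recovery steps and no stated recovery time.
payments_runbook = {
    "failure_modes": ["queue backlog", "db failover"],
    "detection": "latency alert on p99 > 2s",
    "recovery_steps": [],
}
```

If the check can pass, the answers no longer vary depending on who you ask — they live with the system, not with whoever happens to remember them.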


If the answers vary depending on who you ask, the system is being run on institutional memory, not engineering discipline.


That is fine for a while. It does not scale.


What This Actually Requires 


Infrastructure that can be run over time usually has:


  • Fewer moving parts than the design originally allowed

  • Explicit ownership boundaries

  • Boring, repeatable operational paths

  • Decisions written down while they are still fresh

  • A bias toward removing things, not adding them


None of this is glamorous. Most of it does not show up in demos or architecture diagrams.


It's still the difference between a system that survives change and one that fails under it.


TL;DR 


If a system only works because the people who built it are still around, it's not stable. If ownership, recovery, and failure modes are not explicit, the system is being held together by memory.


Most infrastructure problems are not caused by scale or cloud platforms. They are caused by accumulated, undocumented decisions that make systems hard to run.
