When a public cloud goes down, we all act surprised. The internet fills with memes, hot takes and “this is why we’re multicloud” posts faster than you can say “status page update.” But deep down, we know the truth: even the biggest clouds aren’t immune to bad days.
The recent AWS outage wasn’t a shock. It was a reminder that no matter how elastic, scalable and regionally redundant your setup claims to be, somewhere beneath all that abstraction, there are still servers, switches and human hands pushing updates.
Outages happen, and when they do, the question isn’t why — it’s how ready are you?
We prefer not to think about that aspect. We architect for performance and budget for growth. But what about resilience? That’s the part we tend to push down the backlog, somewhere between “optimise storage costs” and “finally fix that alerting script.”
Because the cloud’s always up, right? Until it isn’t.
Resilience isn’t redundancy: it’s design
Redundancy is having two of everything. Resilience is knowing that when one fails, your business doesn’t. That’s a subtle but critical difference.
You can run multi-AZ, multi-region and even multicloud deployments, but if your failover plan exists only in a slide deck, you’re not resilient; you’re hopeful. And hope isn’t a strategy.
True resilience means designing like the cloud will fail. It means asking uncomfortable questions like:
- Can we run if our primary provider disappears?
- Can we spin up our critical workloads elsewhere, or are they tied to a specific ecosystem?
- Do we actually test recovery, or do we merely claim to do so?
That’s where architecture matters. It’s not about chasing 100% uptime. It’s about engineering the ability to keep going when things go wrong.
Resilience is messy, but it’s worth it
Here’s the hard part: resilience doesn’t come in a box. It’s not a checkbox in your cloud console. It’s layers of thought, design and discipline. It’s data replication, isolated recovery environments, hybrid strategies and yes, sometimes even good old-fashioned private infrastructure.
Because resilience isn’t about avoiding the cloud; it’s about not depending on any one thing to stay perfect forever.
When AWS (or Azure, or Google or anyone else) stumbles, the businesses that stay online aren’t the ones with the biggest cloud bills; they’re the ones built for continuity.
Architecting for resilience isn’t something you can toggle on. It’s about designing layers — physical, logical and operational — that can withstand a hit and continue to function.
Start with data because it’s the hardest thing to replace. Replicate it intelligently, across regions or platforms, and understand the trade-offs between consistency, latency and cost.
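As a rough illustration, here’s a minimal boto3 sketch of cross-region replication for object storage. The bucket names, role ARN and regions are placeholders, and both buckets would need versioning enabled before the rule takes effect.

```python
import boto3

# Placeholder names: buckets, role and regions are illustrative only.
SOURCE_BUCKET = "orders-primary-eu-west-1"
DEST_BUCKET_ARN = "arn:aws:s3:::orders-replica-us-east-1"
REPLICATION_ROLE_ARN = "arn:aws:iam::123456789012:role/s3-replication"

s3 = boto3.client("s3", region_name="eu-west-1")

# Cross-region replication requires versioning on both buckets
# (the destination bucket needs it enabled too).
s3.put_bucket_versioning(
    Bucket=SOURCE_BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Replicate every new object to the secondary region. Existing objects
# are not copied retroactively by this rule alone.
s3.put_bucket_replication(
    Bucket=SOURCE_BUCKET,
    ReplicationConfiguration={
        "Role": REPLICATION_ROLE_ARN,
        "Rules": [
            {
                "ID": "replicate-everything",
                "Priority": 1,
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": DEST_BUCKET_ARN},
            }
        ],
    },
)
```

Asynchronous replication like this trades a little consistency (the replica lags by seconds to minutes) for latency and cost, which is exactly the trade-off worth making consciously rather than by default.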
Then, examine the applications. Can they be restarted elsewhere without needing to rewrite everything? That’s where containerisation, orchestration and good dependency management earn their keep.
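One small but illustrative part of that portability is keeping environment-specific wiring out of the code entirely, so the same image can come up wherever capacity happens to be. A minimal sketch, with hypothetical variable names:

```python
import os
from dataclasses import dataclass

@dataclass
class Config:
    """Everything environment-specific is injected, never hard-coded,
    so the same container image can start in another region or
    provider without a rebuild."""
    database_url: str
    object_store_endpoint: str
    region: str

def load_config() -> Config:
    # The variable names here are illustrative, not prescriptive.
    return Config(
        database_url=os.environ["DATABASE_URL"],
        object_store_endpoint=os.environ["OBJECT_STORE_ENDPOINT"],
        region=os.environ.get("REGION", "primary"),
    )
```

If restarting the app elsewhere means editing source code, you don’t have a portable workload; you have a rehosting project waiting to happen.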
Then comes infrastructure: not just having a second cloud account, but genuinely having a recovery path. That could mean a secondary site, a private cloud or a completely isolated recovery environment. The goal isn’t perfect mirroring; it’s survivability. If your architecture can’t run at reduced capacity while things recover, you’ve built convenience, not resilience.
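To make “reduced capacity” concrete, here’s an illustrative sketch of a service that probes a primary dependency and drops into a degraded mode rather than going dark. The endpoint and the modes are assumptions, not anyone’s production design.

```python
import requests

PRIMARY_HEALTH_URL = "https://primary.internal.example/healthz"  # placeholder

def choose_mode() -> str:
    """Return 'full' if the primary dependency answers, otherwise 'degraded'.

    In degraded mode the service might serve cached reads and queue
    writes for later replay: less capability, but not an outage.
    """
    try:
        resp = requests.get(PRIMARY_HEALTH_URL, timeout=2)
        if resp.status_code == 200:
            return "full"
    except requests.RequestException:
        pass
    return "degraded"

if __name__ == "__main__":
    print(f"running in {choose_mode()} mode")
```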
And finally, consider people and process. You can have the best failover tech in the world, but if no one knows when or how to use it, it’s theatre, not strategy. Practice it, automate it and break things on purpose occasionally. Real resilience is learned through controlled chaos.
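For the “break things on purpose” part, even a crude game-day script makes the point. This sketch assumes instances opt in via a hypothetical tag; a real programme would reach for purpose-built tooling such as AWS Fault Injection Service or Chaos Monkey.

```python
import random
import boto3

# Crude game-day sketch: terminate one instance that has explicitly
# opted in to chaos testing via a tag. The tag key and value are
# assumptions for illustration only.
ec2 = boto3.client("ec2", region_name="eu-west-1")

reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-opt-in", "Values": ["true"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    print(f"terminating {victim}; alerting and failover should notice")
    ec2.terminate_instances(InstanceIds=[victim])
else:
    print("no opted-in instances found; nothing to break today")
```

The value isn’t the termination itself; it’s watching whether the alerting, the failover and the on-call runbook behave the way the slide deck says they do.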
The takeaway
Every outage is a reminder that resilience isn’t optional. It’s a competitive advantage and the quiet confidence that when the internet catches fire, your business continues to serve customers as if nothing happened.
It’s fine if your most critical applications live in the cloud. Just make sure they’re not living only there. Resilience isn’t about trusting the cloud less. It’s about trusting your architecture more.