When the Cloud Breaks

A survival guide for the next blackout, focused on resilience rather than expensive insurance.

If 2025 taught us anything, it’s that the Cloud is not a magic fortress. Between the GCP Service Control crash in June, the AWS US-EAST-1 meltdown in October, and the Cloudflare config error in November, we learned a hard lesson: the internet is not as decentralized as we think.

When the dust settled, the companies that fared best weren’t necessarily the ones who stayed online. They were the ones who managed the aftermath effectively.

The knee-jerk reaction from every board of directors lately is: we need multi-cloud! But true multi-cloud is a financial and operational sinkhole. It doubles your engineering complexity and triples your data egress fees. Deepening resilience within a single ecosystem is often a more practical investment than spreading resources thin across two.

The Lifeboat Concept

During the October AWS outage, dynamic applications became completely unreachable. A lifeboat strategy offers a middle ground. By hosting a static, read-only version of critical landing pages on independent infrastructure like a simple storage bucket or edge network, a business can remain visible. When the main application fails, traffic shifts to the static site. Users see a branded maintenance message rather than an error code, preserving trust while the engineering team works on a fix.
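The traffic shift above has to be triggered by something. One common approach is a watchdog that probes the main application and flips routing after several consecutive failures. The sketch below is a minimal illustration of that idea; the health URL and the `switch_to_lifeboat` hook are assumptions, standing in for whatever DNS or load-balancer update your infrastructure actually uses.

```python
import urllib.request
import urllib.error

# The health-check URL and the switch function are illustrative
# assumptions, not real infrastructure.

def probe_http(url="https://app.example.com/healthz", timeout=2):
    """Return True if the main application's health endpoint answers 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def check_and_failover(probe, consecutive_failures, threshold, switch_to_lifeboat):
    """Run one health check; after `threshold` consecutive failures,
    call `switch_to_lifeboat` (e.g., a DNS or load-balancer update)."""
    if probe():
        return 0  # healthy: reset the failure counter
    consecutive_failures += 1
    if consecutive_failures >= threshold:
        switch_to_lifeboat()
    return consecutive_failures
```

Requiring several consecutive failures before flipping avoids failing over on a single dropped packet, at the cost of a slightly slower reaction to a real outage.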

Managing Retry Storms

The Cloudflare outage in November was complicated by the sheer volume of traffic generated by applications trying to reconnect. When a service fails, automated retry logic can accidentally create a self-inflicted DDoS attack, hammering servers the moment they try to recover.

To prevent those retry storms, implement circuit breakers. This pattern detects when a downstream service is failing and temporarily trips the circuit to stop outgoing requests.
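The pattern can be sketched in a few lines. This is a minimal, illustrative breaker (the thresholds and cooldown are arbitrary assumptions, not values from any particular library): after enough consecutive failures it "opens" and fails fast, then allows a trial call once a cooldown has elapsed.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips open after repeated failures,
    then permits a trial call once a cooldown period has passed."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.opened_at = None  # timestamp when the circuit tripped

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                # Circuit is open: fail fast instead of hammering the service.
                raise RuntimeError("circuit open: skipping call")
            # Cooldown elapsed: half-open, allow one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        # Success: close the circuit and reset the counter.
        self.failure_count = 0
        self.opened_at = None
        return result
```

The key property is the fail-fast branch: while the circuit is open, the struggling upstream service receives no traffic at all from this client.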

Review your retry logic. If your database connection fails, do not retry immediately. Wait 100ms, then 200ms, then 400ms, and add random “jitter” (e.g., 403ms instead of exactly 400ms). This spreads the load and gives the upstream service breathing room to recover.
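That doubling-plus-jitter schedule fits in one small helper. This is a sketch, not a production library; the attempt count and base delay are placeholder values.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1):
    """Retry `operation`, doubling the delay each attempt and adding
    random jitter so clients don't all reconnect in lockstep."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            delay = base_delay * (2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
            delay += random.uniform(0, delay)    # jitter: up to double the wait
            time.sleep(delay)
```

Without the jitter line, thousands of clients that failed at the same instant would all retry at the same instant too, recreating the stampede the backoff was meant to prevent.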

Summary

100% uptime is an elusive goal. A resilient strategy accepts that failure is a possibility and focuses on controlling the experience when it happens. A “good failure” is transparent, protects data, and communicates clearly.
