When the cloud goes dark: what every IT leader should have ready before the next outage

Tags: cloud, cloud application platform
Updated: 08 May 2026

A large cloud outage is never just a technical issue. It's a revenue issue, a reputation issue, and an additional workload for already stretched teams. The major incident that disrupted a global hyperscaler in October 2025, taking down a wide range of internet services for hours, is a useful case study in how quickly failure cascades across identity, DNS, networking, and third-party APIs. It reminded everyone that even world-class platforms can have bad days, and that continuity plans must account for the full web of real dependencies, not just the primary provider.

Our goal here is to clarify what business continuity means in a cloud-first world, why portability matters, and how to prepare realistic recovery paths when a region experiences a major incident.

Why outages still happen today

Complexity drives risk. The Uptime Institute’s analysis notes that while overall outage frequency and severity have trended down, modern architectures introduce new failure modes that operators must actively manage. Within those incidents, IT and networking causes account for a meaningful share and can create cross-provider ripple effects that make headlines. You cannot eliminate outages in a distributed, API-driven world, but you can reduce the blast radius, shorten recovery, and keep the business running by assuming components will fail and designing your application platform to adapt.

The domino effect of downtime

The costs of unplanned downtime compound in ways that rarely appear on a single dashboard. Financially, the exposure is significant: Uptime Intelligence reports that 54 percent of respondents said their most recent significant outage cost more than $100,000, and about one in five reported costs above $1 million. 

Reputation damage accumulates more slowly but lasts longer. Customers may forgive an isolated incident, but repeated outages shape brand perception well after services are restored. Meanwhile, the incident itself consumes exactly the senior engineering attention that should be focused on delivery, and rushed remediations create follow-on risk. 

Overlapping with a security incident makes the picture worse still: IBM's 2025 data puts the average global cost of a data breach at $4.44 million, a figure that underscores how quickly a crisis condition can become a material financial event.

What your CEO and board want to hear when your cloud platform fails

When an outage begins, leadership needs to hear four things: 

First, that you have a current, tested continuity plan: one that names owners, playbooks, and decision thresholds, and that covers failure of identity, DNS, CDN, data stores, and CI systems, not just a cloud provider. NIST SP 800-34 offers a dependable framework for plan structure, roles, and exercises. 

Second, that you can run the business in a degraded state, knowing which services can operate read-only, what features can be shed, and what SLAs you can meet. 

Third, that your platform emphasizes region choice and portability, not as a promise of seamless failover, but as an operational choice that supports disaster recovery and sovereignty. 

And fourth, that you measure resilience work like any other investment: tracking recovery performance against internal objectives, dependency count, and change failure rate, and reporting on incident causes and the trend in recovery time.

A resilience checklist for cloud-first teams

1) Map and minimize critical dependencies

Identify single points of failure across identity, DNS, certificate issuance, artifact registries, object storage, and message queues. Secondary DNS, alternative artifact mirrors, cross-region object replication, and a backup identity assertion path for break-glass access are sensible starting points. Document third-party APIs that are operationally critical and define fallbacks or feature flags for graceful degradation.
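
One way to make that map executable is a dependency register kept in version control next to the code. The sketch below is purely illustrative; the field names and fallback labels are hypothetical conventions, not the schema of any particular tool:

```yaml
# Hypothetical dependency register: each entry is a candidate single
# point of failure with a named fallback or degradation path.
dependencies:
  dns:
    provider: primary-dns-vendor        # placeholder name
    fallback: secondary-dns-zone        # pre-provisioned, low-TTL records
  identity:
    provider: corporate-idp
    fallback: break-glass-local-access  # documented emergency path
  payments-api:
    provider: third-party-gateway
    fallback: queue-and-retry           # accept orders now, settle later
  geo-lookup:
    provider: enrichment-api
    fallback: feature-flag:disable-geo  # shed the feature gracefully
```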

2) Classify services by criticality and failure mode

For each service, document internal recovery objectives, including target time to restore and acceptable data loss, alongside acceptable downgrade modes and the locations where it can run. Prioritise customer-facing pathways that drive cash flow, and decouple analytics and back-office workloads from the hot path whenever possible.
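
A simple way to keep that classification honest is to store it alongside the code and review it in pull requests. The format below is a sketch of a team-defined convention; none of these keys come from a specific platform:

```yaml
# Illustrative service criticality register
checkout:
  tier: 1                        # customer-facing, drives cash flow
  restore_target_minutes: 30     # target time to restore
  data_loss_budget_minutes: 5    # acceptable data loss
  degraded_mode: read-only-catalog
  can_run_in: [eu-west, eu-central]
analytics:
  tier: 3                        # decoupled from the hot path
  restore_target_minutes: 1440
  data_loss_budget_minutes: 240
  degraded_mode: paused
```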

3) Practice game days, not just DR tests

Move beyond scripted restore tests. Inject real fault types such as DNS failures, expired certificates, stalled CI runners, and partial storage unavailability. Include executive stakeholders and practise status updates, customer communications, and vendor escalations in a single exercise.
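
Writing scenarios down makes game days repeatable and comparable over time. A scenario file might look like the following sketch; the fault types and fields are hypothetical and assume whatever injection tooling your team already runs:

```yaml
# Hypothetical game-day scenario definition
scenario: regional-dns-and-cert-failure
faults:
  - type: dns-blackhole        # hostname stops resolving
    target: api.example.com
  - type: certificate-expiry   # an internal cert lapses mid-incident
    target: internal-ca
  - type: ci-runner-stall      # the deploy path is unavailable too
participants: [on-call, comms-lead, vendor-liaison, exec-sponsor]
success_criteria:
  detect_within_minutes: 5
  first_status_update_minutes: 20
  restore_within_minutes: 90
```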

4) Treat data as the contract

Standardise backup and restore procedures with sanitisation to guarantee a clean, time-bounded dataset for testing and recovery. Keep data portability top of mind: if your datastore is managed, ensure you can restore it and run it elsewhere when needed.
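
Sanitisation rules belong in version control too, so every restored test dataset goes through the same pass. A minimal sketch, assuming table and column names from a hypothetical application:

```yaml
# Illustrative sanitisation policy applied to restored copies
sanitise:
  users:
    email: generate-fake     # replace with synthetic addresses
    password_hash: reset     # invalidate all credentials
    phone: drop              # remove the value entirely
  orders:
    card_token: drop         # payment tokens never leave production
retention:
  max_snapshot_age_days: 7   # keep the test dataset time-bounded
```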

5) Bake resilience into delivery

Every change should be deployable with health checks, traffic shifting, and instant rollback. "Everything as code" is not a slogan: define networking and services declaratively so you can reconstruct environments on demand.
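
To make "instant rollback" concrete, the rollout policy itself can live beside the service definition. The shape below is generic and illustrative, not any specific platform's schema:

```yaml
# Illustrative rollout policy: ship gradually, verify, roll back fast
deploy:
  strategy: canary
  steps:
    - traffic_percent: 10
      hold_minutes: 10        # watch health before widening
    - traffic_percent: 50
      hold_minutes: 10
  health_check:
    path: /healthz
    interval_seconds: 15
    failure_threshold: 3
  on_failure: rollback        # automatic return to the last good release
```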

How cloud portability and region choice fit without overreach

Cloud portability is a strategy for choice and operational resilience, not a promise of seamless failover. The aim is to reduce correlated risk and retain the option to restore service at another location when needed: an enabler for disaster recovery plans and region placement rather than a guarantee of lower downtime by default. Gartner identifies digital sovereignty and strategic flexibility as key trends guiding cloud decisions, and portability is central to both.

Use a tiered approach:

  • Tier 1 (critical pathways): Engineer for fast detection and operator-led restoration. Maintain tested playbooks for DNS and identity changes, and ensure data and images can be restored elsewhere.
  • Tier 2 (important but not cash-path): Achieve cross-region resilience within your provider, and keep portability artifacts current so you can rebuild in a different location if required.
  • Tier 3 (internal and analytics): Optimise for cost and simplicity with scheduled backups and a longer recovery window based on internal objectives.

Keep complexity proportional to value. Focus on portability and documented procedures that your team can execute under pressure.

What “designing for failure” looks like on Upsun

Upsun helps enterprises make restoration predictable and repeatable. It is not an automated cross-region failover system. Instead, it gives you the consistency and controls to execute your business continuity and disaster recovery plans.

  • Git-driven, YAML-based configuration: Define services and routing declaratively so you can rebuild environments from a clean Git checkout. See the Upsun platform overview and the documentation.
  • Automatic preview environments per branch: Spin up production-like test environments to rehearse restoration steps, validate feature flags, and exercise dependency changes without risk. Explore developer resources.
  • Managed backup and restore with sanitisation: Create safe, representative datasets for game days and restore tests by cloning environments directly through the platform, with no manual export steps required.
  • Multi-service orchestration: Run heterogeneous stacks with consistent rules so services come back as a unit during restoration.
  • Observability and APM: Centralise metrics, traces, and logs to speed detection and confirm recovery against internal objectives.
  • Region choice: Choose from supported cloud regions to meet data sovereignty and disaster recovery needs. Restoration is initiated and controlled by your team following your playbooks.

Note: Upsun does not perform automated failover across regions or clouds; continuity is achieved through planned restoration procedures initiated by your operators.
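
To give a flavour of the declarative approach, here is an abbreviated sketch of what an application, its database, and its routing can look like in a single Upsun-style config file. It is simplified for illustration; consult the documentation for the authoritative schema and current service versions:

```yaml
# Abbreviated sketch: app, service, and routing declared in one file
applications:
  app:
    type: "nodejs:20"             # runtime version pinned in Git
    relationships:
      database: "db:postgresql"   # wiring declared, not hand-configured
    web:
      commands:
        start: "node server.js"
services:
  db:
    type: "postgresql:16"
routes:
  "https://{default}/":
    type: upstream
    upstream: "app:http"
```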

A practical 30-day continuity plan

Even if your destination is a broader cloud portability strategy, you can materially improve resilience in the next month.

Week 1: Baseline and prioritise

Build a current dependency map, noting identity provider, DNS, CDN, and critical third-party APIs. Define internal recovery objectives for the top five customer-facing services, including target time to restore and acceptable data loss. Pick one critical user journey and define a degraded mode.

Week 2: Prove portability

Build and document a clean restore path to a secondary region or data centre. Restore the primary database into the secondary target and validate it. Capture every step in code or scripts and commit to Git.
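
Those captured steps can be as simple as a runbook file any on-call engineer can execute. The sketch below assumes PostgreSQL and placeholder script names; the commands are examples, not a prescribed toolchain:

```yaml
# Illustrative restore runbook committed to Git
runbook: restore-to-secondary-region
steps:
  - name: provision the target environment
    run: ./scripts/provision.sh --region secondary    # placeholder script
  - name: fetch the latest verified database snapshot
    run: ./scripts/fetch-snapshot.sh --latest
  - name: restore into the secondary database
    run: pg_restore --clean --no-owner --dbname "$SECONDARY_DB_URL" latest.dump
  - name: validate the restored service
    run: ./scripts/smoke-test.sh --target secondary
```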

Week 3: Exercise restoration

Run a disaster recovery exercise that simulates a provider region outage. Practise DNS updates, emergency identity access, and read-only mode while you execute the restoration. Measure time to detect, decide, and restore, and identify where automation reduces manual steps.

Week 4: Automate and communicate

Automate the environment build from Git via a single YAML config, including networking and service definitions. Draft customer and internal communication templates for incidents. Brief the board: present today's baseline, the measured game day results, and the 90-day roadmap for portability and testing cadence.

If you use Upsun, most of these steps map directly to platform features: declarative config, branch-based previews, managed backup and restore with sanitisation, and multi-service orchestration. If you are building in-house, focus on achieving parity in the narrow areas that yield the most risk reduction.

Talking to stakeholders without assigning blame

When an incident originates with a cloud provider, resist the urge to assign public blame. Emphasise that your platform supports region choice and portability, that you have tested restoration procedures and documented playbooks, and that your resilience investment assumes software and networks sometimes fail. 

You follow industry guidance, structuring plans, exercises, and metrics consistent with NIST recommendations. You use region choice and portability to keep your options open and to make restoration predictable, not as a substitute for careful planning.

What to measure next quarter

The metrics that matter most are the ones that reveal whether your plans hold up under real conditions. 

Track recovery performance for Tier 1 services against your internal objectives for restore time and acceptable data loss. Monitor change failure rate and mean time to restore. Delivery quality and resilience tend to move together. Count dependencies on the hot path, since fewer is better and the trend matters as much as the number. 

Run regular portability checkpoints to confirm you can recreate the application in another region from a clean Git checkout and a single config file. Score each restoration drill on steps completed from Git, time to restore data, and on-call team workload. And finally, track the cost of resilience: spend on redundancy and game days against avoided incident hours and reduced business impact.
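
A lightweight scorecard keeps those drills comparable quarter over quarter. The fields below sketch a team-defined record, and the values are made-up examples:

```yaml
# Illustrative restoration drill scorecard (example values)
drill: 2026-Q2-region-outage-exercise
timings_minutes:
  detect: 4
  decide: 11
  restore_data: 58
  full_service: 82
steps_from_git_percent: 90    # share of steps run from committed scripts
manual_interventions: 3       # the number to drive down each quarter
on_call_engineers_involved: 2
follow_ups:
  - script the DNS cutover
  - pre-stage artifact mirror credentials
```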

Cloud outage, business continuity, and cloud portability

If your board is asking for an updated continuity position after a high-profile outage, anchor the conversation on three points:

  1. Design for failure, not for perfection. Set internal recovery objectives for each service, including target time to restore and acceptable data loss. Practise them with disaster recovery exercises.
  2. Portability is preparation. Keep the ability to rebuild in another location documented, scripted, and rehearsed.
  3. Platforms can help. Choose tools that standardise environments and reduce manual steps during restoration. Upsun's Git-driven config, previews, managed backup and restore with sanitisation, orchestration, and observability exist to make your plan executable in practice.

The October 2025 outage is a reminder of something that has always been true: the internet is a system of systems, and no component is immune to disruption. The right response, then and now, is a sober, well-communicated plan that designs for failure, practises restoration, and uses the right platform abstractions to make resilience routine. 

That is how you protect revenue, reputation, and your team's focus when the cloud goes dark.
