When the cloud goes dark: what every IT leader should have ready before the next outage
28 October 2025
A large cloud outage is never just a technical issue. It's a revenue issue, a reputation issue, and an additional workload for already stretched teams. On Monday, October 20, 2025, a global hyperscaler experienced a major incident that disrupted many internet services for hours, with recovery progressing throughout the day.¹² The episode reminded everyone that even world-class platforms can have bad days; therefore, continuity plans must account for real dependencies across identity, DNS, networking, and third-party APIs.³
Our goal here is to clarify what business continuity means in a cloud-first world, why portability matters, and how to prepare realistic recovery paths when a region experiences a major incident.
Why outages still happen today
Complexity drives risk. The Uptime Institute’s latest analysis notes that while overall outage frequency and severity have trended down, modern architectures introduce new failure modes that operators must actively manage.⁴⁵ Within those incidents, IT and networking causes account for a meaningful share and can create cross-provider ripple effects that make headlines.⁶ You cannot eliminate outages in a distributed, API-driven world. You can reduce blast radius, shorten recovery, and maintain business operations by assuming components will fail and designing your application platform to adapt.
The domino effect of downtime
Revenue loss: Outages are expensive. Uptime Intelligence reports that 54 percent of respondents said their most recent significant outage cost more than $100,000, and about one in five reported costs above $1 million.⁷⁸
Reputation damage: Customers may forgive an outage, but repeated incidents shape brand perception long after services are restored.
Team load: Incidents consume senior engineering attention, slow delivery, and create follow-on risk from rushed remediations.
Security exposure: Crisis conditions increase the likelihood of configuration mistakes. IBM’s 2025 data shows the average global cost of a data breach at $4.44 million, underscoring the material impact when incidents overlap.⁹
What your CEO and board want to hear when your cloud platform fails
We have a current, tested continuity plan. It names owners, playbooks, and decision thresholds. It covers failure of identity, DNS, CDN, data stores, and CI systems, not just a cloud provider. NIST SP 800-34 offers a dependable framework for plan structure, roles, and exercises.¹⁰¹¹
We can run the business in a degraded state. We know which services can operate read-only, what features we can shed, and what SLAs we can meet.
Our platform emphasises region choice and portability. This is not a promise of seamless failover. It is an operational choice that supports disaster recovery, sovereignty, and negotiating position. Gartner identifies multi-cloud and digital sovereignty as key trends guiding cloud strategies.¹²
We measure resilience work like any other investment. We track recovery performance against internal objectives, dependency count, and change failure rate. We report on incident causes and on how recovery times improve over time.
A resilience checklist for cloud-first teams
1) Map and minimise critical dependencies
Identify single points of failure across identity, DNS, certificate issuance, artifact registries, object storage, and message queues.
Dual-home where it is sensible: secondary DNS, alternative artifact mirrors, cross-region object replication, and a backup identity assertion path for break-glass access (see the inventory sketch after this list).
Document third-party APIs that are operationally critical and define fallbacks or feature flags for graceful degradation.
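The dependency map itself can live in version control. The sketch below is a hypothetical inventory format (the file name, field names, and providers are illustrative, not a standard): each critical dependency records its provider, whether a fallback exists, and where the single points of failure still are.

```yaml
# dependencies.yaml (illustrative inventory of critical dependencies; hypothetical format)
dependencies:
  - name: dns
    provider: primary-dns-provider
    single_point_of_failure: false
    fallback: secondary-dns-provider        # dual-homed, keep TTLs low on critical records
  - name: identity
    provider: corporate-idp
    single_point_of_failure: false
    fallback: break-glass-local-accounts    # sealed credentials, audited on use
  - name: artifact-registry
    provider: hosted-registry
    single_point_of_failure: true           # flagged, no mirror yet
    fallback: null
  - name: payments-api
    provider: third-party-psp
    single_point_of_failure: true
    fallback: feature-flag-queue-offline-payments
```

Reviewing a file like this in the same pull-request flow as application changes keeps the map current, which is usually the hardest part.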
2) Classify services by criticality and failure mode
For each service, document internal recovery objectives, including target time to restore and acceptable data loss, the downgrade modes you will accept, and the locations where it can run (see the register sketch after this list).
Prioritise customer-facing pathways that drive cash flow. Decouple analytics and back-office workloads from the hot path whenever possible.
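One lightweight way to keep these classifications honest is a service register kept next to the code. The snippet below is a hypothetical format; the service names, tiers, and objective values are examples only.

```yaml
# service-register.yaml (hypothetical criticality register; names and values are examples)
services:
  checkout:
    tier: 1
    target_restore_minutes: 60          # internal objective, not a contractual SLA
    acceptable_data_loss_minutes: 5
    degraded_modes: [read-only-catalogue, queue-orders-for-later]
    can_run_in: [eu-west-primary, eu-central-secondary]
  reporting:
    tier: 3
    target_restore_minutes: 1440
    acceptable_data_loss_minutes: 240
    degraded_modes: [paused]
    can_run_in: [eu-west-primary]
```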
3) Practice game days, not just DR tests
Move beyond scripted restore tests. Inject real fault types such as DNS failures, expired certificates, stalled CI runners, and partial storage unavailability (a scenario sketch follows this list).
Include executive stakeholders. Practice status updates, customer communications, and vendor escalations in a single exercise.
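Game days are easier to repeat and compare when the scenario is written down as code. The sketch below is an illustrative scenario file, not a specific tool's format: it names the faults to inject, who takes part, and what counts as success, including the communication steps.

```yaml
# gameday-2025-q4.yaml (illustrative scenario definition; not a specific tool's format)
scenario: regional-provider-outage
faults:
  - dns-resolution-failure-for-primary-region
  - expired-tls-certificate-on-api-gateway
  - stalled-ci-runners
participants: [on-call-sre, platform-lead, comms-lead, exec-sponsor]
success_criteria:
  detect_within_minutes: 10
  decide_within_minutes: 30
  customer_status_update_within_minutes: 20
  critical_journey_restored_within_minutes: 60
communications:
  - internal-status-update-every-30-minutes
  - customer-notice-from-template
```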
4) Treat data as the contract
Standardise backup and cloning policies with sanitisation. Guarantee a clean, time-bounded dataset for testing and recovery (see the policy sketch after this list).
Keep data portability top of mind. If your datastore is managed, ensure you can export, rehydrate, and run it elsewhere when needed.
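A backup and cloning policy can be declared rather than described in a wiki. The sketch below is a hypothetical policy file, assuming a PostgreSQL datastore; the field names, tables, and columns are illustrative. It fixes the backup cadence, the export format used for portability, and the sanitisation rules applied before a clone leaves production.

```yaml
# data-policy.yaml (hypothetical backup, cloning, and sanitisation policy)
backups:
  schedule: "0 2 * * *"            # daily at 02:00 UTC
  retention_days: 30
  export_format: pg_dump-custom    # assumes PostgreSQL; keeps restores portable
cloning:
  allowed_targets: [staging, preview, game-day]
  sanitisation:
    - table: users
      columns: [email, phone, full_name]
      strategy: mask
    - table: payments
      columns: [card_token]
      strategy: drop
```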
5) Bake resilience into delivery
Every change should be deployable with health checks, traffic shifting, and instant rollback.
“Everything as code” is not a slogan. Define networking, policies, and services declaratively so you can reconstruct environments on demand (a minimal example follows).
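In practice, that can be as simple as a rollout policy that ships with each service. The snippet below is illustrative rather than tied to a particular deployment tool: it declares the health check, how traffic shifts, and when an automatic rollback triggers.

```yaml
# rollout-policy.yaml (illustrative deployment policy; not a specific tool's schema)
deploy:
  health_check:
    path: /healthz
    interval_seconds: 10
    failure_threshold: 3
  traffic_shift:
    strategy: canary
    steps_percent: [5, 25, 50, 100]
    hold_minutes_per_step: 10
  rollback:
    automatic: true
    triggers:
      - error_rate_above_1_percent
      - failed_health_check
```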
How multi-cloud fits without overreach
Multi-cloud is a strategy for choice and portability, not a promise of seamless failover. The aim is to reduce correlated risk and retain the option to restore service at another location when needed. Treat it as an enabler for disaster recovery plans and region placement, rather than a guarantee of lower downtime by default.¹²
Use a tiered approach (sketched as a simple mapping after the list):
Tier 1 (critical pathways): Engineer for fast detection and operator-led restoration. Maintain tested playbooks for DNS and identity changes, and ensure data and images can be rehydrated elsewhere.
Tier 2 (important but not cash-path): Achieve cross-region resilience within a single provider, and keep portability artifacts current so you can rebuild in a different location if required.
Tier 3 (internal and analytics): Optimise for cost and simplicity with scheduled backups and a longer recovery window based on internal objectives.
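The tiers themselves can be captured in the same register so nobody has to reconstruct them mid-incident. The sketch below is an illustrative tier-to-strategy mapping; the cadences and artifacts are examples, not recommendations.

```yaml
# tiers.yaml (illustrative tier-to-strategy mapping; values are examples)
tiers:
  tier-1:
    strategy: operator-led-restoration-to-an-alternate-location
    drills_per_year: 4
    portability_artifacts: [config-in-git, database-export, container-images]
  tier-2:
    strategy: cross-region-resilience-within-one-provider
    drills_per_year: 2
    portability_artifacts: [config-in-git, database-export]
  tier-3:
    strategy: scheduled-backups-and-rebuild-on-demand
    drills_per_year: 1
    portability_artifacts: [backups]
```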
Keep complexity proportional to value. Focus on portability and documented procedures that your team can execute under pressure.
What “designing for failure” looks like on Upsun
Upsun helps enterprises make restoration predictable and repeatable. It is not an automated cross-region or cross-cloud failover system. Instead, it gives you the consistency and controls to execute your business continuity and disaster recovery plans.
Git-driven, YAML-based configuration: Define services and routing declaratively so you can rebuild environments from a clean Git checkout. See the Upsun platform overview and the documentation; an illustrative sketch appears after this list.
Automatic preview environments per branch: Spin up production-like test environments to rehearse restoration steps, validate feature flags, and exercise dependency changes without risk. Explore developer resources.
Instant data cloning with sanitisation: Create safe, representative datasets for game days and restore tests.
Multi-service orchestration: Run heterogeneous stacks with consistent rules so services come back as a unit during restoration.
Observability and APM: Centralise metrics, traces, and logs to speed detection and confirm recovery against internal objectives.
Portability and region choice: Maintain portability across supported clouds and locations, including data sovereignty needs. Restoration is initiated and controlled by your team following your playbooks.
Important: Upsun does not perform automated failover across regions or clouds; continuity is achieved through planned restoration procedures initiated by your operators.
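To make the declarative point concrete, here is a minimal sketch of the kind of Git-tracked configuration the post has in mind. It is illustrative only: the exact file location, keys, and service versions for Upsun should be taken from the Upsun documentation rather than copied from this sketch.

```yaml
# Illustrative sketch of a declarative platform config; verify exact keys against the Upsun docs
applications:
  app:
    type: "nodejs:20"
    relationships:
      database: "db:postgresql"
services:
  db:
    type: "postgresql:16"
routes:
  "https://{default}/":
    type: upstream
    upstream: "app:http"
```

Because the whole topology lives in one version-controlled file, rebuilding an environment elsewhere starts from a checkout rather than from memory.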
A practical 30-day continuity plan
Even if your destination is a broader multi-cloud architecture, you can materially improve resilience in the next month.
Week 1: Baseline and prioritise
Build a current dependency map. Note identity provider, DNS, CDN, and critical third-party APIs.
Define internal recovery objectives for the top five customer-facing services, including target time to restore and acceptable data loss.
Pick one critical user journey and define a degraded mode (a sketch follows this list).
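A degraded mode is only real if it is written down and switchable. The sketch below is a hypothetical definition for a checkout journey; the journey, flag, and team names are examples. It states what stays up, what is shed, and what the customer sees.

```yaml
# degraded-modes/checkout.yaml (hypothetical degraded-mode definition; names are examples)
journey: checkout
mode: read-only-catalogue
keep:
  - product-browsing
  - existing-order-status
shed:
  - new-order-placement            # queued or disabled while degraded
  - personalised-recommendations
customer_message: "Ordering is temporarily paused; browsing and order tracking still work."
activation: "feature flag degraded_checkout, set by the on-call engineer"
owner: payments-team
```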
Week 2: Prove portability
Build and document a clean restore path to a secondary region or data centre.
Export and rehydrate the primary database into the secondary target.
Capture every step in code or scripts and commit to Git, for example as a runbook like the sketch below.
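Keeping the restore path as a runbook in Git makes it reviewable and repeatable. The sketch below assumes a PostgreSQL database and the standard pg_dump, pg_restore, and psql tools; the file name, step names, environment variables, and table name are illustrative, and the equivalent commands will differ for other datastores.

```yaml
# runbooks/rehydrate-secondary.yaml (illustrative; assumes PostgreSQL and environment variables you define)
runbook: rehydrate-database-in-secondary-region
steps:
  - name: export-primary
    command: pg_dump --format=custom --no-owner --file=backup.dump "$PRIMARY_DATABASE_URL"
  - name: copy-to-secondary
    command: rsync -av backup.dump restore-host:/restore/backup.dump
  - name: restore-into-secondary
    command: pg_restore --clean --if-exists --no-owner --dbname="$SECONDARY_DATABASE_URL" /restore/backup.dump
  - name: verify-freshness
    command: psql "$SECONDARY_DATABASE_URL" -c "SELECT max(created_at) FROM orders;"   # table name is a placeholder
```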
Week 3: Exercise restoration
Run a disaster recovery exercise that simulates a provider region outage. Practise DNS updates, emergency identity access, and read-only mode while you execute the restoration.
Measure time to detect, decide, and restore. Identify where automation reduces manual steps, and capture the results in a simple scorecard, as sketched below.
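Recording each exercise in a consistent format makes the trend visible from one quarter to the next. The sketch below is an illustrative drill record; the date and numbers are placeholders.

```yaml
# drills/2025-11-region-outage.yaml (illustrative drill record; numbers are placeholders)
drill: simulated-provider-region-outage
date: 2025-11-12
minutes_to_detect: 8
minutes_to_decide: 22
minutes_to_restore: 74
manual_steps: 14
automation_candidates:
  - dns-cutover
  - database-rehydration-trigger
notes: read-only mode held throughout; identity break-glass path not exercised this time
```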
Week 4: Automate and communicate
Automate the environment build from Git via a single YAML config, including networking and policies.
Draft customer and internal communication templates for incidents.
Brief the board: present today’s baseline, the measured game day results, and the 90-day roadmap for portability and testing cadence.
If you use Upsun, most of these steps map directly to platform features: declarative config, branch-based previews, instant database cloning with sanitisation, and multi-service orchestration. If you are building in-house, focus on achieving parity in the narrow areas that yield the most risk reduction.
Talking to stakeholders without assigning blame
When an incident originates with a cloud provider, resist the urge to assign public blame. Emphasise:
Our platform supports region choice and portability. We have tested restoration procedures and documented playbooks.
Our platform is designed for provider incidents. We invest in resilience that assumes software and networks sometimes fail.
We follow industry guidance. We structure plans, exercises, and metrics consistent with NIST recommendations and analyst trends.¹⁰¹²
We do not present multi-cloud as an automated failover. We use it to keep our options open and to make restoration predictable.
What to measure next quarter
Recovery performance for Tier 1 services. Are actual restore times and data loss within our internal objectives?
Change failure rate and mean time to restore. Resilience and delivery quality travel together.
Dependency counts on the hot path. Fewer is better.
Portability checkpoints. Can we recreate the app in another region or provider from a clean Git checkout and a single config file?
Restoration drill scorecard. Track steps completed from Git, time for data rehydration, and on-call team workload during drills.
Cost of resilience. Track spend on redundancy and game days against avoided incident hours and reduced business impact (a metrics-as-code sketch follows this list).
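These measures are easier to report if they live in one reviewed file with explicit targets. The sketch below is a hypothetical quarterly scorecard definition; the metric names echo the list above and the targets are examples, not benchmarks.

```yaml
# resilience-scorecard.yaml (hypothetical quarterly targets; values are examples)
quarter: 2026-Q1
targets:
  tier1_restores_within_objective_percent: 100
  change_failure_rate_percent_max: 10
  mean_time_to_restore_minutes_max: 60
  hot_path_dependencies_max: 12
  rebuild_from_clean_checkout: pass
  drill_data_rehydration_minutes_max: 45
  resilience_spend_vs_avoided_incident_hours: tracked
```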
Cloud outage, business continuity, and multi-cloud strategy
If your board is asking for an updated continuity position after a high-profile outage, anchor the conversation on three points:
Design for failure, not for perfection. Set internal recovery objectives for each service, including target time to restore and acceptable data loss. Practise them with disaster recovery exercises.
Portability is preparation. Keep the ability to rebuild in another location documented, scripted, and rehearsed.
Platforms can help. Choose tools that standardise environments and reduce manual steps during restoration. Upsun’s Git-driven config, previews, data cloning with sanitisation, orchestration, and observability exist to make your plan executable in practice.
The lesson from that Monday, October 20, 2025, is not that a specific provider failed. It is that the internet is a system of systems, and no component is immune to disruption. The right response is a sober, well-communicated plan that designs for failure, practices restoration, and uses the right platform abstractions to make resilience routine. That is how you protect revenue, reputation, and your team’s focus when the cloud goes dark.