Multi-cloud made simple: a practical guide to reducing risk without adding complexity

multi-app · cloud · cloud application platform · cost savings
28 October 2025

On Monday, October 20, 2025, a global hyperscaler experienced a major incident disrupting many internet services for hours, with recovery progressing throughout the day.¹² It was a reminder that even world-class platforms can have bad days and that continuity plans must account for real dependencies across identity, DNS, networking, and third-party APIs.³ This piece is the practical follow-on to our article “When the cloud goes dark: what every IT leader should have ready before the next outage”. It is written for CIOs and CTOs who now need a concrete plan to reduce risk without inflating operating cost or complexity.

Expectation setting: Upsun’s multicloud story is about smart initial region choice, portability, and tested business continuity and disaster recovery (BCP and DR). Our value is in making restoration predictable and repeatable.

Who this guide is for and what you will deliver

If you lead platform, infrastructure, or application operations and you must brief your board on a credible multicloud strategy, this guide gives you:

  • A step-by-step plan to achieve portability without tooling sprawl.
  • A clear governance model that travels with your app.
  • An implementation path on a cloud application platform, such as Upsun.
  • Metrics and artefacts you will deliver in 30, 60, and 90 days.

Analyst guidance continues to emphasise distributed cloud, portability, and digital sovereignty for infrastructure and operations (I&O) leaders.⁴ Uptime Institute’s research shows overall outage trends improving, yet complex IT and networking issues still account for a significant share of impactful incidents.⁵⁶ You cannot eliminate outages, but you can reduce correlated risk and shorten restoration with disciplined preparation.⁵⁶

The multicloud strategy

Multicloud is a strategy for choice and portability, not a promise of seamless failover. Treat it as an enabler for disaster recovery, sovereignty, and negotiating position.⁴ The operating principle is simple: accept a non-zero recovery time objective (RTO) for severe region events, then engineer for fast detection, clean restoration, and consistent governance.

Step-by-step plan: 30, 60, 90 days

Day 0 to 30: make restoration executable

Outcome by Day 30: a tested restoration path for one Tier 1 service, with artefacts that any on-call leader can run.

  1. Pick one critical user journey and map dependencies. Include identity, DNS, CDN, and operationally critical third-party APIs.
  2. Set RTO and recovery point objective (RPO) targets for the journey. Document the degraded modes you will use during restoration.
  3. Establish a clean restore target. Choose a secondary region or data centre aligned with sovereignty requirements.⁴
  4. Export and rehydrate data. Prove that today’s database can be restored and started in the target. Record time to fetch, rehydrate, and validate.
  5. Capture everything in Git. Declare services, routing, policies, and scaling in a single config.
  6. Run a game day. Simulate a provider-region incident, update DNS, use break-glass identity, and execute the restoration while operating in read-only mode. Measure time to detect, decide, and restore (one way to record the results is sketched below). Use NIST SP 800-34 as the structure for roles and decision thresholds.⁷⁸
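
One way to make the drill repeatable is to version its targets and results next to the code. The structure below is purely illustrative (it is not an Upsun or NIST schema) and the values are placeholders:

```yaml
# drills/checkout-journey.yaml -- illustrative drill record; not an Upsun or NIST schema
journey: checkout
tier: 1
targets:
  rto: "4h"                  # recovery time objective for this journey
  rpo: "15m"                 # recovery point objective for restored data
degraded_modes:
  - read-only catalogue
  - queued confirmation emails
last_drill:                  # placeholder values from a simulated region incident
  time_to_detect: "6m"
  time_to_decide: "22m"
  time_to_restore: "3h40m"
  notes: "DNS TTL lowered in advance; break-glass identity path exercised"
```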

Day 31 to 60: standardise and expand

Outcome by Day 60: repeatable playbooks for two more services, policy-as-code guardrails, and a shared observability vocabulary.

  1. Add two Tier 2 services. Achieve cross-region resilience within your primary provider while keeping portability artefacts current.
  2. Policy as code. Express network policy, data retention, backup cadence, and sanitisation as reusable modules (see the sketch after this list).
  3. Shared observability. Define a common golden signals dashboard for restore drills. This accelerates detection and decision time during incidents.
  4. Financial operations hygiene. Forecast the cost of restoration tests and steady-state backups. Tie spend to avoided incident hours, not only raw line items.
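
A reusable policy module can be a small, versioned YAML fragment that every service imports, whether you enforce it with Terraform, OPA, or your platform’s native configuration. The keys below are illustrative, not a specific tool’s schema:

```yaml
# policies/tier1-baseline.yaml -- illustrative policy-as-code module, not a specific tool's schema
backup:
  cadence: hourly
  retention: "30d"
data:
  sanitise_on_clone: true        # strip PII when cloning into non-production
  residency: "eu-west"           # sovereignty constraint for restore targets
network:
  egress_allowlist:
    - payments.example.com       # hypothetical third-party dependency
observability:
  golden_signals: [latency, traffic, errors, saturation]
```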

Day 61 to 90: industrialise

Outcome by Day 90: one-button restore pipeline from a clean Git checkout, quarterly drill cadence, and a board-ready report.

  1. Automate environment build from Git: One pipeline that rebuilds networking, policies, and services in the target (a minimal sketch follows this list).
  2. Quarterly drills: Schedule operator-led restoration tests for Tier 1 and Tier 2 services.
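
A minimal sketch of the “one-button” idea, assuming a GitHub Actions runner; the scripts and secret name are hypothetical wrappers around whatever deploy and data-rehydration mechanism you actually use:

```yaml
# .github/workflows/restore-drill.yml -- minimal sketch; scripts and secret name are hypothetical
name: restore-drill
on:
  workflow_dispatch:              # one button: triggered manually during a drill or incident
    inputs:
      target_region:
        description: Restoration target (must match your BCP/DR plan)
        required: true
jobs:
  rebuild:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4               # a clean Git checkout is the source of truth
      - name: Rebuild services, routes, and policies in the target
        run: ./scripts/restore.sh "${{ github.event.inputs.target_region }}"
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}   # hypothetical secret name
      - name: Record drill timings
        run: ./scripts/record-drill.sh          # hypothetical: appends timings to drills/*.yaml
```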

Executive reporting: Track RTO, RPO, dependency count, change failure rate, and drill results each quarter. IBM’s 2025 data places the average global breach cost at 4.44 million dollars, reinforcing why disciplined resilience work matters when incidents overlap.⁹

How to implement this on Upsun

Upsun is a multicloud application platform that helps you standardise delivery and make restoration predictable. It is not an automated cross-region failover system. Instead, it gives teams the building blocks to execute BCP and DR with confidence.

1) Connect Git and declare your app

Use a single YAML to define services, routes, policies, and scaling. Commit it alongside your code so environments can be rebuilt from a clean checkout. Read the Upsun overview and docs.
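
A minimal sketch of what that single config can look like, assuming a Node.js app backed by PostgreSQL; the exact keys and supported runtime versions should be checked against the Upsun documentation:

```yaml
# .upsun/config.yaml -- minimal sketch; confirm keys and versions against the Upsun docs
applications:
  myapp:
    type: "nodejs:20"              # runtime version is an assumption
    relationships:
      database: "db:postgresql"    # connects the app to the service below
    web:
      commands:
        start: "node server.js"

services:
  db:
    type: "postgresql:16"          # service version is an assumption

routes:
  "https://{default}/":
    type: upstream
    upstream: "myapp:http"
```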

2) Create automatic preview environments per branch

Spin up production-like environments for each branch to rehearse restoration steps, validate feature flags, and exercise dependency changes safely. Explore developer resources.

3) Clone data with sanitisation

Use instant data cloning to build representative test datasets while protecting sensitive information. This turns drills from theory into practice.
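
One common pattern, sketched below, is to sanitise cloned data in a deploy hook that only runs outside production. The script name is hypothetical, and the environment-type variable should be confirmed against the Upsun variables documentation:

```yaml
# Application fragment: sanitise data only in non-production environments.
# scripts/sanitize.sh is hypothetical; confirm the variable name in the Upsun docs.
applications:
  myapp:
    hooks:
      deploy: |
        if [ "$PLATFORM_ENVIRONMENT_TYPE" != "production" ]; then
          ./scripts/sanitize.sh    # mask or strip PII in the cloned database
        fi
```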

4) Orchestrate multi-service apps as a unit

Define dependencies once and let the platform manage start order, health checks, routing, and scale consistently across supported providers. This reduces snowflake runbooks during stressful moments.
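
As an illustration, two applications can share one declared service so the platform manages them as a unit; the runtime and service versions below are placeholders to check against the docs:

```yaml
# Illustrative fragment: two apps and a shared database managed together.
applications:
  storefront:
    type: "nodejs:20"              # placeholder version
    relationships:
      database: "db:postgresql"
  worker:
    type: "python:3.12"            # placeholder version
    relationships:
      database: "db:postgresql"

services:
  db:
    type: "postgresql:16"
```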

5) Observe once, act faster

Centralise metrics, traces, and logs so the same dashboards apply in primary and restoration targets. This shortens detection and decision time during incidents.

6) See cost across providers

Use one control plane to view utilisation and forecast spend across clouds. This improves governance without forcing you to stitch reports.

What this means for an IaaS region outage: if the underlying provider region hosting your Upsun region suffers a severe incident, you would initiate a documented restoration into a different data centre, subject to provider conditions. There is downtime during this process. Your Upsun config, preview environments, data cloning, and orchestration make that restoration predictable.

Multicloud strategy without overreach

Apply a tiered model

  • Tier 1: critical cash-path services. Engineer for fast detection and operator-led restoration. Keep tested playbooks for DNS and identity changes. Ensure data, images, and config are ready to rehydrate in the secondary target.
  • Tier 2: important but not cash-path. Achieve cross-region resilience within one provider. Keep portability artefacts current so you can rebuild elsewhere if needed.
  • Tier 3: internal and analytics. Optimise for cost with disciplined backups and a longer RTO.

Automated failover across regions or providers is complex and expensive. Many enterprises adopt a non-zero RTO with tested restores that fit risk tolerance and budget. This aligns with current analyst emphasis on distributed cloud and portability.⁴

Governance that travels with your app

  • Policy as code: Declare network rules, retention, cloning, and secrets handling once and reuse them across locations.
  • Single change process: One pipeline and quality gates, so deployments look the same everywhere.
  • Crisis communications muscle memory: Use NIST SP 800-34 for roles, exercises, and decision thresholds.⁷⁸
  • Shared observability vocabulary: Provider-agnostic metrics and traces allow apples-to-apples restoration reporting over time.

Financial discipline: Tie restoration work to avoided incident exposure and regulatory outcomes, not vanity metrics.

Measurement that proves resilience is improving

Track and present these five core metrics quarterly:

  1. RTO achieved vs target for Tier 1 drills.
  2. RPO achieved vs target for restored datasets.
  3. Change failure rate and mean time to restore, since delivery quality and resilience travel together.
  4. Hot-path dependency count, trending down as you remove or decouple third-party risk.
  5. Drill scorecard, including steps executed from Git, time for data rehydration, and operator workload.

Uptime Institute’s research notes that while frequency and severity have improved in recent years, impactful incidents still occur and can ripple across providers.⁵⁶ Your metrics show how you shorten restoration and contain impact. NIST’s guidance remains a practical scaffold for exercises and playbooks.⁷⁸

Talking to stakeholders when your cloud platform fails

  • We align with industry guidance. NIST SP 800-34 frames our plans and exercises.⁷⁸
  • We emphasise region choice and portability. This supports disaster recovery and sovereignty.⁴
  • We can operate in a degraded state. We know what goes read-only and what features we can shed during restoration.
  • We measure what matters. We report RTO, RPO, dependency count, and change failure rate. IBM’s 2025 research sets average breach cost at 4.44 million dollars, underscoring why disciplined resilience work remains essential when incidents overlap.⁹

Bottom line: start narrow, automate relentlessly, and make restoration a routine muscle. Upsun gives you a clear, Git-driven way to define environments, rehearse changes, and restore with confidence when the cloud has a bad day.

Sources

  1. The Verge. “Major AWS outage took down Fortnite, Alexa, Snapchat, and more.”
  2. Financial Times. “Amazon says cloud services recovering from widespread outage.”
  3. Le Monde. “AWS, le service cloud d’Amazon, annonce avoir résolu la panne...” (AWS, Amazon’s cloud service, says it has resolved the outage).
  4. Gartner Newsroom. “Top trends shaping the future of cloud.”
  5. Uptime Institute. “Annual Outage Analysis 2025.”
  6. McMorrow Reports. “Uptime’s data center outage analysis: improvement but new risks.”
  7. NIST SP 800-34 Rev. 1 (publication page). “Contingency Planning Guide for Federal Information Systems.”
  8. NIST SP 800-34 Rev. 1.
  9. Help Net Security, summarising IBM’s 2025 study. “Average global data breach cost now $4.44 million.”
