Multi-AZ Is Dead: How Physical Attacks Shattered Cloud Resilience Assumptions

DORA Atlas Editorial | 11 min read

"Deploy across multiple availability zones." This has been the cardinal rule of cloud architecture since AWS introduced AZs in 2008. The logic is elegant: by distributing workloads across physically separate data centers within a region, you protect against localized failures. If one AZ loses power or connectivity, the others continue operating.

On March 20, 2026, that logic met a ballistic missile.

As InfoQ reported on March 18 — remarkably, before the most severe strikes occurred — the Iran conflict had already damaged multiple AWS data centers in the same region. The full impact of the March 20 strikes confirmed what the InfoQ report foreshadowed: when the threat is not a facility-level failure but a regional military attack, multi-AZ deployment provides no meaningful protection.

The Multi-AZ Promise: What Cloud Providers Actually Guarantee

To understand what broke, we need to understand what was promised. Cloud providers define an availability zone as a physically distinct data center (or cluster of data centers) with independent power, cooling, and networking. Within a region, AZs are connected by high-bandwidth, low-latency networking but are separated by "meaningful distance" to reduce the risk of correlated failures.

The key assumption is independence: AZ failures should not be correlated. AWS's architecture documentation states that AZs are "physically separated by a meaningful distance, many kilometers, from any other AZ."

Provider | AZ Separation Claim | Failure Independence Claim | Gulf Reality
AWS | "Meaningful distance, many km" | "Independent failure domains" | 3 AZs within missile blast radius
Azure | "300+ meters minimum" | "Independent infrastructure" | Not tested in Gulf strikes
Google Cloud | "Low-risk zone isolation" | "Rare simultaneous failure" | Not tested in Gulf strikes

The problem in Bahrain is that "many kilometers" in a country of roughly 780 km² means something very different from "many kilometers" in the continental United States. The whole island nation is smaller than many large metropolitan areas. AWS's three AZs in the me-south-1 region were well separated by Bahraini standards, but all of them lay within the operational radius of a single military strike package targeting the island.
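To give a rough sense of scale, the sketch below converts the 780 km² figure into the diameter of a circle with the same area. This is a deliberate simplification: the real territory is not circular, so actual site-to-site distances will differ, and the function name is purely illustrative.

```python
import math


def equivalent_circle_diameter_km(area_km2: float) -> float:
    """Diameter of a circle whose area equals area_km2.

    Assumption: the territory is treated as a circle, which only gives
    an order-of-magnitude scale for how far apart sites could be.
    """
    if area_km2 <= 0:
        raise ValueError("area_km2 must be positive")
    return 2.0 * math.sqrt(area_km2 / math.pi)


if __name__ == "__main__":
    # 780 km² is the area figure quoted in the text.
    print(f"{equivalent_circle_diameter_km(780.0):.1f} km")
```

Under that simplification the whole territory spans a few tens of kilometers, which is the scale on which "meaningful distance" has to be read here.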

What InfoQ Documented: Correlated Failure at Scale

InfoQ's March 18 report, "War in Iran Damages Multiple AWS Data Centers," provided the first detailed technical analysis of how the conflict affected cloud infrastructure. The report documented several critical findings:

First, the strikes caused simultaneous failures across multiple AZs. This was not a cascading failure where one AZ's outage overloaded another — it was a simultaneous physical impact on geographically proximate facilities.

Second, the shared infrastructure layer — power grid, telecommunications backbone, internet transit — failed regionally. Even AZs that were not directly struck lost connectivity because the regional infrastructure they depended on was damaged.

Third, the blast effects extended well beyond the direct impact zone. Electromagnetic pulse effects from detonations disrupted electronics in a radius larger than the physical destruction zone, affecting facilities that the strikes did not directly hit.

The Independence Assumption Was Always Fragile

The Gulf strikes did not create the correlated failure problem — they merely demonstrated it at a scale that was impossible to ignore. The independence assumption for AZs within a region has been eroding for years:

Power grid dependence: AZs within a region typically draw from the same national or regional power grid. When that grid fails — whether from a military strike, a severe storm, or a systemic failure like the 2025 Iberian blackout — all AZs in the region are affected simultaneously.

Network backbone sharing: The high-bandwidth connections between AZs and the internet transit points that connect the region to the global network are shared infrastructure. A strike on a telecommunications hub affects all AZs that route through it.

Human and operational dependencies: The same operations teams manage all AZs within a region. A regional emergency that affects the workforce — evacuation, conflict, natural disaster — degrades the human capacity to respond across all AZs simultaneously.

Shared Dependency | Cloud Provider Mitigation | Failure in Gulf Crisis
Power grid | Diesel generators (48-96h) | Grid destroyed; fuel delivery impossible due to conflict
Network backbone | Redundant fiber paths | All paths within small island; physically destroyed
Cooling water | Closed-loop systems | Systems damaged by blast effects
Operations teams | Regional on-call staff | Staff evacuated from conflict zone
Supply chain (parts) | Regional spare inventory | Logistics disrupted by military operations
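Why shared dependencies matter so much can be illustrated with a small calculation. The sketch below uses the beta-factor model, a common-cause failure model from reliability engineering that this article does not itself use: a fraction beta of each zone's outage probability is attributed to a shared cause that takes down every zone at once. The probabilities and the beta value are illustrative assumptions, not measurements.

```python
def p_all_zones_down(p_zone: float, n_zones: int, beta: float) -> float:
    """Probability that every zone is down at once (beta-factor sketch).

    p_zone  : outage probability of one zone over some period (assumed)
    n_zones : number of zones in the region
    beta    : fraction of p_zone attributed to a shared cause (assumed)
    """
    if not 0.0 <= p_zone <= 1.0 or not 0.0 <= beta <= 1.0 or n_zones < 1:
        raise ValueError("invalid parameters")
    p_common = beta * p_zone          # one shock that hits all zones together
    p_indep = (1.0 - beta) * p_zone   # remaining per-zone independent failures
    # Either the shared shock occurs, or all zones fail independently.
    return p_common + (1.0 - p_common) * p_indep ** n_zones


if __name__ == "__main__":
    for beta in (0.0, 0.01, 0.1):
        print(beta, p_all_zones_down(p_zone=0.01, n_zones=3, beta=beta))
```

With beta set to zero the result collapses to p_zone cubed, which is the independence assumption. Even a small shared fraction makes the common-cause term dominate, so adding more zones in the same region barely changes the outcome.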

What "Multi-Region" Actually Means Now

If multi-AZ within a single region cannot protect against correlated physical failure, the logical response is multi-region deployment — distributing workloads across regions that are geographically distant enough to ensure truly independent failure modes.

But multi-region architecture comes with significant costs and complexity that many financial institutions have been reluctant to accept:

Latency: Cross-region latency ranges from 20ms to 200ms+ depending on distance, compared to <2ms within a region. For latency-sensitive financial applications (payment processing, market data), this is a meaningful degradation.

Data consistency: Synchronous replication across regions is impractical at distance, requiring either asynchronous replication (with potential data loss) or conflict-resolution mechanisms (with operational complexity).

Cost: Multi-region active-active deployments roughly double infrastructure costs compared to multi-AZ within a single region. For financial institutions with tight IT budgets, this is a hard conversation with the CFO.

Operational complexity: Managing a multi-region deployment requires more sophisticated tooling, more experienced staff, and more rigorous testing than a single-region multi-AZ deployment.
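The latency and data-consistency points above follow from simple arithmetic. The sketch below estimates a physical floor on round-trip time from distance, assuming light travels through fiber at about two thirds of its vacuum speed, and the number of writes exposed to loss under asynchronous replication for a given lag. The function names and example inputs are illustrative assumptions; real round-trip times are higher because fiber routes are longer than straight lines and network equipment adds delay.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_VELOCITY_FACTOR = 2.0 / 3.0  # assumption: typical for optical fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber for a given path length."""
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_PER_S * FIBER_VELOCITY_FACTOR)
    return 2.0 * one_way_s * 1000.0


def writes_at_risk(write_rate_per_s: float, replication_lag_s: float) -> float:
    """Writes acknowledged locally but not yet replicated if the primary is lost."""
    if write_rate_per_s < 0 or replication_lag_s < 0:
        raise ValueError("inputs must be non-negative")
    return write_rate_per_s * replication_lag_s


if __name__ == "__main__":
    for km in (100, 1_000, 5_000):
        print(f"{km} km: at least {min_round_trip_ms(km):.1f} ms round trip")
    print(writes_at_risk(write_rate_per_s=500, replication_lag_s=2))
```

A synchronous commit has to wait at least one such round trip per write, which is why synchronous replication becomes impractical as regions move further apart, and why asynchronous replication trades that wait for a nonzero recovery point.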

DORA's Requirements in Light of Multi-AZ Failure

The Digital Operational Resilience Act (Regulation (EU) 2022/2554) does not prescribe specific architecture patterns. It requires outcomes: continuity of critical functions, recovery within defined timeframes, and demonstrated resilience through testing.

However, the Gulf strikes force a reinterpretation of several DORA requirements:

Article 11: Business Continuity Policy

DORA Article 11 requires business continuity policies that enable "a quick, appropriate and effective response to and resolution of all ICT-related incidents in a manner that limits damage and prioritises the resumption of activities." If the BCP assumes multi-AZ resilience within a single region, and that assumption has been proven invalid for certain threat scenarios, the BCP is inadequate.

Article 12: Recovery Plans

Article 12 requires recovery plans with "appropriate, scenario-based tests." The "appropriate scenarios" must now include total regional loss, not just single-AZ failure. Financial institutions that have only tested AZ-level failover need to expand their testing to include region-level failover.

Articles 24-27: Resilience Testing

The resilience testing programme must include scenarios that challenge the independence assumption. A test that validates multi-AZ failover but never tests multi-region failover provides a false sense of security.
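A scenario that challenges the independence assumption can be expressed as a simple capacity check before it is ever run against live systems. The sketch below is a minimal model, not a DORA-prescribed method: zone and region names are hypothetical, and "survives" is assumed to mean that the remaining zones can still serve 100% of peak production load.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    region: str
    name: str
    capacity: float  # fraction of peak production load this zone can serve


def surviving_capacity(zones, failed_regions=frozenset(), failed_zones=frozenset()):
    """Total capacity left after removing failed regions and (region, zone) pairs."""
    return sum(
        z.capacity
        for z in zones
        if z.region not in failed_regions
        and (z.region, z.name) not in failed_zones
    )


def survives(zones, failed_regions=frozenset(), failed_zones=frozenset()):
    """True if the remaining zones can still carry the full production load."""
    return surviving_capacity(zones, failed_regions, failed_zones) >= 1.0


# Hypothetical topologies: three AZs in one region vs. two AZs in each of two regions.
SINGLE_REGION = [Zone("region-a", az, 0.5) for az in ("az1", "az2", "az3")]
TWO_REGIONS = [Zone(r, az, 0.5) for r in ("region-a", "region-b") for az in ("az1", "az2")]

if __name__ == "__main__":
    print(survives(SINGLE_REGION, failed_zones={("region-a", "az1")}))  # single-AZ loss
    print(survives(SINGLE_REGION, failed_regions={"region-a"}))         # total regional loss
    print(survives(TWO_REGIONS, failed_regions={"region-a"}))           # total regional loss
```

The single-region topology passes the AZ-loss scenario and fails the regional-loss scenario, which is exactly the gap that a test programme limited to AZ failover never exposes.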

The European Banking Authority and ESMA should consider issuing guidance that explicitly addresses the minimum geographic separation required for disaster recovery sites. The current practice of DR within the same cloud region is no longer defensible for critical financial services.

Practical Architecture Patterns for Post-Multi-AZ Resilience

Financial institutions should adopt architecture patterns that do not rely on the independence of co-located infrastructure:

Active-Active Multi-Region: Deploy full application stacks in two or more regions on different continents. Use global load balancing to distribute traffic and ensure that each region can handle 100% of production load. This is the gold standard but the most expensive and complex option.

Active-Passive Multi-Region with Warm Standby: Maintain a fully provisioned but traffic-free secondary region that can be activated within minutes. This reduces cost but increases RTO compared to active-active.

Multi-Cloud Multi-Region: Deploy across different cloud providers in different regions. This addresses both provider concentration and geographic concentration risk. It is the most resilient but also the most operationally demanding pattern.

Pattern | RTO | RPO | Monthly Cost Premium | Complexity | DORA Alignment
Multi-AZ single region | Minutes | Near-zero | Baseline | Low | Insufficient for critical services
Active-passive multi-region | 15-60 min | Minutes | +40-60% | Medium | Meets Art. 11/12 requirements
Active-active multi-region | Near-zero | Near-zero | +80-120% | High | Exceeds requirements
Multi-cloud multi-region | Near-zero | Near-zero | +100-150% | Very high | Maximum resilience
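To turn the cost premiums in the table into budget figures, the sketch below applies each range to a baseline monthly spend. The premium ranges are taken from the table; the baseline amount and the function name are illustrative assumptions, and real pricing will depend on data transfer, licensing, and staffing that this arithmetic ignores.

```python
# Premium ranges over the multi-AZ baseline, as (low, high) fractions from the table.
COST_PREMIUM = {
    "multi-az single region": (0.0, 0.0),
    "active-passive multi-region": (0.40, 0.60),
    "active-active multi-region": (0.80, 1.20),
    "multi-cloud multi-region": (1.00, 1.50),
}


def monthly_cost_range(baseline: float, pattern: str) -> tuple:
    """Return the (low, high) monthly cost for a pattern given a baseline spend."""
    if baseline < 0:
        raise ValueError("baseline must be non-negative")
    low, high = COST_PREMIUM[pattern]
    return baseline * (1.0 + low), baseline * (1.0 + high)


if __name__ == "__main__":
    baseline = 100_000.0  # hypothetical monthly multi-AZ spend
    for pattern in COST_PREMIUM:
        low, high = monthly_cost_range(baseline, pattern)
        print(f"{pattern}: {low:,.0f} to {high:,.0f}")
```

For a hypothetical baseline of 100,000 per month, active-active multi-region lands at roughly 180,000 to 220,000, which is the "roughly double" figure discussed earlier.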

Conclusion: The End of "Good Enough" Resilience

For a decade, multi-AZ deployment was "good enough" for virtually every financial institution's resilience requirements. The Gulf strikes have demonstrated that "good enough" was a bet on the absence of correlated physical threats — a bet that paid off until it did not.

DORA-regulated entities must now confront the true cost of resilience. Multi-region deployment is more expensive, more complex, and harder to operate than multi-AZ. But the alternative — trusting that no military, seismic, or infrastructure event will simultaneously affect all AZs within a region — is no longer a credible risk management position.

The assessment of cloud resilience strategies must be updated. The concentration risk models must account for physical proximity. The testing programmes must include scenarios where entire regions disappear. And the board must be informed, per DORA Article 14, that the resilience posture they were assured of may not withstand the threats that 2026 has revealed.

Multi-AZ is not dead in the literal sense — it still protects against the most common failure modes. But as the sole resilience strategy for critical financial services, its days are over.


Summary

Multi-AZ deployment, the cardinal rule of cloud architecture since 2008, rested on the assumption that availability zones fail independently. The March 2026 strikes on AWS facilities in Bahrain shattered that assumption: missiles simultaneously destroyed or damaged several AZs in the same region. InfoQ's March 18 analysis documented how shared infrastructure (power grid, telecommunications backbone, operations teams) created correlated failure modes that software cannot overcome. For DORA entities, the implications are major: business continuity policies (Art. 11) that assume multi-AZ resilience are inadequate, recovery plans (Art. 12) must include total loss of a region, and testing programmes (Art. 24-27) must validate multi-region failover, not just multi-AZ failover. The target architecture is now active-active multi-region deployment, with a cost premium of 80-120% but a level of resilience that matches the real threat landscape.
