Cloud-Native vs Legacy: Two Resilience Strategies, One DORA Framework

DORA Atlas Editorial · 12 min read
Two Architectures, One Regulation

Walk into the technology center of any major European bank and you will find two realities operating side by side. On one side: Kubernetes clusters running containerized microservices, auto-scaling based on demand, self-healing when pods fail, deployed through CI/CD pipelines multiple times per day. On the other: IBM Z mainframes processing nightly COBOL batch jobs, running core banking ledger operations that have not been architecturally changed in 20 years, deployed through change windows that take weeks to schedule.

Both architectures process payments. Both store customer data. Both support critical or important functions as defined by DORA Art. 3(22). Both must comply with the same regulation: the same Art. 7 reliability requirements, the same Art. 11 business continuity obligations, the same Art. 24 testing mandates.

But the way they achieve resilience is fundamentally different. Cloud-native architectures are designed for failure — they expect components to fail and recover automatically. Legacy architectures are designed to prevent failure — they invest in hardware redundancy, cold standby, and carefully tested failover procedures. Neither approach is inherently superior for DORA compliance. Each has strengths that map to certain DORA requirements and weaknesses that require compensating controls.

The challenge for most European financial institutions is not choosing one paradigm or the other. It is operating both simultaneously under a unified DORA compliance programme.

How DORA Requirements Map to Each Paradigm

Art. 7: ICT Systems Reliability

Art. 7 requires financial entities to "use and maintain updated ICT systems, protocols and tools" that are reliable, have sufficient capacity, and are technologically resilient. The two paradigms approach each of these dimensions very differently:

| DORA Art. 7 Dimension | Cloud-Native Approach | Legacy Approach |
| --- | --- | --- |
| Reliability | Redundancy through replication; any instance can fail without service impact | Redundancy through hardware; dual processors, mirrored storage, hot standby |
| Capacity | Auto-scaling based on real-time demand; horizontal elasticity | Capacity planning based on peak projections; vertical scaling limited by hardware |
| Technological resilience | Self-healing (Kubernetes restarts failed containers); rolling deployments | Scheduled maintenance windows; manual failover procedures |
| Currency | Continuous deployment; patches deployed in minutes | Change windows; patches tested for weeks before deployment |

Cloud-native architectures excel at Art. 7's capacity requirement because auto-scaling adjusts resources dynamically. But they can struggle with reliability when microservice dependencies create cascading failure modes — a single misconfigured service can take down a service mesh.

Legacy architectures excel at Art. 7's reliability for individual components — a mainframe's mean time between failures (MTBF) can exceed five years. But they struggle with currency: the update cycle for a COBOL core banking system can be measured in months, not minutes, creating a window of vulnerability that is difficult to defend.

Art. 11: Business Continuity

Art. 11 requires an ICT business continuity policy with associated response and recovery plans, and Art. 12(6) refers to "recovery time and recovery point objectives for each function." Here the paradigms diverge most starkly:

Cloud-native RTOs can be a matter of seconds for stateless services (Kubernetes typically replaces a failed pod within seconds) but can stretch to hours for stateful systems (database failover, state reconstruction, cache warming). The RTO depends on the service's state management strategy.

Legacy RTOs are typically 2-8 hours for mainframe failover (cold standby activation, data synchronization, application restart, transaction reconciliation) but are highly predictable. The failover procedure is documented, tested, and well-understood.

For RPO, cloud-native architectures using event-sourced databases or synchronous multi-region replication can achieve near-zero RPO. Legacy batch processing systems may have RPO equal to the batch cycle — typically 24 hours for nightly batch, which represents a significant data loss risk.
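
The RTO and RPO comparison above can be made concrete with a small check against recovery targets. The following is a minimal sketch, not a prescribed method: the `RecoveryProfile` type, the helper names, and every figure (including the 4-hour RTO and 1-hour RPO targets) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryProfile:
    """Recovery characteristics of one function (all values are illustrative)."""
    name: str
    worst_case_rto_s: float  # longest expected time to restore service, in seconds
    worst_case_rpo_s: float  # longest window of data that could be lost, in seconds


def batch_rpo_seconds(batch_interval_hours: float) -> float:
    """Worst-case RPO for a batch system: everything since the last completed batch."""
    return batch_interval_hours * 3600


def breaches(profile: RecoveryProfile, rto_target_s: float, rpo_target_s: float) -> list[str]:
    """Return which objectives the profile would miss against the given targets."""
    missed = []
    if profile.worst_case_rto_s > rto_target_s:
        missed.append("RTO")
    if profile.worst_case_rpo_s > rpo_target_s:
        missed.append("RPO")
    return missed


# Hypothetical figures loosely based on the ranges quoted in the text.
stateless_api = RecoveryProfile("stateless API", worst_case_rto_s=30, worst_case_rpo_s=0)
nightly_ledger = RecoveryProfile(
    "nightly batch ledger",
    worst_case_rto_s=8 * 3600,
    worst_case_rpo_s=batch_rpo_seconds(24),
)

# Assumed targets: 4-hour RTO and 1-hour RPO.
for profile in (stateless_api, nightly_ledger):
    print(profile.name, breaches(profile, rto_target_s=4 * 3600, rpo_target_s=3600))
```

Modelling the batch RPO as the full batch interval is deliberately pessimistic: it assumes a failure just before the next batch completes.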

Art. 24-27: Testing

Art. 24-27 require a digital operational resilience testing programme. The testing approaches differ:

| Test Type | Cloud-Native | Legacy |
| --- | --- | --- |
| Vulnerability scanning | Automated in CI/CD; container image scanning; infrastructure-as-code validation | Scheduled quarterly scans; manual review of mainframe security configurations |
| Scenario-based testing | Chaos engineering (injecting failures into production); game days | Tabletop exercises; scheduled DR drills with predetermined scenarios |
| Performance testing | Load testing in staging with production-like traffic patterns; canary deployments | Capacity stress tests during maintenance windows; offline performance benchmarking |
| Recovery testing | Chaos Monkey-style automated recovery validation; blue-green deployment rollback | Annual DR failover test; manual execution with documented runbook |
| TLPT | API penetration testing, container escape testing, service mesh exploitation | Mainframe penetration testing (specialized skill), CICS transaction exploitation |

Cloud-native environments enable continuous testing through chaos engineering practices in the spirit of Netflix's Chaos Monkey, where failures are deliberately injected into production to validate resilience. This goes beyond tabletop exercises by validating the actual recovery mechanisms, which is consistent with the intent of DORA's testing requirements. Legacy environments require more structured, scheduled testing with careful change control.
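
To illustrate the shape of a failure-injection experiment, here is a toy sketch that runs entirely in memory. `ReplicaPool` and `run_experiment` are hypothetical names and do not correspond to any real orchestrator or chaos tooling; the point is only the sequence of inject, observe, recover, verify.

```python
import random


class ReplicaPool:
    """Toy stand-in for a replicated stateless service (not a real orchestrator)."""

    def __init__(self, desired_replicas: int) -> None:
        self.desired_replicas = desired_replicas
        self.healthy = set(range(desired_replicas))
        self._next_id = desired_replicas

    def kill(self, replica_id: int) -> None:
        """Simulate the failure of one replica."""
        self.healthy.discard(replica_id)

    def handle_request(self) -> bool:
        """A request succeeds as long as at least one replica is healthy."""
        return len(self.healthy) > 0

    def reconcile(self) -> None:
        """Start replacements until the desired replica count is restored."""
        while len(self.healthy) < self.desired_replicas:
            self.healthy.add(self._next_id)
            self._next_id += 1


def run_experiment(pool: ReplicaPool, rng: random.Random) -> dict[str, bool]:
    """Kill one random replica, check service continuity, then check self-healing."""
    victim = rng.choice(sorted(pool.healthy))
    pool.kill(victim)
    served_during_failure = pool.handle_request()
    pool.reconcile()
    restored = len(pool.healthy) == pool.desired_replicas
    return {"served_during_failure": served_during_failure, "restored": restored}


result = run_experiment(ReplicaPool(desired_replicas=3), random.Random(42))
print(result)
```

A real experiment would also need a blast-radius limit and an abort condition, and its recorded result would form part of the testing evidence.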

The Hybrid Reality

Most European financial institutions are neither fully cloud-native nor fully legacy. They operate hybrid estates where:

  • Customer-facing APIs run on Kubernetes in the cloud
  • Core banking ledger operations run on mainframes
  • Middle-office systems run on traditional virtualized infrastructure
  • Data analytics run on cloud-native big data platforms
  • Some systems are mid-migration — partly containerized, partly monolithic

This hybrid reality creates integration points that are the highest-risk elements from a DORA perspective. The API gateway that translates between a cloud-native microservice and a mainframe transaction is both a critical dependency and a single point of failure.

The integration layer — enterprise service buses, message queues, transaction gateways, protocol translators — carries the resilience risk of both paradigms and the strengths of neither. It runs on infrastructure that is rarely as well-managed as either the cloud or the mainframe. It is often the oldest, least-documented, and most fragile component of the technology estate.

DORA Compliance Strategy for Hybrid Estates

1. Unified Asset Register Covering Both Paradigms

Art. 8 requires identification and classification of all ICT assets. For hybrid estates, the asset register must span cloud resources (containers, managed services, serverless functions), legacy systems (mainframes, midrange, COBOL applications), and — critically — the integration layer that connects them. The register must capture dependencies across paradigm boundaries.
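
One way to picture a register that captures dependencies across paradigm boundaries is a small graph of assets tagged by paradigm. This is a sketch under stated assumptions: the `Asset` type, the three paradigm labels, and the example asset identifiers are all hypothetical and not taken from any DORA template.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    asset_id: str
    paradigm: str  # assumed labels: "cloud", "legacy", or "integration"
    depends_on: list[str] = field(default_factory=list)


def cross_paradigm_dependencies(register: dict[str, Asset]) -> list[tuple[str, str]]:
    """Return (asset, dependency) pairs whose two ends sit in different paradigms."""
    edges = []
    for asset in register.values():
        for dep_id in asset.depends_on:
            if register[dep_id].paradigm != asset.paradigm:
                edges.append((asset.asset_id, dep_id))
    return sorted(edges)


# Hypothetical register entries.
register = {
    a.asset_id: a
    for a in [
        Asset("payments-api", "cloud", depends_on=["fraud-scoring", "mq-gateway"]),
        Asset("fraud-scoring", "cloud"),
        Asset("mq-gateway", "integration", depends_on=["core-ledger"]),
        Asset("core-ledger", "legacy"),
    ]
}

for asset_id, dep_id in cross_paradigm_dependencies(register):
    print(f"{asset_id} -> {dep_id}")
```

The pairs returned are the boundary crossings that the risk assessment and the testing programme would need to cover explicitly.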

2. Risk Assessment That Acknowledges Different Failure Modes

Cloud-native systems fail often but recover quickly (transient failures, pod restarts, network partitions). Legacy systems fail rarely but recover slowly (hardware failure, manual failover, batch reconciliation). The risk assessment under Art. 6 must account for these different failure profiles:

| Risk Dimension | Cloud-Native Risk | Legacy Risk |
| --- | --- | --- |
| Frequency of failure | High (designed for it) | Low (hardware redundancy) |
| Recovery time | Seconds to minutes (automated) | Hours (manual procedures) |
| Cascading failure | High (microservice dependency chains) | Low (monolithic isolation) |
| Data loss risk | Low (event sourcing, replication) | Medium (batch cycle RPO gap) |
| Vendor lock-in | High (cloud provider dependency) | Low (on-premises control) |
| Skills availability | Growing (cloud-native talent market) | Declining (mainframe expertise retiring) |
| Compliance evidence | Automated (infrastructure as code, logs) | Manual (run books, sign-off sheets) |
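
The contrast between failing often with fast recovery and failing rarely with slow recovery can be put into numbers with the standard steady-state availability formula, MTBF / (MTBF + MTTR). The inputs below are assumptions for illustration: the cloud failure rate is invented, while the five-year MTBF and 8-hour recovery echo figures quoted earlier in this guide.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Classic steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


HOURS_PER_YEAR = 365 * 24

# Assumed profile: a cloud-native service that fails about once a day
# and recovers automatically in 30 seconds.
cloud = steady_state_availability(mtbf_hours=24, mttr_hours=30 / 3600)

# Assumed profile: a mainframe component with a five-year MTBF
# and an 8-hour manual failover.
legacy = steady_state_availability(mtbf_hours=5 * HOURS_PER_YEAR, mttr_hours=8)

print(f"cloud-native: {cloud:.5f}")
print(f"legacy:       {legacy:.5f}")
```

Under these assumed inputs both profiles land above 99.9% availability, yet the shape of the downtime differs: many 30-second interruptions versus a single 8-hour outage. That is why a single availability figure is a poor substitute for assessing the two failure profiles separately.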

3. Testing Programme That Covers Both Worlds

The Art. 24-27 testing programme must include tests for cloud-native resilience (chaos engineering, container security, auto-scaling validation), legacy resilience (DR failover, batch recovery, mainframe security), and — most importantly — integration resilience (what happens when the message queue between the cloud and the mainframe fails? what happens when the enterprise service bus is unavailable for 4 hours during a mainframe failover?).
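
For the 4-hour outage scenario, a back-of-the-envelope calculation shows what an integration test would need to verify. This is a minimal sketch: the message rates and queue capacity are assumed values, and a real queue would add retries, ordering constraints, and expiry that are not modelled here.

```python
def backlog_after_outage(arrival_per_s: float, outage_s: float) -> float:
    """Messages accumulated while the downstream consumer is unavailable."""
    return arrival_per_s * outage_s


def drain_time_s(backlog: float, arrival_per_s: float, consume_per_s: float) -> float:
    """Time to clear the backlog after recovery while new messages keep arriving."""
    if consume_per_s <= arrival_per_s:
        return float("inf")  # the consumer never catches up
    return backlog / (consume_per_s - arrival_per_s)


# All figures below are assumptions for illustration only.
ARRIVAL_PER_S = 50          # messages per second produced by the cloud side
CONSUME_PER_S = 200         # messages per second the mainframe side can process
OUTAGE_S = 4 * 3600         # the 4-hour failover window from the scenario
QUEUE_CAPACITY = 1_000_000  # maximum number of messages the queue can hold

backlog = backlog_after_outage(ARRIVAL_PER_S, OUTAGE_S)
catch_up = drain_time_s(backlog, ARRIVAL_PER_S, CONSUME_PER_S)

print(f"backlog: {backlog:,.0f} messages (fits in queue: {backlog <= QUEUE_CAPACITY})")
print(f"catch-up after recovery: {catch_up / 60:.0f} minutes")
```

With these assumed rates, the function served through the queue is not fully current until 80 minutes after the mainframe itself is back. The effective recovery time of the integrated function is therefore longer than the failover time alone.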

4. Continuity Plans That Account for Paradigm Interactions

The worst-case scenarios for hybrid estates involve failures at the integration boundary. A cloud-native service that cannot reach the mainframe creates a split-brain situation: the customer-facing API shows one state, the core banking ledger shows another. The Art. 11 continuity plan must document how to reconcile these states after recovery.
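
The reconciliation step can be sketched as a comparison between the two views of transaction state. The function name, the status values, and the transaction identifiers are hypothetical; a production reconciliation would also need to decide which side is authoritative for each discrepancy.

```python
def reconcile(api_view: dict[str, str], ledger_view: dict[str, str]) -> dict[str, list[str]]:
    """Compare transaction states seen by the API layer and by the core ledger."""
    api_ids, ledger_ids = set(api_view), set(ledger_view)
    return {
        "missing_in_ledger": sorted(api_ids - ledger_ids),
        "missing_in_api": sorted(ledger_ids - api_ids),
        "status_mismatch": sorted(
            tx for tx in api_ids & ledger_ids if api_view[tx] != ledger_view[tx]
        ),
    }


# Hypothetical state captured after connectivity is restored.
api_view = {"tx-1001": "settled", "tx-1002": "settled", "tx-1003": "pending"}
ledger_view = {"tx-1001": "settled", "tx-1002": "pending", "tx-1004": "settled"}

report = reconcile(api_view, ledger_view)
for category, tx_ids in report.items():
    print(category, tx_ids)
```

Each non-empty category corresponds to a decision the continuity plan should document in advance: replay, reverse, or escalate for manual review.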

5. Evidence Collection That Bridges Both Paradigms

DORA's implicit evidence requirements apply to both paradigms, but the evidence collection mechanisms differ. Cloud-native systems produce immense volumes of structured logs, metrics, and traces automatically. Legacy systems produce scheduled reports, operator logs, and batch processing summaries. The evidence management strategy must normalize evidence from both sources into a unified, auditable format.
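
Normalizing evidence from both sources might look like the sketch below. Both input formats are assumptions made for illustration (a JSON log line with `ts`, `service`, `event`, and `ok` keys, and a space-separated operator log line in UTC), as is the `EvidenceRecord` schema.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class EvidenceRecord:
    timestamp: datetime  # always timezone-aware UTC
    source: str          # "cloud" or "legacy"
    system: str
    event: str
    outcome: str         # "success" or "failure"


def from_cloud_log(line: str) -> EvidenceRecord:
    """Parse a JSON log line (assumed keys: ts, service, event, ok)."""
    entry = json.loads(line)
    return EvidenceRecord(
        timestamp=datetime.fromisoformat(entry["ts"]).astimezone(timezone.utc),
        source="cloud",
        system=entry["service"],
        event=entry["event"],
        outcome="success" if entry["ok"] else "failure",
    )


def from_operator_log(line: str) -> EvidenceRecord:
    """Parse an operator log line (assumed layout: date time system event status, in UTC)."""
    date, time, system, event, status = line.split()
    stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)
    return EvidenceRecord(
        timestamp=stamp,
        source="legacy",
        system=system.lower(),
        event=event.lower(),
        outcome="success" if status == "COMPLETED" else "failure",
    )


records = sorted(
    [
        from_cloud_log('{"ts": "2025-03-01T02:15:30+00:00", "service": "payments-api", '
                       '"event": "pod_restart", "ok": true}'),
        from_operator_log("2025-03-01 02:15 LEDGERBATCH DR-FAILOVER-TEST COMPLETED"),
    ],
    key=lambda record: record.timestamp,
)
for record in records:
    print(record.timestamp.isoformat(), record.source, record.system, record.event, record.outcome)
```

Putting both sources on one timeline with one outcome vocabulary lets an auditor follow a single incident or test across the paradigm boundary.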

The Migration Question

Many European financial institutions are mid-journey on cloud migration programmes. DORA adds a new dimension to migration decisions: operational resilience. The migration itself is a DORA risk, because moving a critical function from one architecture to another creates a transition period where neither architecture provides full resilience.

DORA does not mandate any specific architecture. A financial institution can achieve full compliance with mainframes, with cloud-native infrastructure, or with any combination. What DORA mandates is that the institution can demonstrate resilience — through testing, evidence, governance, and recovery — regardless of the architecture.

The supervisory question is not "are you in the cloud?" but "can you recover this critical function within the RTO your BIA defines, with evidence that proves you tested that recovery?" Whether the answer involves Kubernetes pod rescheduling or mainframe cold standby activation is an implementation detail.

Supervisory Expectations for Hybrid Estates

The ECB and national competent authorities understand that significant institutions operate hybrid estates. Their examination focus is likely to be on:

  • Integration point resilience: Are the connections between cloud and legacy systems identified, risk-assessed, and tested?
  • Consistent governance: Is the DORA compliance programme applied uniformly, or does the cloud team have different governance from the mainframe team?
  • Skills and knowledge: Are there sufficient personnel who understand both paradigms and, critically, the integration layer?
  • Vendor management consistency: Are cloud service providers managed with the same rigor as traditional outsourcing arrangements?

Key Takeaways

  • DORA is architecture-agnostic — it applies equally to cloud-native and legacy systems. Compliance is about demonstrable resilience, not technology choice.
  • Hybrid estates carry the highest risk at the integration layer between cloud and legacy. This is where DORA testing should focus.
  • Cloud-native and legacy have complementary strengths: cloud excels at capacity and automated recovery; legacy excels at single-component reliability and data consistency.
  • The asset register must span both paradigms and explicitly capture cross-paradigm dependencies.
  • Testing programmes must include integration failure scenarios — not just cloud resilience or mainframe DR in isolation.
  • Migration itself is a DORA risk that must be governed with the same rigor as production operations.

Summary

European financial institutions operate in a hybrid world: cloud-native microservices deployed on Kubernetes alongside mainframes that have been running COBOL for decades. DORA applies identically to both paradigms; what matters is demonstrable resilience, not technology choice. Cloud-native architectures excel at elasticity and automated recovery (RTOs in seconds) but are vulnerable to cascading microservice failures. Legacy architectures offer exceptional component reliability (an MTBF of five years) but recovery times of several hours and patch cycles of several weeks. The highest risk in hybrid environments lies at the integration layer (service buses, message queues, transaction gateways), which benefits from the strengths of neither paradigm. The testing programme (Art. 24-27) must cover all three layers: cloud resilience, mainframe recovery and, above all, failure scenarios at the integration boundary. The asset inventory (Art. 8) must capture cross-paradigm dependencies, and the continuity plans (Art. 11) must document reconciliation after a "split-brain" failure between the layers.
