
Circuit Breakers and Bulkheads: Engineering Resilience Patterns for DORA Banking Systems

DORA Atlas Editorial | December 15, 2025 | 11 min read

When One Failure Becomes Many

On July 19, 2024, a single faulty update to a cybersecurity agent brought down systems across airlines, hospitals, banks, and government agencies worldwide. The technical root cause was a single piece of software — but the impact cascaded because interconnected systems lacked isolation boundaries. When the security agent crashed, it took the operating system with it. When the operating system crashed, every application on that host became unavailable. When those applications became unavailable, downstream services that depended on them failed. The cascade propagated through dependency chains that spanned organizations, sectors, and continents.

This is precisely the failure mode that DORA was designed to address. Article 9(1) requires financial entities to implement "ICT security policies, procedures, protocols and tools" that ensure "the protection of information assets and ICT assets, including the prevention of their deterioration, damage, unauthorized access or use." Article 11 goes further, requiring "ICT response and recovery plans" that address the containment of incidents and the prevention of propagation.

But DORA does not prescribe specific engineering patterns. It sets the outcome: cascading failures must be prevented, incidents must be contained, and critical functions must degrade gracefully rather than failing catastrophically. The engineering patterns that deliver these outcomes — circuit breakers, bulkheads, timeouts, retries with backoff, and graceful degradation hierarchies — are the subject of this guide.

Circuit Breaker Pattern

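The circuit breaker pattern prevents a failing dependency from consuming the resources of the calling system. Named after electrical circuit breakers that prevent overcurrent from damaging downstream equipment, the software circuit breaker monitors the failure rate of calls to a dependency and "trips" when the failure rate exceeds a threshold. A closed circuit passes calls through normally. Once tripped, the open circuit rejects calls immediately instead of waiting on the failing dependency. After a reset timeout, the circuit moves to a half-open state that admits a limited number of trial requests, and it closes again only if enough of them succeed.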
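The state machine described above can be sketched in a few dozen lines. This is a minimal, single-threaded illustration rather than a production implementation: the names CircuitBreaker and CircuitOpenError and the injectable clock are assumptions made for the example, and it does not enforce a half-open request limit or a per-request timeout.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative breaker that counts consecutive failures (not thread-safe)."""

    def __init__(self, failure_threshold=5, reset_timeout=60.0,
                 success_threshold=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.success_threshold = success_threshold
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency circuit is open")
            # Reset timeout elapsed: allow trial requests through.
            self.state = State.HALF_OPEN
            self.successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # Any failure while half-open reopens the circuit immediately.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
            self.failures = 0

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0  # a success resets the consecutive-failure count
```

A caller would wrap each dependency invocation, for example breaker.call(fetch_balance, account_id), where fetch_balance is a hypothetical function, and treat CircuitOpenError as a signal to fall back or fail fast.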

Configuration for Banking Systems

Circuit breaker parameters must be calibrated to the specific dependency and the consequences of both false positives (tripping when the dependency is actually healthy) and false negatives (not tripping when the dependency is failing):

Parameter	Recommended default	Critical path	Non-critical path
Failure threshold	5 consecutive failures	3 consecutive failures	10 consecutive failures
Timeout per request	30 seconds	5 seconds	60 seconds
Reset timeout	60 seconds	30 seconds	120 seconds
Success threshold to close	3 consecutive successes	5 consecutive successes	2 consecutive successes
Half-open request limit	1	1	3

Critical path refers to dependencies that support critical functions — core banking, payment processing, authentication. These must trip faster (lower failure threshold) and recover more cautiously (higher success threshold to close) because the cost of continued failure is higher than the cost of temporary circuit interruption.

Non-critical path refers to dependencies that support ancillary functions — reporting, analytics, notifications. These can tolerate more failures before tripping because the impact of temporary unavailability is lower.
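One way to keep these parameters reviewable is to encode the table as named profiles and map each dependency to a profile. The sketch below assumes hypothetical dependency names (payments-core, auth-service, report-export) and a hypothetical BreakerConfig type. The numeric values are taken from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    failure_threshold: int    # consecutive failures before tripping
    request_timeout_s: float  # timeout per request
    reset_timeout_s: float    # time spent open before trial requests
    success_threshold: int    # consecutive successes needed to close
    half_open_limit: int      # trial requests allowed while half-open


# Values mirror the configuration table.
PROFILES = {
    "default": BreakerConfig(5, 30.0, 60.0, 3, 1),
    "critical": BreakerConfig(3, 5.0, 30.0, 5, 1),
    "non_critical": BreakerConfig(10, 60.0, 120.0, 2, 3),
}

# Hypothetical dependency names, for illustration only.
DEPENDENCY_PROFILE = {
    "payments-core": "critical",
    "auth-service": "critical",
    "report-export": "non_critical",
}


def config_for(dependency: str) -> BreakerConfig:
    """Return the breaker profile for a dependency, falling back to the default."""
    return PROFILES[DEPENDENCY_PROFILE.get(dependency, "default")]
```

Keeping the mapping in one place makes it easier to show a reviewer which dependencies were classified as critical and why.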

DORA Compliance Linkage

Circuit breakers directly address several DORA requirements:

DORA requirement	Circuit breaker contribution
Art. 9(1) prevention of deterioration	Prevents a failing dependency from degrading the calling system
Art. 10(1) anomaly detection	Circuit state transitions are monitorable events that indicate dependency health
Art. 11(2)(a) containment	Open circuit contains the failure to the affected dependency
Art. 11(5) communication	Circuit state can trigger automated stakeholder notification

Bulkhead Pattern

Where circuit breakers protect against dependency failures, bulkheads protect against resource exhaustion. Named after the watertight compartments in ship hulls that prevent a breach in one compartment from sinking the entire vessel, bulkhead isolation partitions system resources so that a surge or failure in one component cannot consume the resources needed by other components.

Resource Isolation Dimensions

Bulkhead isolation must be applied across multiple resource dimensions:

Resource	Isolation mechanism	Purpose
Thread pools	Dedicated thread pools per service category	Prevent thread starvation across unrelated services
Connection pools	Dedicated database connection pools per workload	Prevent query surge in one module from blocking others
Memory	Container memory limits or per-process JVM heap limits	Prevent memory-hungry operations from triggering out-of-memory (OOM) failures in other workloads
CPU	Container CPU limits or cgroup constraints	Prevent CPU-intensive operations from starving latency-sensitive ones
Queue capacity	Bounded queues per workload category	Prevent unbounded queue growth from exhausting memory
Storage I/O	Dedicated storage volumes or I/O scheduling	Prevent bulk operations from degrading transactional I/O
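As a rough illustration of the concurrency dimension, a bulkhead can be approximated with a bounded semaphore per workload category that rejects work when full instead of queueing it. The class and error names below are assumptions for the example. Memory, CPU, and I/O isolation are normally enforced by the platform (containers, cgroups) rather than in application code.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a bulkhead has no free slots."""


class Bulkhead:
    """Caps concurrent work for one workload category."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self.max_concurrent = max_concurrent
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._in_use = 0

    @contextmanager
    def slot(self):
        # Reject immediately rather than block, so a surge in this
        # category cannot tie up callers that belong to other categories.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        with self._lock:
            self._in_use += 1
        try:
            yield
        finally:
            with self._lock:
                self._in_use -= 1
            self._slots.release()

    @property
    def utilization(self) -> float:
        """Fraction of slots currently in use (0.0 to 1.0)."""
        with self._lock:
            return self._in_use / self.max_concurrent
```

With separate instances such as Bulkhead("payments", 50) and Bulkhead("reports", 5), where the sizes are illustrative, a flood of report requests exhausts only the reports compartment.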

Priority Hierarchy for Financial Systems

When resource contention occurs despite bulkhead isolation, the system must know which functions to protect and which to shed. DORA Art. 11(3) requires that response and recovery plans prioritize the resumption of "critical or important functions." This maps to a graceful degradation hierarchy:

Priority	Function category	Degradation behavior	Recovery order
P0 (never shed)	Authentication, authorization	No degradation — if auth fails, everything fails securely	First
P1 (protect)	Payment processing, core banking transactions	Degrade non-essential features, maintain core transaction path	Second
P2 (important)	Evidence management, audit trail	Queue writes if under pressure, never drop	Third
P3 (degradable)	Report generation, exports, dashboards	Return cached data or "temporarily unavailable"	Fourth
P4 (deferrable)	Notifications, analytics, search	Defer processing to background queue	Fifth
P5 (sheddable)	UI enhancements, non-critical integrations	Disable without user-visible error	Last
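The hierarchy can be expressed as a small decision function that maps a request's priority and the current resource utilization to a degradation action. The utilization thresholds and action names below are assumptions chosen for illustration. The source table defines the behaviors but not the trigger points.

```python
from enum import IntEnum


class Priority(IntEnum):
    P0 = 0  # authentication, authorization
    P1 = 1  # payment processing, core banking transactions
    P2 = 2  # evidence management, audit trail
    P3 = 3  # report generation, exports, dashboards
    P4 = 4  # notifications, analytics, search
    P5 = 5  # UI enhancements, non-critical integrations


# Hypothetical thresholds: lower-priority tiers degrade earlier.
DEGRADATION = {
    Priority.P2: (0.95, "queue_write"),   # queue, never drop
    Priority.P3: (0.90, "serve_cached"),
    Priority.P4: (0.80, "defer"),
    Priority.P5: (0.70, "disable"),
}


def decide(priority: Priority, utilization: float) -> str:
    """Return the action for a request at the given utilization (0.0 to 1.0)."""
    rule = DEGRADATION.get(priority)
    if rule is not None and utilization >= rule[0]:
        return rule[1]
    # P0 and P1 have no rule here: their core path is always served.
    return "serve"
```

Because P0 and P1 never appear in the rule table, no utilization value can cause this function to shed them, which keeps the "never shed" guarantee visible in code review.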

Timeout and Retry Patterns

Timeouts and retries are the granular mechanisms that complement circuit breakers and bulkheads. Without timeouts, a slow dependency can hold threads indefinitely, eventually exhausting the thread pool and causing cascading failure. Without retries (with appropriate backoff), transient failures that would resolve on their own surface to the caller as errors.

Timeout Configuration

Dependency type	Connect timeout	Read timeout	Total timeout	DORA rationale
Database query (transactional)	5s	30s	30s	Art. 9 — prevent resource lock from slow queries
Database query (reporting)	5s	60s	60s	Art. 11 — reporting can tolerate longer execution
External API call	5s	15s	30s	Art. 9 — prevent third-party slowness from propagating
File/object storage	5s	60s	120s	Art. 12 — backup/restore may handle large objects
Authentication service	2s	5s	10s	Art. 9(4) — auth must be fast; slow auth degrades everything
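The total timeout only has meaning if each phase of a request is bounded by the time remaining in the overall budget. The following sketch encodes the table as policies and derives per-phase timeouts from a deadline. The TimeoutPolicy and Deadline names and the policy keys are assumptions for the example, and how the resulting values are passed to a client library depends on that library.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutPolicy:
    connect_s: float
    read_s: float
    total_s: float


# Values mirror the timeout table.
POLICIES = {
    "db_transactional": TimeoutPolicy(5, 30, 30),
    "db_reporting": TimeoutPolicy(5, 60, 60),
    "external_api": TimeoutPolicy(5, 15, 30),
    "object_storage": TimeoutPolicy(5, 60, 120),
    "auth_service": TimeoutPolicy(2, 5, 10),
}


class Deadline:
    """Tracks the total budget for one logical request, including retries."""

    def __init__(self, policy: TimeoutPolicy, clock=time.monotonic):
        self.policy = policy
        self.clock = clock
        self.expires_at = clock() + policy.total_s

    def remaining(self) -> float:
        return max(0.0, self.expires_at - self.clock())

    def connect_timeout(self) -> float:
        # Never wait longer than what is left of the total budget.
        return min(self.policy.connect_s, self.remaining())

    def read_timeout(self) -> float:
        return min(self.policy.read_s, self.remaining())
```

If an external API call has already consumed 20 seconds of its 30-second budget, read_timeout() returns 10 seconds rather than the nominal 15, so the caller cannot overrun the total.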

Retry Strategy: Exponential Backoff with Jitter

Backoff parameters:

Base delay: 1 second
Maximum delay: 30 seconds
Jitter range: +/- 25% of calculated delay
Maximum retries: 3 for user-facing requests, 5 for background jobs

Jitter prevents the thundering herd problem, in which many clients retry at the same moment after a dependency recovers and overwhelm it again.
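A minimal sketch of these parameters follows. It assumes the delay doubles on each attempt (the text names exponential backoff but not the multiplier), clamps the jittered value to the maximum delay, and retries only a hypothetical TransientError type so that permanent errors are not retried.

```python
import random
import time

BASE_DELAY_S = 1.0
MAX_DELAY_S = 30.0
JITTER = 0.25  # +/- 25% of the calculated delay


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt: int) -> float:
    """Delay before retry number `attempt` (0-based), in seconds."""
    # Assumption: doubling factor of 2 per attempt, capped at the maximum.
    calculated = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
    jittered = calculated * random.uniform(1 - JITTER, 1 + JITTER)
    return min(MAX_DELAY_S, jittered)


def call_with_retries(func, max_retries=3, sleep=time.sleep):
    """Call func, retrying transient failures up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure
            sleep(backoff_delay(attempt))
```

User-facing code paths would use max_retries=3 and background jobs max_retries=5, matching the list above. In combination with a circuit breaker, retries should stop as soon as the circuit opens.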

Monitoring and Observability

Resilience patterns are only effective if they are monitored. An open circuit breaker that nobody notices is an unreported partial outage. A bulkhead that is consistently at capacity is a capacity planning signal.

Golden Signals for Resilience Patterns

Metric	What it measures	Alert threshold	DORA reporting link
Circuit breaker state changes	Dependency health transitions	Any transition to OPEN	Art. 10 detection
Bulkhead utilization %	Resource pool consumption	>80% sustained for 5 minutes	Art. 9 capacity planning
Timeout rate by dependency	Dependency responsiveness	>5% of requests timing out	Art. 10 anomaly detection
Retry rate by dependency	Transient failure frequency	>10% of requests requiring retry	Art. 10 anomaly detection
Graceful degradation activations	System under stress	Any P3+ degradation active	Art. 11 incident response
Circuit recovery time	Dependency recovery speed	>5 minutes from OPEN to CLOSED	Art. 12 recovery targets

These metrics feed into the institution's DORA KPI dashboard and, for significant events, into Art. 14 board reporting.
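Thresholds such as ">80% sustained for 5 minutes" need a precise definition before they can be alerted on. One possible reading, sketched below, fires only when every sample since the start of the current breach exceeds the threshold and the breach has lasted at least the full window. The function name and the sample format are assumptions for the example. Most monitoring systems offer an equivalent built-in rule.

```python
def sustained_above(samples, threshold=0.80, window_s=300.0):
    """Return True if the metric has stayed above threshold for window_s.

    samples: list of (timestamp_seconds, value) tuples in ascending time order.
    """
    if not samples:
        return False
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None    # any dip resets the breach
    latest_ts = samples[-1][0]
    return breach_start is not None and latest_ts - breach_start >= window_s
```

Requiring an unbroken breach avoids paging on short spikes while still surfacing pools that are persistently near capacity.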

Testing Resilience Patterns

Articles 24 to 27 of DORA require resilience testing that covers the institution's ICT systems. Resilience patterns must be tested specifically:

Chaos engineering: Inject failures into dependencies and verify that circuit breakers trip, bulkheads contain the impact, and graceful degradation activates correctly. This is not theoretical analysis — it is controlled experimentation in a production-equivalent environment.

Load testing: Apply sustained load that exceeds normal capacity and verify that bulkhead isolation prevents resource exhaustion in critical paths. Measure the actual degradation hierarchy against the designed priority order.

Recovery testing: After a dependency failure, measure the actual recovery time — from circuit breaker OPEN to fully resumed service. Compare against the RTO targets defined in the business impact analysis.

Each test must produce evidence — the test configuration, the observed behavior, the metrics collected, and the comparison against expected outcomes. This evidence is stored in the evidence vault and demonstrates to supervisors that resilience patterns are not just designed but validated.
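As an illustration of what such evidence might look like for a recovery test, the sketch below records the configuration, the observed timestamps, and the comparison against the RTO target in a serializable form. The RecoveryTestEvidence type and its field names are hypothetical. The actual schema would follow whatever the institution's evidence vault expects.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RecoveryTestEvidence:
    dependency: str
    configuration: dict          # breaker and timeout settings under test
    circuit_opened_at_s: float   # when the circuit opened (test clock)
    service_resumed_at_s: float  # when service was fully resumed
    rto_target_s: float          # target from the business impact analysis

    @property
    def recovery_time_s(self) -> float:
        return self.service_resumed_at_s - self.circuit_opened_at_s

    @property
    def within_rto(self) -> bool:
        return self.recovery_time_s <= self.rto_target_s

    def to_json(self) -> str:
        """Serialize inputs plus derived results for storage."""
        record = asdict(self)
        record["recovery_time_s"] = self.recovery_time_s
        record["within_rto"] = self.within_rto
        return json.dumps(record, sort_keys=True, indent=2)
```

Storing the derived fields alongside the raw timestamps lets a reviewer recompute the result instead of taking the pass/fail flag on trust.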

Use the DORA readiness assessment to evaluate your resilience engineering maturity, review the ENISA guidelines on ICT resilience testing for supervisory expectations, and consult the RTS/ITS reference for technical standards on ICT systems testing.

Anti-Patterns to Avoid

Retry without backoff. Immediate retries on failure amplify the load on an already struggling dependency. Always use exponential backoff with jitter.

Circuit breaker without monitoring. An open circuit breaker is a partial outage. If nobody is notified, users experience degraded service without understanding why, and the incident goes unreported.

Bulkhead with shared bottleneck. Isolating thread pools while sharing a single database connection pool defeats the purpose. Bulkhead isolation must be applied at every resource layer.

Graceful degradation without testing. Designing a degradation hierarchy on paper does not prove it works. Without testing, the first real activation may reveal unexpected dependencies between priority levels.

Timeouts longer than user patience. A 60-second timeout on a user-facing API means the user sees a spinner for 60 seconds before receiving an error. User-facing timeouts should be short (5-15 seconds) with clear feedback.

Conclusion

Circuit breakers, bulkheads, and graceful degradation are the engineering mechanisms that translate DORA's resilience requirements into running software. They prevent cascading failures (Art. 9), enable anomaly detection (Art. 10), contain incidents (Art. 11), and support recovery within defined targets (Art. 12).

But these patterns are not install-and-forget infrastructure components. They require careful configuration calibrated to each dependency's characteristics, continuous monitoring with alerts on state changes, regular testing under realistic failure conditions, and evidence that proves they work. The institutions that implement resilience patterns as a living engineering discipline will weather ICT incidents with contained impact and rapid recovery. The institutions that implement them as checkbox configurations will discover their inadequacy during a real cascading failure.

Summary

Articles 9 to 11 of DORA require protection, detection, and response mechanisms that prevent cascading failures in interconnected ICT systems. This article presents three resilience engineering patterns: circuit breakers, which protect against dependency failures using closed, open, and half-open states and parameters calibrated for critical and non-critical paths; bulkhead isolation, which partitions system resources (thread pools, database connections, memory, CPU) to keep resource exhaustion from spreading between components; and graceful degradation with a six-level priority hierarchy (from P0 authentication to P5 UI enhancements). The article also covers timeout and retry strategies with exponential backoff and jitter, observability metrics for each resilience pattern, testing approaches (chaos engineering, load testing, recovery testing), and anti-patterns to avoid. These patterns are not install-and-forget components: they require precise configuration, continuous monitoring, and regular testing.