
Circuit Breakers and Bulkheads: Engineering Resilience Patterns for DORA Banking Systems

DORA Atlas Editorial, 11 min read

When One Failure Becomes Many

On July 19, 2024, a single faulty update to a cybersecurity agent brought down systems across airlines, hospitals, banks, and government agencies worldwide. The technical root cause was a single piece of software — but the impact cascaded because interconnected systems lacked isolation boundaries. When the security agent crashed, it took the operating system with it. When the operating system crashed, every application on that host became unavailable. When those applications became unavailable, downstream services that depended on them failed. The cascade propagated through dependency chains that spanned organizations, sectors, and continents.

This is precisely the failure mode that DORA was designed to address. Article 9(1) requires financial entities to implement "ICT security policies, procedures, protocols and tools" that ensure "the protection of information assets and ICT assets, including the prevention of their deterioration, damage, unauthorized access or use." Article 11 goes further, requiring "ICT response and recovery plans" that address the containment of incidents and the prevention of propagation.

But DORA does not prescribe specific engineering patterns. It sets the outcome: cascading failures must be prevented, incidents must be contained, and critical functions must degrade gracefully rather than failing catastrophically. The engineering patterns that deliver these outcomes — circuit breakers, bulkheads, timeouts, retries with backoff, and graceful degradation hierarchies — are the subject of this guide.

Circuit Breaker Pattern

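The circuit breaker pattern prevents a failing dependency from consuming the resources of the calling system. Named after electrical circuit breakers that prevent overcurrent from damaging downstream equipment, the software circuit breaker monitors calls to a dependency and "trips" when failures exceed a threshold. It moves between three states: closed (calls pass through and failures are counted), open (calls are rejected immediately without reaching the dependency), and half-open (after a reset timeout, a limited number of trial calls are let through to test whether the dependency has recovered).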
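The state machine behind a circuit breaker can be sketched in a few dozen lines. The following is a minimal, single-threaded illustration rather than a production implementation: the class and method names are hypothetical, it counts consecutive failures instead of a rolling failure rate, and it does not cap the number of concurrent half-open trial calls.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # calls pass through, failures are counted
    OPEN = "open"            # calls are rejected without reaching the dependency
    HALF_OPEN = "half_open"  # trial calls probe whether the dependency recovered


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    # Hypothetical sketch: not thread-safe, no half-open concurrency limit.
    def __init__(self, failure_threshold=5, reset_timeout=60.0,
                 success_threshold=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.success_threshold = success_threshold
        self._clock = clock
        self.state = State.CLOSED
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency circuit is open")
            # Reset timeout has elapsed: let this call through as a probe.
            self.state = State.HALF_OPEN
            self._successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self._failures += 1
        # Any failure while half-open reopens the circuit immediately.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
            self._failures = 0

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = State.CLOSED
                self._failures = 0
        else:
            # A success while closed resets the consecutive-failure count.
            self._failures = 0
```

Injecting the clock as a parameter is a design choice that makes the reset timeout testable without sleeping; in a real service the default monotonic clock would be used.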

Configuration for Banking Systems

Circuit breaker parameters must be calibrated to the specific dependency and the consequences of both false positives (tripping when the dependency is actually healthy) and false negatives (not tripping when the dependency is failing):

| Parameter | Recommended default | Critical path | Non-critical path |
| --- | --- | --- | --- |
| Failure threshold | 5 consecutive failures | 3 consecutive failures | 10 consecutive failures |
| Timeout per request | 30 seconds | 5 seconds | 60 seconds |
| Reset timeout | 60 seconds | 30 seconds | 120 seconds |
| Success threshold to close | 3 consecutive successes | 5 consecutive successes | 2 consecutive successes |
| Half-open request limit | 1 | 1 | 3 |

Critical path refers to dependencies that support critical functions — core banking, payment processing, authentication. These must trip faster (lower failure threshold) and recover more cautiously (higher success threshold to close) because the cost of continued failure is higher than the cost of temporary circuit interruption.

Non-critical path refers to dependencies that support ancillary functions — reporting, analytics, notifications. These can tolerate more failures before tripping because the impact of temporary unavailability is lower.
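One way to keep these calibrations explicit and reviewable is to hold them as named profiles in code or configuration. The sketch below encodes the three columns of the parameter table; the `BreakerProfile` type, the `profile_for` helper, and the dependency names are hypothetical and only illustrate the mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerProfile:
    failure_threshold: int    # consecutive failures before the circuit trips
    request_timeout_s: float  # timeout applied to each request
    reset_timeout_s: float    # time spent open before a half-open probe
    success_threshold: int    # consecutive successes needed to close again
    half_open_limit: int      # trial requests allowed while half-open


# Values taken from the parameter table above.
PROFILES = {
    "default": BreakerProfile(5, 30.0, 60.0, 3, 1),
    "critical": BreakerProfile(3, 5.0, 30.0, 5, 1),
    "non_critical": BreakerProfile(10, 60.0, 120.0, 2, 3),
}

# Hypothetical dependency names, for illustration only.
DEPENDENCY_PROFILE = {
    "payment-gateway": "critical",
    "auth-service": "critical",
    "report-renderer": "non_critical",
}


def profile_for(dependency: str) -> BreakerProfile:
    """Return the profile for a dependency, falling back to the default."""
    return PROFILES[DEPENDENCY_PROFILE.get(dependency, "default")]
```

Falling back to the default profile means an unclassified dependency still gets a breaker, while the explicit mapping documents which dependencies were deliberately treated as critical.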

DORA Compliance Linkage

Circuit breakers directly address several DORA requirements:

| DORA requirement | Circuit breaker contribution |
| --- | --- |
| Art. 9(1) prevention of deterioration | Prevents a failing dependency from degrading the calling system |
| Art. 10(1) anomaly detection | Circuit state transitions are monitorable events that indicate dependency health |
| Art. 11(2)(a) containment | Open circuit contains the failure to the affected dependency |
| Art. 11(5) communication | Circuit state can trigger automated stakeholder notification |

Bulkhead Pattern

Where circuit breakers protect against dependency failures, bulkheads protect against resource exhaustion. Named after the watertight compartments in ship hulls that prevent a breach in one compartment from sinking the entire vessel, bulkhead isolation partitions system resources so that a surge or failure in one component cannot consume the resources needed by other components.

Resource Isolation Dimensions

Bulkhead isolation must be applied across multiple resource dimensions:

| Resource | Isolation mechanism | Purpose |
| --- | --- | --- |
| Thread pools | Dedicated thread pools per service category | Prevent thread starvation across unrelated services |
| Connection pools | Dedicated database connection pools per workload | Prevent query surge in one module from blocking others |
| Memory | Container memory limits or JVM heap partitioning | Prevent memory-hungry operations from triggering OOM |
| CPU | Container CPU limits or cgroup constraints | Prevent CPU-intensive operations from starving latency-sensitive ones |
| Queue capacity | Bounded queues per workload category | Prevent unbounded queue growth from exhausting memory |
| Storage I/O | Dedicated storage volumes or I/O scheduling | Prevent bulk operations from degrading transactional I/O |
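At the application level, the simplest bulkhead is a concurrency limit per workload that rejects excess work instead of queueing it. The sketch below uses a semaphore for this; the `Bulkhead` class and the workload names in the usage note are hypothetical, and container-level limits on memory, CPU, and I/O would be configured outside application code.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a workload has used up its own concurrency slots."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self.max_concurrent = max_concurrent
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._in_use = 0

    @contextmanager
    def slot(self):
        # Fail fast rather than block: a full compartment must not make
        # callers wait on resources that belong to this workload.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is at capacity")
        with self._lock:
            self._in_use += 1
        try:
            yield
        finally:
            with self._lock:
                self._in_use -= 1
            self._slots.release()

    def utilization(self) -> float:
        """Fraction of slots currently in use, suitable for export as a metric."""
        with self._lock:
            return self._in_use / self.max_concurrent
```

With, for example, `payments = Bulkhead("payments", 20)` and `reporting = Bulkhead("reporting", 5)`, a surge of report requests can exhaust only the five reporting slots; payment calls wrapped in `with payments.slot():` are unaffected.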

Priority Hierarchy for Financial Systems

When resource contention occurs despite bulkhead isolation, the system must know which functions to protect and which to shed. DORA Art. 11(3) requires that response and recovery plans prioritize the resumption of "critical or important functions." This maps to a graceful degradation hierarchy:

| Priority | Function category | Degradation behavior | Recovery order |
| --- | --- | --- | --- |
| P0 (never shed) | Authentication, authorization | No degradation; if auth fails, everything fails securely | First |
| P1 (protect) | Payment processing, core banking transactions | Degrade non-essential features, maintain core transaction path | Second |
| P2 (important) | Evidence management, audit trail | Queue writes if under pressure, never drop | Third |
| P3 (degradable) | Report generation, exports, dashboards | Return cached data or "temporarily unavailable" | Fourth |
| P4 (deferrable) | Notifications, analytics, search | Defer processing to background queue | Fifth |
| P5 (sheddable) | UI enhancements, non-critical integrations | Disable without user-visible error | Last |
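The hierarchy implies an ordering in both directions: degrade from the bottom up, recover from the top down. A minimal sketch of that ordering follows. The integer `pressure_level` input is an assumption made for illustration (how contention is measured and mapped to a level is left open), and the function names are hypothetical.

```python
from enum import IntEnum


class Priority(IntEnum):
    P0 = 0  # authentication, authorization: never shed
    P1 = 1  # payment processing, core banking transactions
    P2 = 2  # evidence management, audit trail
    P3 = 3  # report generation, exports, dashboards
    P4 = 4  # notifications, analytics, search
    P5 = 5  # UI enhancements, non-critical integrations


def degraded_priorities(pressure_level: int) -> list:
    """Priorities whose degradation behavior is active at a pressure level.

    Level 0 degrades nothing; each further level adds the next-lowest
    priority, starting at P5. The level is capped so P0 is never included.
    """
    level = max(0, min(pressure_level, 5))
    return [Priority(p) for p in range(5, 5 - level, -1)]


def recovery_order(degraded: list) -> list:
    """Restore the most important degraded functions first."""
    return sorted(degraded)
```

For instance, `degraded_priorities(2)` activates the P5 and P4 behaviors only, and `recovery_order` on that result restores P4 before P5.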

Timeout and Retry Patterns

Timeouts and retries are the granular mechanisms that complement circuit breakers and bulkheads. Without timeouts, a slow dependency can hold threads indefinitely, eventually exhausting the thread pool and causing cascading failure. Without retries (with appropriate backoff), transient failures that would resolve on their own become permanent failures.

Timeout Configuration

| Dependency type | Connect timeout | Read timeout | Total timeout | DORA rationale |
| --- | --- | --- | --- | --- |
| Database query (transactional) | 5s | 30s | 30s | Art. 9: prevent resource lock from slow queries |
| Database query (reporting) | 5s | 60s | 60s | Art. 11: reporting can tolerate longer execution |
| External API call | 5s | 15s | 30s | Art. 9: prevent third-party slowness from propagating |
| File/object storage | 5s | 60s | 120s | Art. 12: backup/restore may handle large objects |
| Authentication service | 2s | 5s | 10s | Art. 9(4): auth must be fast; slow auth degrades everything |
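The total timeout in the table is a budget for the whole operation, so the per-phase connect and read timeouts should shrink as that budget is consumed. The sketch below shows one way to enforce this with a deadline object; the `TimeoutPolicy` and `Deadline` names and the dictionary keys are hypothetical, while the values come from the table.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutPolicy:
    connect_s: float
    read_s: float
    total_s: float


# Values from the timeout table above; the keys are illustrative.
TIMEOUTS = {
    "db_transactional": TimeoutPolicy(5.0, 30.0, 30.0),
    "db_reporting": TimeoutPolicy(5.0, 60.0, 60.0),
    "external_api": TimeoutPolicy(5.0, 15.0, 30.0),
    "object_storage": TimeoutPolicy(5.0, 60.0, 120.0),
    "auth_service": TimeoutPolicy(2.0, 5.0, 10.0),
}


class Deadline:
    """Tracks the remaining total budget for one logical operation."""

    def __init__(self, total_s: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + total_s

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def clamp(self, timeout_s: float) -> float:
        """Limit a per-phase timeout to what is left of the total budget."""
        return min(timeout_s, self.remaining())
```

A caller would create one `Deadline` per operation and pass `deadline.clamp(policy.connect_s)` and `deadline.clamp(policy.read_s)` to the client library, so that retries and slow phases cannot push the operation past its total timeout.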

Retry Strategy: Exponential Backoff with Jitter

Backoff parameters:

  • Base delay: 1 second
  • Maximum delay: 30 seconds
  • Jitter range: +/- 25% of calculated delay
  • Maximum retries: 3 for user-facing requests, 5 for background jobs

The jitter prevents thundering herd problems — when many clients retry simultaneously after a dependency recovers, overwhelming it again.
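The parameters above translate into a short delay calculation. In the sketch below, the doubling factor is an assumption (the parameters give a base delay and a maximum but no multiplier), the cap is applied after jitter so the 30-second maximum always holds, and catching every exception is a simplification: real code would retry only errors known to be transient. The function names are hypothetical.

```python
import random
import time


def backoff_delay(attempt, base_s=1.0, max_s=30.0, jitter=0.25, rng=random.random):
    """Delay in seconds before retry number `attempt` (1-based)."""
    raw = base_s * (2 ** (attempt - 1))          # 1s, 2s, 4s, 8s, ...
    # rng() is in [0, 1), so the factor lies in [1 - jitter, 1 + jitter).
    factor = 1.0 + jitter * (2.0 * rng() - 1.0)
    return min(max_s, raw * factor)


def call_with_retry(func, max_retries=3, sleep=time.sleep, rng=random.random):
    """Call func, retrying up to max_retries times with backoff and jitter."""
    for attempt in range(1, max_retries + 2):
        try:
            return func()
        except Exception:
            if attempt > max_retries:
                raise  # retries exhausted: surface the last error
            sleep(backoff_delay(attempt, rng=rng))
```

User-facing requests would use the default `max_retries=3`, and background jobs would pass `max_retries=5`, matching the parameters listed above.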

Monitoring and Observability

Resilience patterns are only effective if they are monitored. An open circuit breaker that nobody notices is an unreported partial outage. A bulkhead that is consistently at capacity is a capacity planning signal.

Golden Signals for Resilience Patterns

| Metric | What it measures | Alert threshold | DORA reporting link |
| --- | --- | --- | --- |
| Circuit breaker state changes | Dependency health transitions | Any transition to OPEN | Art. 10 detection |
| Bulkhead utilization % | Resource pool consumption | >80% sustained for 5 minutes | Art. 9 capacity planning |
| Timeout rate by dependency | Dependency responsiveness | >5% of requests timing out | Art. 10 anomaly detection |
| Retry rate by dependency | Transient failure frequency | >10% of requests requiring retry | Art. 10 anomaly detection |
| Graceful degradation activations | System under stress | Any P3+ degradation active | Art. 11 incident response |
| Recovery time after circuit close | Dependency recovery speed | >5 minutes from OPEN to CLOSED | Art. 12 recovery targets |

These metrics feed into the institution's DORA KPI dashboard and, for significant events, into Art. 14 board reporting.
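In practice these thresholds are usually expressed as rules in the monitoring platform, but the logic is simple enough to sketch directly. The helpers below are hypothetical: one checks a "sustained above threshold" condition over timestamped samples, as used for bulkhead utilization, and one checks a rate threshold, as used for timeouts and retries.

```python
def sustained_above(samples, threshold=0.80, window_s=300.0):
    """True if the most recent unbroken run above threshold spans the window.

    `samples` is a chronological list of (timestamp_seconds, value) pairs.
    """
    if not samples:
        return False
    latest_t = samples[-1][0]
    run_start = None
    # Walk backwards until the first sample at or below the threshold.
    for t, value in reversed(samples):
        if value <= threshold:
            break
        run_start = t
    return run_start is not None and latest_t - run_start >= window_s


def rate_exceeds(count, total, limit):
    """True if count/total is strictly above limit, e.g. 0.05 for timeouts."""
    return total > 0 and count / total > limit
```

A single dip below the threshold restarts the run, which keeps short spikes from raising a capacity alert while still catching a pool that stays saturated.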

Testing Resilience Patterns

Articles 24 to 27 of DORA require resilience testing that covers the institution's ICT systems. Resilience patterns must be tested specifically:

Chaos engineering: Inject failures into dependencies and verify that circuit breakers trip, bulkheads contain the impact, and graceful degradation activates correctly. This is not theoretical analysis — it is controlled experimentation in a production-equivalent environment.
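A fault-injecting wrapper around a dependency call is one simple way to run such an experiment in a test environment. The sketch below is hypothetical and deliberately small: it fails a configurable fraction of calls and counts what it injected, so the observed behavior of breakers and bulkheads can be compared against the number of faults actually introduced.

```python
import random


class InjectedFault(Exception):
    """Marks a failure that was deliberately injected by the experiment."""


class FaultInjector:
    def __init__(self, func, failure_rate, rng=random.random):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self._func = func
        self.failure_rate = failure_rate
        self._rng = rng
        self.injected = 0  # calls failed on purpose
        self.passed = 0    # calls forwarded to the real dependency

    def __call__(self, *args, **kwargs):
        # rng() is in [0, 1): a rate of 1.0 always fails, 0.0 never does.
        if self._rng() < self.failure_rate:
            self.injected += 1
            raise InjectedFault("injected dependency failure")
        self.passed += 1
        return self._func(*args, **kwargs)
```

Using a dedicated exception type makes injected failures easy to separate from genuine ones when the test evidence is reviewed.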

Load testing: Apply sustained load that exceeds normal capacity and verify that bulkhead isolation prevents resource exhaustion in critical paths. Measure the actual degradation hierarchy against the designed priority order.

Recovery testing: After a dependency failure, measure the actual recovery time — from circuit breaker OPEN to fully resumed service. Compare against the RTO targets defined in the business impact analysis.
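If circuit state transitions are logged with timestamps, recovery time can be computed from the log rather than estimated. The sketch below assumes a chronological list of `(timestamp_seconds, state)` events with state names `OPEN`, `HALF_OPEN`, and `CLOSED`; the event format and function names are assumptions made for illustration.

```python
def recovery_durations(events):
    """Seconds from each first OPEN to the next CLOSED.

    A failed half-open probe (HALF_OPEN followed by OPEN) does not restart
    the measurement, because the outage is still ongoing. An outage that
    has not closed yet is not included in the result.
    """
    durations = []
    opened_at = None
    for timestamp, state in events:
        if state == "OPEN" and opened_at is None:
            opened_at = timestamp
        elif state == "CLOSED" and opened_at is not None:
            durations.append(timestamp - opened_at)
            opened_at = None
    return durations


def meets_rto(events, rto_s):
    """True if every completed recovery finished within the RTO target."""
    return all(d <= rto_s for d in recovery_durations(events))
```

Measuring from the first OPEN rather than the last one gives the duration the business actually experienced, which is the figure to compare against the RTO.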

Each test must produce evidence — the test configuration, the observed behavior, the metrics collected, and the comparison against expected outcomes. This evidence is stored in the evidence vault and demonstrates to supervisors that resilience patterns are not just designed but validated.

Use the DORA readiness assessment to evaluate your resilience engineering maturity, review the ENISA guidelines on ICT resilience testing for supervisory expectations, and consult the RTS/ITS reference for technical standards on ICT systems testing.

Anti-Patterns to Avoid

Retry without backoff. Immediate retries on failure amplify the load on an already struggling dependency. Always use exponential backoff with jitter.

Circuit breaker without monitoring. An open circuit breaker is a partial outage. If nobody is notified, users experience degraded service without understanding why, and the incident goes unreported.

Bulkhead with shared bottleneck. Isolating thread pools while sharing a single database connection pool defeats the purpose. Bulkhead isolation must be applied at every resource layer.

Graceful degradation without testing. Designing a degradation hierarchy on paper does not prove it works. Without testing, the first real activation may reveal unexpected dependencies between priority levels.

Timeouts longer than user patience. A 60-second timeout on a user-facing API means the user sees a spinner for 60 seconds before receiving an error. User-facing timeouts should be short (5-15 seconds) with clear feedback.

Conclusion

Circuit breakers, bulkheads, and graceful degradation are the engineering mechanisms that translate DORA's resilience requirements into running software. They prevent cascading failures (Art. 9), enable anomaly detection (Art. 10), contain incidents (Art. 11), and support recovery within defined targets (Art. 12).

But these patterns are not install-and-forget infrastructure components. They require careful configuration calibrated to each dependency's characteristics, continuous monitoring with alerts on state changes, regular testing under realistic failure conditions, and evidence that proves they work. The institutions that implement resilience patterns as a living engineering discipline will weather ICT incidents with contained impact and rapid recovery. The institutions that implement them as checkbox configurations will discover their inadequacy during a real cascading failure.


Summary

Articles 9 to 11 of DORA require protection, detection, and response mechanisms that prevent cascading failures in interconnected ICT systems. This article presents three resilience engineering patterns: circuit breakers, which protect against dependency failures using closed/open/half-open states and parameters calibrated for critical and non-critical paths; bulkhead isolation, which partitions system resources (thread pools, database connections, memory, CPU) so that resource exhaustion cannot spread between components; and graceful degradation with a six-level priority hierarchy (from P0 authentication to P5 UI enhancements). The article also covers timeout and retry strategies with exponential backoff and jitter, observability metrics for each resilience pattern, testing approaches (chaos engineering, load testing, recovery testing), and anti-patterns to avoid. These patterns are not install-and-forget components: they require precise configuration, continuous monitoring, and regular testing.
