Circuit Breakers and Bulkheads: Engineering Resilience Patterns for DORA Banking Systems

When One Failure Becomes Many
On July 19, 2024, a single faulty update to a cybersecurity agent brought down systems across airlines, hospitals, banks, and government agencies worldwide. The technical root cause was a single piece of software — but the impact cascaded because interconnected systems lacked isolation boundaries. When the security agent crashed, it took the operating system with it. When the operating system crashed, every application on that host became unavailable. When those applications became unavailable, downstream services that depended on them failed. The cascade propagated through dependency chains that spanned organizations, sectors, and continents.
This is precisely the failure mode that DORA was designed to address. Article 9(1) requires financial entities to implement "ICT security policies, procedures, protocols and tools" that ensure "the protection of information assets and ICT assets, including the prevention of their deterioration, damage, unauthorized access or use." Article 11 goes further, requiring "ICT response and recovery plans" that address the containment of incidents and the prevention of propagation.
But DORA does not prescribe specific engineering patterns. It sets the outcome: cascading failures must be prevented, incidents must be contained, and critical functions must degrade gracefully rather than failing catastrophically. The engineering patterns that deliver these outcomes — circuit breakers, bulkheads, timeouts, retries with backoff, and graceful degradation hierarchies — are the subject of this guide.
Circuit Breaker Pattern
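The circuit breaker pattern prevents a failing dependency from consuming the resources of the calling system. Named after electrical circuit breakers that prevent overcurrent from damaging downstream equipment, the software circuit breaker monitors failed calls to a dependency and "trips" when failures exceed a threshold. A breaker has three states: closed (calls pass through and failures are counted), open (calls are rejected immediately, without touching the dependency), and half-open (after a reset timeout, a limited number of trial calls are allowed; enough successes close the circuit, and a failure reopens it).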
Configuration for Banking Systems
Circuit breaker parameters must be calibrated to the specific dependency and the consequences of both false positives (tripping when the dependency is actually healthy) and false negatives (not tripping when the dependency is failing):
| Parameter | Recommended default | Critical path | Non-critical path |
|---|---|---|---|
| Failure threshold | 5 consecutive failures | 3 consecutive failures | 10 consecutive failures |
| Timeout per request | 30 seconds | 5 seconds | 60 seconds |
| Reset timeout | 60 seconds | 30 seconds | 120 seconds |
| Success threshold to close | 3 consecutive successes | 5 consecutive successes | 2 consecutive successes |
| Half-open request limit | 1 | 1 | 3 |
Critical path refers to dependencies that support critical functions — core banking, payment processing, authentication. These must trip faster (lower failure threshold) and recover more cautiously (higher success threshold to close) because the cost of continued failure is higher than the cost of temporary circuit interruption.
Non-critical path refers to dependencies that support ancillary functions — reporting, analytics, notifications. These can tolerate more failures before tripping because the impact of temporary unavailability is lower.
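The state machine behind these parameters can be sketched in a few lines of Python. This is a minimal, single-threaded illustration and not a production implementation: the class and field names (`CircuitBreaker`, `BreakerConfig`, `CircuitOpenError`) are hypothetical, the per-request timeout is left to the wrapped call, and the half-open request limit is implicitly 1 because calls are not concurrent. Only the threshold values come from the table above.

```python
import time
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


@dataclass
class BreakerConfig:
    failure_threshold: int = 5    # consecutive failures before tripping
    reset_timeout: float = 60.0   # seconds to stay OPEN before probing
    success_threshold: int = 3    # consecutive successes needed to close


# Values from the "Critical path" column of the table above.
CRITICAL_PATH = BreakerConfig(failure_threshold=3, reset_timeout=30.0, success_threshold=5)


class CircuitBreaker:
    def __init__(self, config, clock=time.monotonic):
        self.config = config
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.config.reset_timeout:
                raise CircuitOpenError("dependency call rejected: circuit is open")
            # Reset timeout elapsed: let trial calls through.
            self.state = State.HALF_OPEN
            self.successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failure while probing, or too many in a row, opens the circuit.
        if self.state is State.HALF_OPEN or self.failures >= self.config.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
            self.failures = 0

    def _on_success(self):
        self.failures = 0
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.config.success_threshold:
                self.state = State.CLOSED
```

A caller wraps each dependency call, for example `breaker.call(fetch_balance, account_id)` with a hypothetical `fetch_balance`, and treats `CircuitOpenError` as the signal to use a fallback. A real deployment would also need thread safety and an enforced limit on concurrent half-open probes.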
DORA Compliance Linkage
Circuit breakers directly address several DORA requirements:
| DORA requirement | Circuit breaker contribution |
|---|---|
| Art. 9(1) prevention of deterioration | Prevents a failing dependency from degrading the calling system |
| Art. 10(1) anomaly detection | Circuit state transitions are monitorable events that indicate dependency health |
| Art. 11(2)(a) containment | Open circuit contains the failure to the affected dependency |
| Art. 11(5) communication | Circuit state can trigger automated stakeholder notification |
Bulkhead Pattern
Where circuit breakers protect against dependency failures, bulkheads protect against resource exhaustion. Named after the watertight compartments in ship hulls that prevent a breach in one compartment from sinking the entire vessel, bulkhead isolation partitions system resources so that a surge or failure in one component cannot consume the resources needed by other components.
Resource Isolation Dimensions
Bulkhead isolation must be applied across multiple resource dimensions:
| Resource | Isolation mechanism | Purpose |
|---|---|---|
| Thread pools | Dedicated thread pools per service category | Prevent thread starvation across unrelated services |
| Connection pools | Dedicated database connection pools per workload | Prevent query surge in one module from blocking others |
| Memory | Container memory limits or JVM heap partitioning | Prevent memory-hungry operations from triggering OOM |
| CPU | Container CPU limits or cgroup constraints | Prevent CPU-intensive operations from starving latency-sensitive ones |
| Queue capacity | Bounded queues per workload category | Prevent unbounded queue growth from exhausting memory |
| Storage I/O | Dedicated storage volumes or I/O scheduling | Prevent bulk operations from degrading transactional I/O |
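
The thread-pool and queue-capacity rows can be illustrated together. The sketch below gives each workload category its own worker pool and a bounded backlog, and rejects work once the compartment is full instead of queueing without limit. The `Bulkhead` class, the `BulkheadFullError` exception, and the pool sizes are assumptions made for illustration; they are not taken from a specific library or from the source.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BulkheadFullError(Exception):
    """Raised when a workload's compartment has no free capacity."""


class Bulkhead:
    """Dedicated worker pool plus a bounded backlog for one workload category."""

    def __init__(self, name, max_workers, max_queued):
        self.name = name
        self._pool = ThreadPoolExecutor(max_workers=max_workers, thread_name_prefix=name)
        # Permits cover running plus queued tasks, so the backlog stays bounded.
        self._permits = threading.BoundedSemaphore(max_workers + max_queued)

    def submit(self, func, *args):
        # Fail fast instead of blocking the caller when the compartment is full.
        if not self._permits.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            future = self._pool.submit(func, *args)
        except Exception:
            self._permits.release()
            raise
        # Return the permit once the task finishes, fails, or is cancelled.
        future.add_done_callback(lambda _f: self._permits.release())
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)


# Hypothetical sizing: payments and reporting get separate compartments.
payments = Bulkhead("payments", max_workers=8, max_queued=16)
reporting = Bulkhead("reporting", max_workers=2, max_queued=2)
```

With this arrangement a flood of report requests exhausts only the `reporting` compartment; `payments.submit(...)` keeps its own threads and backlog. The same idea applies to connection pools, where each workload would receive its own pool with its own maximum size.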
Priority Hierarchy for Financial Systems
When resource contention occurs despite bulkhead isolation, the system must know which functions to protect and which to shed. DORA Art. 11(3) requires that response and recovery plans prioritize the resumption of "critical or important functions." This maps to a graceful degradation hierarchy:
| Priority | Function category | Degradation behavior | Recovery order |
|---|---|---|---|
| P0 (never shed) | Authentication, authorization | No degradation — if auth fails, everything fails securely | First |
| P1 (protect) | Payment processing, core banking transactions | Degrade non-essential features, maintain core transaction path | Second |
| P2 (important) | Evidence management, audit trail | Queue writes if under pressure, never drop | Third |
| P3 (degradable) | Report generation, exports, dashboards | Return cached data or "temporarily unavailable" | Fourth |
| P4 (deferrable) | Notifications, analytics, search | Defer processing to background queue | Fifth |
| P5 (sheddable) | UI enhancements, non-critical integrations | Disable without user-visible error | Last |
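
One way to make the hierarchy executable is a small policy function that maps a request's priority and the current resource utilization to an action. The sketch below assumes a single utilization figure between 0 and 1 and invents the thresholds at which each tier starts to degrade; those numbers are placeholders to be calibrated, not recommendations from the source. Only the tier names and degradation behaviors come from the table.

```python
from enum import IntEnum


class Priority(IntEnum):
    P0_AUTH = 0
    P1_PAYMENTS = 1
    P2_EVIDENCE = 2
    P3_REPORTING = 3
    P4_DEFERRABLE = 4
    P5_SHEDDABLE = 5


# Hypothetical utilization levels at which each tier starts degrading.
# P0 and P1 are absent: this policy never sheds their core path.
DEGRADE_AT = {
    Priority.P5_SHEDDABLE: 0.70,
    Priority.P4_DEFERRABLE: 0.80,
    Priority.P3_REPORTING: 0.90,
    Priority.P2_EVIDENCE: 0.95,
}

# Degradation behavior per tier, following the table above.
ACTION = {
    Priority.P2_EVIDENCE: "queue",     # queue writes, never drop
    Priority.P3_REPORTING: "cached",   # cached data or "temporarily unavailable"
    Priority.P4_DEFERRABLE: "defer",   # move to a background queue
    Priority.P5_SHEDDABLE: "disable",  # switch off without a user-visible error
}


def decide(priority, utilization):
    """Return 'serve' or the degradation action for a request at this load."""
    threshold = DEGRADE_AT.get(priority)
    if threshold is None or utilization < threshold:
        return "serve"
    return ACTION[priority]
```

Because lower tiers have lower thresholds, they are shed first as load rises, and recovery runs in the opposite direction as utilization falls. Degrading the non-essential features of P1 while keeping its core transaction path is feature-level logic that this sketch does not model.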
Timeout and Retry Patterns
Timeouts and retries are the granular mechanisms that complement circuit breakers and bulkheads. Without timeouts, a slow dependency can hold threads indefinitely, eventually exhausting the thread pool and causing cascading failure. Without retries (with appropriate backoff), transient failures that would resolve on their own become permanent failures.
Timeout Configuration
| Dependency type | Connect timeout | Read timeout | Total timeout | DORA rationale |
|---|---|---|---|---|
| Database query (transactional) | 5s | 30s | 30s | Art. 9 — prevent resource lock from slow queries |
| Database query (reporting) | 5s | 60s | 60s | Art. 11 — reporting can tolerate longer execution |
| External API call | 5s | 15s | 30s | Art. 9 — prevent third-party slowness from propagating |
| File/object storage | 5s | 60s | 120s | Art. 12 — backup/restore may handle large objects |
| Authentication service | 2s | 5s | 10s | Art. 9(4) — auth must be fast; slow auth degrades everything |
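
The table can be carried into code as a set of named policies, with the total timeout treated as an overall deadline that shrinks the per-attempt timeouts as time is consumed. That interpretation of "total timeout" is an assumption, as are the names `TimeoutPolicy`, `POLICIES`, and `attempt_timeouts`; the numeric values are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutPolicy:
    connect: float  # seconds allowed to establish the connection
    read: float     # seconds allowed to wait for a response
    total: float    # overall deadline across all attempts


# Values copied from the table above (seconds).
POLICIES = {
    "db_transactional": TimeoutPolicy(5, 30, 30),
    "db_reporting": TimeoutPolicy(5, 60, 60),
    "external_api": TimeoutPolicy(5, 15, 30),
    "object_storage": TimeoutPolicy(5, 60, 120),
    "auth": TimeoutPolicy(2, 5, 10),
}


def attempt_timeouts(policy, elapsed):
    """Return (connect, read) timeouts for the next attempt.

    Each value is capped by what is left of the total budget. Returns None
    when the budget is already spent and the caller should give up.
    """
    remaining = policy.total - elapsed
    if remaining <= 0:
        return None
    return (min(policy.connect, remaining), min(policy.read, remaining))
```

The caller measures elapsed time since the operation started and passes it in before each attempt. The returned pair has the same (connect, read) shape that some HTTP clients accept as a timeout argument, for example the `timeout` tuple in the Python `requests` library.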
Retry Strategy: Exponential Backoff with Jitter
Backoff parameters:
- Base delay: 1 second
- Maximum delay: 30 seconds
- Jitter range: +/- 25% of calculated delay
- Maximum retries: 3 for user-facing requests, 5 for background jobs
The jitter prevents thundering herd problems — when many clients retry simultaneously after a dependency recovers, overwhelming it again.
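The parameters above translate into a short retry helper. This is a sketch under stated assumptions: the function names are hypothetical, only `ConnectionError` and `TimeoutError` are treated as transient, and jitter is applied after the cap, so an individual wait can exceed 30 seconds by up to 25%. Clamping after jitter instead would keep waits under the cap at the cost of less spread.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=30.0, jitter=0.25, rng=random):
    """Delay in seconds before retry number `attempt` (1-based)."""
    # Exponential growth: 1s, 2s, 4s, ... capped at `cap`.
    delay = min(cap, base * 2 ** (attempt - 1))
    # Spread retries across +/- `jitter` of the calculated delay.
    return delay * (1 + rng.uniform(-jitter, jitter))


def call_with_retry(func, max_retries=3, sleep=time.sleep,
                    retry_on=(ConnectionError, TimeoutError)):
    """Call func, retrying transient errors up to max_retries times."""
    for attempt in range(1, max_retries + 2):  # first try plus max_retries retries
        try:
            return func()
        except retry_on:
            if attempt > max_retries:
                raise  # retries exhausted: surface the last error
            sleep(backoff_delay(attempt))
```

A user-facing request would use the default `max_retries=3`, and a background job would pass `max_retries=5`. Retries should only wrap idempotent operations; retrying a payment submission without an idempotency key risks duplicate transactions.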
Monitoring and Observability
Resilience patterns are only effective if they are monitored. An open circuit breaker that nobody notices is an unreported partial outage. A bulkhead that is consistently at capacity is a capacity planning signal.
Golden Signals for Resilience Patterns
| Metric | What it measures | Alert threshold | DORA reporting link |
|---|---|---|---|
| Circuit breaker state changes | Dependency health transitions | Any transition to OPEN | Art. 10 detection |
| Bulkhead utilization % | Resource pool consumption | >80% sustained for 5 minutes | Art. 9 capacity planning |
| Timeout rate by dependency | Dependency responsiveness | >5% of requests timing out | Art. 10 anomaly detection |
| Retry rate by dependency | Transient failure frequency | >10% of requests requiring retry | Art. 10 anomaly detection |
| Graceful degradation activations | System under stress | Any P3+ degradation active | Art. 11 incident response |
| Recovery time after circuit open | Dependency recovery speed | >5 minutes from OPEN to CLOSED | Art. 12 recovery targets |
These metrics feed into the institution's DORA KPI dashboard and, for significant events, into Art. 14 board reporting.
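Thresholds such as ">80% sustained for 5 minutes" need a rule that ignores brief spikes. The sketch below tracks when a metric first crossed its threshold and fires only once the breach has lasted the full window. The class name and the sampling interface are hypothetical; in practice this logic usually lives in the alerting rules of the monitoring platform and is not hand-written.

```python
class SustainedThresholdAlert:
    """Fires when a metric stays above a threshold for a full window."""

    def __init__(self, threshold=0.80, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._breach_start = None  # timestamp of the first sample above threshold

    def observe(self, timestamp, value):
        """Record one sample; return True if the alert should fire."""
        if value <= self.threshold:
            # Any sample at or below the threshold resets the breach.
            self._breach_start = None
            return False
        if self._breach_start is None:
            self._breach_start = timestamp
        return timestamp - self._breach_start >= self.window_seconds
```

Feeding it bulkhead utilization sampled once a minute, the alert fires on the sample taken five minutes after utilization first exceeded 80%, provided no sample in between dropped back to 80% or below.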
Testing Resilience Patterns
Articles 24-27 of DORA require resilience testing that covers the institution's ICT systems. Resilience patterns must be tested specifically:
Chaos engineering: Inject failures into dependencies and verify that circuit breakers trip, bulkheads contain the impact, and graceful degradation activates correctly. This is not theoretical analysis — it is controlled experimentation in a production-equivalent environment.
Load testing: Apply sustained load that exceeds normal capacity and verify that bulkhead isolation prevents resource exhaustion in critical paths. Measure the actual degradation hierarchy against the designed priority order.
Recovery testing: After a dependency failure, measure the actual recovery time — from circuit breaker OPEN to fully resumed service. Compare against the RTO targets defined in the business impact analysis.
Each test must produce evidence — the test configuration, the observed behavior, the metrics collected, and the comparison against expected outcomes. This evidence is stored in the evidence vault and demonstrates to supervisors that resilience patterns are not just designed but validated.
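A fault-injection experiment and its evidence record can be sketched as follows. Everything here is illustrative: `FaultInjector`, `run_experiment`, the record fields, and the `quote` function with its cached fallback are hypothetical names, and a real experiment would target a production-equivalent environment with real dependencies, not an in-process stub.

```python
import json
from datetime import datetime, timezone


class FaultInjector:
    """Wraps a dependency call and fails it while `failing` is True."""

    def __init__(self, func):
        self.func = func
        self.failing = False

    def __call__(self, *args, **kwargs):
        if self.failing:
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)


def run_experiment(name, dependency, call_under_test, expected, attempts=5):
    """Inject a fault, record what callers observed, compare with expectation."""
    dependency.failing = True
    observed = []
    try:
        for _ in range(attempts):
            observed.append(call_under_test())
    finally:
        dependency.failing = False  # always restore the dependency
    return {
        "experiment": name,
        "executed_at": datetime.now(timezone.utc).isoformat(),
        "configuration": {"attempts": attempts, "fault": "ConnectionError"},
        "observed": observed,
        "expected": [expected] * attempts,
        "passed": observed == [expected] * attempts,
    }


# Hypothetical system under test: a quote lookup with a cached fallback.
rates = FaultInjector(lambda: "live")


def quote():
    try:
        return rates()
    except ConnectionError:
        return "cached"


evidence = run_experiment("rates-outage", rates, quote, expected="cached")
print(json.dumps(evidence, indent=2))
```

The resulting record holds the four elements listed above: the configuration, the observed behavior, the expected behavior, and the comparison. Metrics collected during the run would be attached in the same way before the record is stored.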
Use the DORA readiness assessment to evaluate your resilience engineering maturity, review the ENISA guidelines on ICT resilience testing for supervisory expectations, and consult the RTS/ITS reference for technical standards on ICT systems testing.
Anti-Patterns to Avoid
Retry without backoff. Immediate retries on failure amplify the load on an already struggling dependency. Always use exponential backoff with jitter.
Circuit breaker without monitoring. An open circuit breaker is a partial outage. If nobody is notified, users experience degraded service without understanding why, and the incident goes unreported.
Bulkhead with shared bottleneck. Isolating thread pools while sharing a single database connection pool defeats the purpose. Bulkhead isolation must be applied at every resource layer.
Graceful degradation without testing. Designing a degradation hierarchy on paper does not prove it works. Without testing, the first real activation may reveal unexpected dependencies between priority levels.
Timeouts longer than user patience. A 60-second timeout on a user-facing API means the user sees a spinner for 60 seconds before receiving an error. User-facing timeouts should be short (5-15 seconds) with clear feedback.
Conclusion
Circuit breakers, bulkheads, and graceful degradation are the engineering mechanisms that translate DORA's resilience requirements into running software. They prevent cascading failures (Art. 9), enable anomaly detection (Art. 10), contain incidents (Art. 11), and support recovery within defined targets (Art. 12).
But these patterns are not install-and-forget infrastructure components. They require careful configuration calibrated to each dependency's characteristics, continuous monitoring with alerts on state changes, regular testing under realistic failure conditions, and evidence that proves they work. The institutions that implement resilience patterns as a living engineering discipline will weather ICT incidents with contained impact and rapid recovery. The institutions that implement them as checkbox configurations will discover their inadequacy during a real cascading failure.
Summary
Articles 9 to 11 of DORA require protection, detection, and response mechanisms that prevent cascading failures in interconnected ICT systems. This article presents three resilience engineering patterns: circuit breakers, which protect against dependency failures using closed/open/half-open states and parameters calibrated for critical and non-critical paths; bulkhead isolation, which partitions system resources (thread pools, database connections, memory, CPU) so that resource exhaustion cannot spread between components; and graceful degradation, with a six-level priority hierarchy (from P0 authentication to P5 UI enhancements). The article also covers timeout and retry strategies with exponential backoff and jitter, observability metrics for each resilience pattern, testing approaches (chaos engineering, load testing, recovery testing), and anti-patterns to avoid. These patterns are not install-and-forget components: they require precise configuration, continuous monitoring, and regular testing.