Observability Under DORA: Why Correlation IDs and Structured Logging Are Now Regulatory Requirements

DORA Atlas Editorial | November 10, 2025 | 11 min read

The Detection Gap

When a financial institution suffers an ICT incident, the clock starts immediately. DORA Art. 17 requires an incident management process that classifies the incident, using the criteria set out in Art. 18. Art. 19 requires initial notification to competent authorities within hours for major incidents. Art. 13 requires that lessons be extracted and the ICT risk management framework improved accordingly.

Each of these requirements depends on a capability that most European financial institutions take for granted: the ability to understand what happened, when it happened, and what was affected. This capability has a name: observability.

Observability is the ability to infer the internal state of a system from its external outputs — logs, metrics, and traces. In a modern financial institution, a single customer payment might traverse 15-30 internal services: API gateway, authentication, authorization, fraud detection, sanctions screening, account validation, payment routing, core banking ledger, confirmation generation, and notification dispatch. If any of these services fails — silently, intermittently, or catastrophically — the institution must detect the failure, classify it, determine its impact, and report it. Without observability, this is impossible within DORA's timelines.

The ECB's 2024 cyber stress test found that detection time — the interval between incident occurrence and institutional awareness — was one of the widest-varying metrics across 109 banks. Some institutions detected incidents in minutes. Others took days. The difference was not technology budget. It was observability maturity.

DORA's Implicit Observability Requirements

DORA does not use the word "observability." But its requirements describe observability capabilities with precision:

Art. 10: Detection

Art. 10 requires "mechanisms to promptly detect anomalous activities, including ICT network performance issues and ICT-related incidents." This requires:

- Baseline behavior models: you cannot detect anomalies without knowing what is normal
- Real-time monitoring: "promptly" means detection lag measured in minutes, not hours
- Comprehensive coverage: the regulation names both ICT network performance issues and ICT-related incidents
- Automated alerting: human-only detection cannot achieve "prompt" detection at scale
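As a rough illustration of a baseline model feeding automated alerting, the sketch below flags a measurement that deviates strongly from a rolling window of recent values. The class name `BaselineDetector`, the window size, the minimum sample count, and the z-score threshold are illustrative assumptions, not values taken from the regulation.

```python
from collections import deque
from statistics import mean, stdev


class BaselineDetector:
    """Flags values that deviate strongly from a rolling baseline (hypothetical helper)."""

    def __init__(self, window=60, threshold=3.0, min_samples=10):
        self.samples = deque(maxlen=window)  # most recent "normal" values
        self.threshold = threshold           # z-score cut-off, an illustrative choice
        self.min_samples = min_samples       # wait for a minimal baseline first

    def observe(self, value):
        """Return True if `value` is anomalous relative to the current window."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = stdev(self.samples)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        if not anomalous:
            # Only normal samples update the baseline, so an ongoing
            # incident does not quietly become the new "normal".
            self.samples.append(value)
        return anomalous


detector = BaselineDetector()
# Thirty ordinary latency readings (made-up data), then one spike.
latencies_ms = [100 + (i % 5) for i in range(30)] + [480]
flags = [detector.observe(v) for v in latencies_ms]
```

Here only the final reading is flagged. A production detector would likely account for time-of-day seasonality, which this sketch ignores.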

Art. 17: Incident Classification

Art. 17(1) requires financial entities to "define, establish and implement an ICT-related incident management process to detect, manage and notify ICT-related incidents." That process must classify each incident, and Art. 18 sets the classification criteria. Classification requires understanding the incident's scope: which systems are affected, which business functions are impacted, how many customers are involved, and what data is at risk. This understanding comes from observability data, not from asking engineers to investigate manually.

Art. 19: Reporting Timeline

The incident reporting timeline under Art. 19 is demanding. For major incidents, the reporting technical standards adopted under DORA require the initial notification within four hours of classifying the incident as major and no later than 24 hours after the institution becomes aware of it. This is not achievable if the institution spends the first hours trying to understand what happened. Observability, particularly distributed tracing and correlated logging, provides the data needed to classify and report within that window.
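The time budget can be made concrete with a little date arithmetic. The sketch below assumes the initial notification is due four hours after classification as major and at most 24 hours after the institution becomes aware of the incident; both windows are assumptions taken from the reporting standards and should be checked against the current text. The function name and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed windows; confirm against the applicable reporting standards before use.
INITIAL_AFTER_CLASSIFICATION = timedelta(hours=4)
INITIAL_AFTER_AWARENESS = timedelta(hours=24)


def initial_notification_deadline(aware_at, classified_at):
    """Return the earlier of the two candidate deadlines."""
    return min(
        classified_at + INITIAL_AFTER_CLASSIFICATION,
        aware_at + INITIAL_AFTER_AWARENESS,
    )


# Made-up incident: detected at 09:00 UTC, classified as major at 10:30 UTC.
aware_at = datetime(2025, 11, 10, 9, 0, tzinfo=timezone.utc)
classified_at = datetime(2025, 11, 10, 10, 30, tzinfo=timezone.utc)
deadline = initial_notification_deadline(aware_at, classified_at)

# Time left if the impact analysis is still running at 11:00 UTC.
now = datetime(2025, 11, 10, 11, 0, tzinfo=timezone.utc)
remaining = deadline - now
```

In this example the deadline falls at 14:30 UTC, leaving three and a half hours at 11:00. Every hour spent reconstructing events by hand comes out of that budget.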

The Three Pillars of Observability for DORA

1. Structured Logging

Unstructured logs (free-text strings written to files) are the norm in many financial institutions. They are human-readable in isolation but impractical to parse by machine at scale. When an incident occurs and the institution needs to correlate events across 20 services, running grep over heterogeneous free-text log formats is not a detection strategy. It is digital archaeology.

Structured logging means every log entry is a machine-parseable data structure (typically JSON) with consistent fields:

| Field | Purpose | DORA Alignment |
| --- | --- | --- |
| `timestamp` | Precise event timing (ISO 8601, UTC) | Art. 10: prompt detection timeline |
| `correlation_id` | Links all log entries for a single transaction | Art. 17: incident scope determination |
| `service` | Which service emitted the log | Art. 8: asset identification in incidents |
| `level` | Severity (DEBUG/INFO/WARN/ERROR/FATAL) | Art. 17: incident classification |
| `action` | What operation was performed | Art. 10: anomaly detection baseline |
| `actor` | User or service identity | Art. 9: access control audit trail |
| `entity_type` / `entity_id` | What business entity was affected | Art. 17: impact assessment |
| `duration_ms` | Processing time | Art. 10: performance anomaly detection |
| `status` | Success/failure/timeout | Art. 10: error rate monitoring |
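One way to emit entries with these fields is a custom formatter on Python's standard `logging` module. This is a minimal sketch: the `JsonFormatter` class, the service name, and all logged values are illustrative assumptions, and an in-memory stream stands in for a real log shipper.

```python
import io
import json
import logging
from datetime import datetime, timezone

# Field names follow the table above; the values logged below are made up.
CONTEXT_FIELDS = ("correlation_id", "service", "action", "actor",
                  "entity_type", "entity_id", "duration_ms", "status")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        entry = {
            # ISO 8601 in UTC, derived from the record's creation time.
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            # Python names its levels WARNING and CRITICAL rather than WARN and FATAL.
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for name in CONTEXT_FIELDS:
            entry[name] = getattr(record, name, None)
        return json.dumps(entry)


stream = io.StringIO()  # stands in for a file or log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-routing")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("payment routed", extra={
    "correlation_id": "txn-8847",
    "service": "payment-routing",
    "action": "route_payment",
    "actor": "svc-api-gateway",
    "entity_type": "payment",
    "entity_id": "pay-001",
    "duration_ms": 42,
    "status": "success",
})
entry = json.loads(stream.getvalue())
```

Missing fields are written as `null` rather than omitted, so every entry has the same shape and downstream queries do not need to handle absent keys.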

2. Distributed Tracing

A distributed trace is a causal chain of operations across services. When a customer initiates a payment and the trace shows the request passing through API gateway → authentication → fraud check → sanctions screening → payment routing → core banking → confirmation, the institution has a complete picture of the transaction's path, timing, and outcome.

When that payment fails — at sanctions screening, for example — the trace shows exactly where, when, and why. This is the data that makes Art. 17 classification possible in minutes rather than hours.

Without distributed tracing, the institution observes a symptom: "payments are slow." With tracing, the institution identifies the cause: "sanctions screening latency has increased 10x, likely due to a list update that increased the dataset by 40%." The latter enables Art. 17 classification and Art. 10 detection. The former enables only panic.
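The step from symptom to cause can be sketched with a toy trace. The `Span` structure, the per-service baselines, and all durations below are hypothetical; real tracing systems record richer span data with parent and child relationships.

```python
from dataclasses import dataclass


@dataclass
class Span:
    service: str
    duration_ms: float
    baseline_ms: float  # typical duration for this service, assumed known
    status: str = "ok"


def slowest_relative_to_baseline(spans):
    """Return the span whose duration grew most compared with its baseline."""
    return max(spans, key=lambda s: s.duration_ms / s.baseline_ms)


def first_failure(spans):
    """Return the first span (in call order) that did not succeed, or None."""
    return next((s for s in spans if s.status != "ok"), None)


# One payment's trace in call order (made-up numbers).
trace = [
    Span("api-gateway", 12, 10),
    Span("authentication", 25, 20),
    Span("fraud-check", 60, 55),
    Span("sanctions-screening", 450, 45, status="timeout"),
]

suspect = slowest_relative_to_baseline(trace)
slowdown = suspect.duration_ms / suspect.baseline_ms
failed = first_failure(trace)
```

Both checks point at `sanctions-screening`, which runs at ten times its baseline in this example. That is the kind of evidence that turns "payments are slow" into a classifiable finding.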

3. Metrics and Alerting

Metrics are numerical measurements aggregated over time: request rate, error rate, latency percentiles, resource utilization, queue depth, connection pool saturation. They provide the statistical foundation for anomaly detection.

For DORA compliance, the critical metrics are the "golden signals":

| Golden Signal | Metric | DORA Alignment | Alert Condition |
| --- | --- | --- | --- |
| Latency | P50, P95, P99 response time | Art. 7: reliable systems | P95 > 2x baseline for 5 minutes |
| Traffic | Requests per second | Art. 10: anomaly detection | > 3x or < 0.3x normal for time of day |
| Errors | Error rate (%) | Art. 10: anomalous activities | > 1% for critical services |
| Saturation | CPU, memory, disk, connections | Art. 7: sufficient capacity | > 80% sustained for 10 minutes |

Alerting transforms metrics into action. But alerting without context produces noise. A spike in payment service error rate is an alert. A spike in payment service error rate correlated with a deployment 3 minutes ago, affecting only transactions routed through a specific payment processor, with the root cause visible in the distributed trace — that is actionable intelligence for Art. 17 classification.
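The alert conditions in the golden-signals table can be expressed as simple rules over per-minute samples. The function names and the sample data below are assumptions for illustration; the thresholds are the ones listed in the table.

```python
def sustained_breach(samples, limit, minutes):
    """True if the most recent `minutes` one-minute samples all exceed `limit`."""
    recent = samples[-minutes:]
    return len(recent) == minutes and all(value > limit for value in recent)


def traffic_anomaly(rps, normal_rps):
    """True if traffic is above 3x or below 0.3x the normal rate for this time of day."""
    return rps > 3 * normal_rps or rps < 0.3 * normal_rps


def evaluate(p95_ms, p95_baseline_ms, rps, normal_rps, error_rate, cpu_utilization):
    """Return the names of the golden signals currently in alert."""
    alerts = []
    if sustained_breach(p95_ms, 2 * p95_baseline_ms, minutes=5):
        alerts.append("latency")
    if traffic_anomaly(rps, normal_rps):
        alerts.append("traffic")
    if error_rate > 0.01:
        alerts.append("errors")
    if sustained_breach(cpu_utilization, 0.80, minutes=10):
        alerts.append("saturation")
    return alerts


# Made-up samples: P95 latency has been above 2x baseline for five minutes.
alerts = evaluate(
    p95_ms=[120, 130, 260, 270, 280, 300, 310],
    p95_baseline_ms=120,
    rps=95,
    normal_rps=100,
    error_rate=0.004,
    cpu_utilization=[0.55] * 10,
)
```

Requiring the breach to be sustained, rather than alerting on a single sample, is one common way to reduce noise. The deployment and routing context described above would still need to be attached from other data sources.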

Correlation IDs: The Connective Tissue

The correlation ID (or request ID, trace ID) is the single most important observability primitive for DORA compliance. It is a unique identifier generated at the entry point of a transaction and propagated through every service, log entry, database query, message queue event, and audit record that the transaction touches.

Without correlation IDs: the institution has logs from 20 services, each containing thousands of entries per minute, with no way to reconstruct which entries relate to a specific customer transaction, incident, or business function.

With correlation IDs: a single query for correlation_id = txn-8847 across the centralized logs of all services returns every entry for that transaction, and sorting by timestamp yields the complete history in chronological order. Incident classification time drops from hours to minutes. Root cause analysis time drops from days to hours.
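A minimal sketch of that lookup over JSON log lines follows. It assumes each entry carries `timestamp` and `correlation_id` fields and that all services write timestamps in one consistent ISO 8601 UTC format; the function name and the sample lines are hypothetical.

```python
import json


def transaction_history(log_lines, correlation_id):
    """Collect all entries for one transaction, oldest first."""
    entries = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured JSON
        if isinstance(entry, dict) and entry.get("correlation_id") == correlation_id:
            entries.append(entry)
    # Timestamps in one consistent ISO 8601 UTC format sort correctly as strings.
    return sorted(entries, key=lambda e: e.get("timestamp", ""))


# Lines as they might arrive from several services, out of order (made-up data).
logs = [
    '{"timestamp": "2025-11-10T09:00:00.120+00:00", "service": "fraud-detection", "correlation_id": "txn-8847", "status": "success"}',
    '{"timestamp": "2025-11-10T09:00:00.050+00:00", "service": "api-gateway", "correlation_id": "txn-8847", "status": "success"}',
    '{"timestamp": "2025-11-10T09:00:00.090+00:00", "service": "api-gateway", "correlation_id": "txn-9001", "status": "success"}',
    "legacy free-text line without structure",
]
history = transaction_history(logs, "txn-8847")
path = [e["service"] for e in history]
```

In practice the same filter would run as a query in the log aggregation backend rather than in application code, but the logic is the same: match on the ID, then order by time.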

Correlation IDs are not optional for DORA compliance. They are the mechanism that makes Art. 10 detection, Art. 17 classification, Art. 19 reporting, and Art. 13 post-incident learning possible at the speed the regulation demands.

Implementation requirements:

- Generated at the system boundary (API gateway, message consumer, scheduled job)
- Propagated through HTTP headers (X-Correlation-ID), message metadata, and database transaction context
- Included in every log entry, audit event, and error message
- Stored in incident records for post-incident analysis
- Queryable across all observability backends (log aggregation, tracing, metrics)
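The first three requirements can be sketched with Python's `contextvars` and `logging` modules. The helper names (`start_transaction`, `outgoing_headers`, `CorrelationFilter`) and the `txn-` prefix are assumptions for illustration; the header name is the one mentioned above.

```python
import contextvars
import logging
import uuid

HEADER = "X-Correlation-ID"

# Holds the ID for the current request or task without passing it through every call.
correlation_id_var = contextvars.ContextVar("correlation_id", default=None)


def start_transaction(incoming_headers):
    """At the system boundary: reuse the caller's ID if present, otherwise mint one."""
    # Note: a plain dict lookup is case-sensitive; real HTTP frameworks
    # usually expose case-insensitive header mappings.
    cid = incoming_headers.get(HEADER) or f"txn-{uuid.uuid4()}"
    correlation_id_var.set(cid)
    return cid


def outgoing_headers(headers=None):
    """Copy `headers` and add the current correlation ID for a downstream call."""
    result = dict(headers or {})
    cid = correlation_id_var.get()
    if cid is not None:
        result[HEADER] = cid
    return result


class CorrelationFilter(logging.Filter):
    """Stamp every log record with the current correlation ID."""

    def filter(self, record):
        record.correlation_id = correlation_id_var.get()
        return True


cid = start_transaction({HEADER: "txn-8847"})
downstream = outgoing_headers({"Accept": "application/json"})
```

Attaching `CorrelationFilter` to a handler means individual log calls do not need to pass the ID explicitly, which reduces the chance of an entry being written without it.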

Building DORA-Grade Observability

Architecture

The observability architecture for DORA compliance must support:

- Centralized log aggregation from all ICT systems (unified, not per team or per application)
- Distributed tracing across service boundaries (including third-party integration points for Art. 28 monitoring)
- Metric collection with retention aligned to DORA's 10-year audit requirement for significant incidents
- Alerting with escalation paths that map to the institution's incident classification thresholds

Retention and Integrity

Observability data for DORA compliance is not ephemeral operational data. It is evidence. Log entries that record system behavior during an incident may be requested by supervisors months or years later. The EBA and national competent authorities expect that institutions can reconstruct the state of their systems at the time of an incident — from observability data.

This means:

- Retention policies aligned with regulatory requirements (typically 5-10 years for incident-related data)
- Integrity protection (append-only logging, tamper detection) to ensure log entries cannot be modified after the fact
- Access control on observability data (who can view, who can query, who can delete)
- Data classification of observability data itself (logs may contain customer PII, transaction amounts, or other sensitive data)
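One common technique for tamper detection is a hash chain, in which each record's hash covers the previous record's hash. The sketch below is an illustration under that assumption, not a prescribed mechanism; the function names and the log entries are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(prev_hash, entry):
    """Hash the previous hash together with a canonical JSON form of the entry."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain, entry):
    """Add an entry whose hash depends on everything recorded before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"entry": entry, "prev_hash": prev_hash, "hash": _digest(prev_hash, entry)})


def verify(chain):
    """Return True only if no record was altered, removed, or reordered."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev_hash"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["entry"]):
            return False
        prev_hash = record["hash"]
    return True


chain = []
append(chain, {"correlation_id": "txn-8847", "service": "sanctions-screening", "status": "timeout"})
append(chain, {"correlation_id": "txn-8847", "service": "payment-routing", "status": "failure"})
intact = verify(chain)

chain[0]["entry"]["status"] = "success"  # simulate an after-the-fact edit
tamper_detected = not verify(chain)
```

A chain like this detects modification but does not prevent it. Someone able to rewrite the whole store could recompute every hash, so the latest hash is typically anchored somewhere the log writer cannot alter.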

Third-Party Observability

DORA Art. 28 third-party risk management extends to observability. The institution must have visibility into the performance and availability of third-party ICT services. This requires:

- SLA monitoring for third-party APIs (latency, availability, error rates)
- Third-party incident detection through observability signals (increased error rates from a vendor API)
- Contractual provisions (per Art. 30) requiring third-party providers to share relevant observability data or grant monitoring access

Without third-party observability, the institution cannot distinguish between an internal system failure and a third-party service degradation — a distinction that is critical for Art. 17 classification and Art. 28 risk assessment.
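A simple version of that distinction can be drawn from structured log entries, provided calls to vendor APIs are logged under their own `service` name. The vendor names, the 1% threshold, and the sample entries below are hypothetical.

```python
from collections import Counter

# Hypothetical names under which calls to external providers are logged.
THIRD_PARTY_SERVICES = {"sanctions-vendor-api", "card-processor-api"}


def error_rates(entries):
    """Compute the share of non-successful entries per service."""
    totals, errors = Counter(), Counter()
    for entry in entries:
        totals[entry["service"]] += 1
        if entry["status"] != "success":
            errors[entry["service"]] += 1
    return {service: errors[service] / totals[service] for service in totals}


def degraded_services(entries, threshold=0.01):
    """Map each service above the error threshold to 'third-party' or 'internal'."""
    return {
        service: "third-party" if service in THIRD_PARTY_SERVICES else "internal"
        for service, rate in error_rates(entries).items()
        if rate > threshold
    }


# Made-up entries: the vendor API is failing, internal routing is healthy.
entries = (
    [{"service": "payment-routing", "status": "success"}] * 200
    + [{"service": "sanctions-vendor-api", "status": "success"}] * 90
    + [{"service": "sanctions-vendor-api", "status": "timeout"}] * 10
)
degraded = degraded_services(entries)
```

Here only the vendor API crosses the threshold, which points the classification toward a third-party degradation rather than an internal failure.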

From Engineering Practice to Regulatory Requirement

The shift that DORA drives is cultural as much as technical. Observability has traditionally been owned by engineering teams as an operational tool — used for debugging, performance optimization, and on-call incident response. DORA makes observability a compliance function:

- Art. 10 detection requires observability as the detection mechanism
- Art. 17 classification requires observability data as the classification evidence
- Art. 19 reporting requires observability-derived impact analysis
- Art. 13 learning requires observability data for post-incident analysis
- Art. 14 reporting requires observability-derived KPIs for management body reporting
- Art. 24-27 testing requires observability to validate that resilience tests produce the expected outcomes

Institutions that treat observability as an engineering concern — funded from the IT operations budget, governed by engineering standards, reported to the CTO — will find that DORA requires it to be a compliance concern — funded from the resilience budget, governed by regulatory standards, and reported to the management body.

Key Takeaways

- Correlation IDs are not optional. They are the mechanism that makes DORA Art. 10, 17, and 19 achievable within regulatory timelines. Every transaction must carry a correlation ID through every service it touches.
- Structured logging is a regulatory requirement. Unstructured logs cannot be queried at scale, correlated across services, or analyzed for anomaly detection. DORA's detection timeline demands machine-parseable logs.
- Distributed tracing enables rapid incident classification. Art. 17 requires classification of incidents, and tracing shows the root cause, scope, and impact in minutes rather than hours.
- Observability data is evidence. Retention, integrity, and access control must be treated with the same rigor as audit logs and financial records.
- Third-party observability is required under Art. 28. The institution must monitor third-party service quality as part of its ICT third-party risk management.

Summary (translated from French)

Under DORA, observability (the ability to infer a system's internal state from its outputs) is no longer just good engineering practice. It is an implicit regulatory requirement. Article 10 mandates prompt detection of anomalies, which requires baseline metrics, real-time monitoring, and automated alerting. Article 17 requires incident classification, which is impossible without distributed tracing and correlatable logging. Article 19 imposes reporting within tight deadlines that leave no room for hours of manual investigation. Correlation IDs are the fundamental building block: a unique identifier propagated through every service, database, and message queue makes it possible to reconstruct a complete transaction in seconds. Structured logging (JSON with standardized fields) replaces free-text logs that are unusable at scale. The three pillars (structured logging, distributed tracing, and metrics with alerting) must be deployed as compliance infrastructure, not as an engineering tool. Observability data is regulatory evidence, subject to the same retention, integrity, and access-control requirements as audit logs.