AML Observability: Rethinking Transaction Monitoring as a Debuggable Compliance System
Jürgen Schiller García
PAPER · v1.6 · 2026-04-07 · human
Abstract
AML transaction monitoring systems are widely deployed across financial institutions, yet in production their decisions often remain difficult to reconstruct, explain, and evaluate. This paper argues that a key limitation is not only insufficient detection logic but also insufficient system observability. It defines AML Observability as the capability to reconstruct detection-relevant decisions across the full processing lifecycle, and it reframes recurring AML control breakdowns as diagnosability gaps rather than isolated detection misses. The paper makes five linked contributions. First, it proposes a five-layer AML observability architecture and the Transformation Spiral, a sequencing model linking data governance, system observability, AI observability, and AI enablement. Second, it interprets public enforcement cases through an OBASHI-informed lens and derives a compact anti-pattern library, including Upstream Blindspot, Silent Transformation Drop, Threshold Suppression, and Broken Feedback Loop. Third, it formalizes a minimal event-centric proof of concept in Python built around an AMLTrace/AMLTraceEvent model defined as an algebraic 7-tuple. The formalization covers explicit query semantics for why_flagged, why_not_flagged, and what_changed (with pseudocode and O(|E|) complexity analysis), privacy-aware trace retention, and a design for handling stateful temporal dependencies. Fourth, it extends the same trace schema toward AI governance by embedding model-specific artifacts (including SHAP-based feature attribution, confidence distributions, GNN embedding vectors, and KL-divergence drift indicators) into the compliance trace as native Ω payloads queryable through the same diagnostic semantics. Fifth, it adds a production-oriented architecture sketch and a larger synthetic comparison between a monitoring-only baseline and an observable system.
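To make the event-centric idea concrete, the following is a minimal sketch of how a trace of per-step events could answer why_flagged and why_not_flagged with a single linear scan over the event list, which is consistent with the O(|E|) bound mentioned above. It is not the paper's AMLTrace/AMLTraceEvent definition: the class names `TraceEvent` and `Trace`, the five fields, the layer labels, and the outcome vocabulary ("pass", "flag", "suppress", "drop") are all assumptions chosen for illustration, and the 7-tuple itself is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class TraceEvent:
    """One recorded processing step (hypothetical, simplified schema)."""
    tx_id: str
    layer: str      # assumed layer label, e.g. "ingestion" or "rules"
    step: str       # name of the processing step that emitted the event
    outcome: str    # assumed vocabulary: "pass", "flag", "suppress", "drop"
    payload: dict[str, Any] = field(default_factory=dict)


@dataclass
class Trace:
    """Append-only list of events with linear-scan diagnostic queries."""
    events: list[TraceEvent] = field(default_factory=list)

    def record(self, event: TraceEvent) -> None:
        self.events.append(event)

    def why_flagged(self, tx_id: str) -> list[TraceEvent]:
        # One pass over all events: every step that flagged this transaction.
        return [e for e in self.events
                if e.tx_id == tx_id and e.outcome == "flag"]

    def why_not_flagged(self, tx_id: str) -> Optional[TraceEvent]:
        # One pass over all events: the first step that dropped or
        # suppressed the transaction, or None if no such step was recorded.
        for e in self.events:
            if e.tx_id == tx_id and e.outcome in ("drop", "suppress"):
                return e
        return None


# Illustrative data only; amounts and thresholds are made up.
trace = Trace()
trace.record(TraceEvent("tx-1", "ingestion", "parse", "pass"))
trace.record(TraceEvent("tx-1", "rules", "threshold_check", "suppress",
                        {"amount": 9500, "threshold": 10000}))
trace.record(TraceEvent("tx-2", "rules", "threshold_check", "flag",
                        {"amount": 15000, "threshold": 10000}))

blocker = trace.why_not_flagged("tx-1")
print(blocker.layer, blocker.step, blocker.payload)
print([e.step for e in trace.why_flagged("tx-2")])
```

In this sketch a monitoring-only system would correspond to keeping only the final "flag" events, whereas the observable variant also retains the "suppress" and "drop" events that make a missing alert attributable to a specific layer and step.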
The larger evaluation processes 1,000 synthetic transactions, including 300 injected faults distributed across six fault families spanning the five-layer stack. Across the fault cases, the observable system raises mean diagnosis completeness from 0.25 to 1.00 and failure-layer attribution accuracy from 0.17 to 1.00, and it reduces explanation steps from 4.0 to 1.0, all relative to the baseline. A sizing model based on the current PoC event structure further indicates an average raw telemetry footprint of roughly 3.38 GB per one million transactions, falling to about 2.05 GB under the selective retention policy. These results remain synthetic and do not const