R-DG-10 | Data Governance & Integrity | DAMAGE 3.8 / High

Synthetic Data Provenance Loss

Agent-generated synthetic data becomes structurally identical to system-of-record data. BCBS 239 traceability and regulatory reporting integrity are compromised when provenance markers are lost.

The Risk

BCBS 239 and regulatory reporting require all data to be traceable to system-of-record sources. Synthetic data can be used for testing, backtesting, and gap-filling, but it must always be clearly marked as synthetic, never commingled with system-of-record data, and excluded from regulatory reports. Agents generate synthetic data continuously: forecasts, risk scores, imputed values, and generated scenarios. When this synthetic data is stored alongside system-of-record data without clear provenance markers, the distinction is lost, and downstream users cannot tell which data is observed and which is synthetic.

This creates a critical risk for regulatory reporting. A risk model trained partially on synthetic agent-generated data produces risk measures. Those risk measures are used to compute regulatory capital requirements. The regulator receives a capital report that is implicitly derived from synthetic data, but the report does not disclose this. The regulator believes the report is based entirely on system-of-record data. The institution has submitted a regulatory report of unknown accuracy because the provenance of underlying data is obscured.

The worst-case scenario is recursive: agent generates synthetic data; synthetic data is stored; synthetic data is used as input to future agents; future agents generate synthetic data from synthetic data; provenance is completely lost. The synthetic data may now be multiple generations removed from any system-of-record source. The institution cannot reconstruct whether any given risk measure is based on observed data or on multi-generational synthetic data.
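The recursion described above can be made visible by carrying a generation counter on every value. The following is a minimal sketch, not a prescribed implementation: the `DataPoint` type, the `generation` field, and the `derive` helper are hypothetical names introduced here for illustration, and the prices are made up.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DataPoint:
    value: float
    generation: int = 0  # 0 = observed system-of-record value (assumed convention)


def derive(value: float, inputs: Sequence[DataPoint]) -> DataPoint:
    """Create an agent-generated value one generation deeper than its deepest input."""
    deepest = max((p.generation for p in inputs), default=0)
    return DataPoint(value=value, generation=deepest + 1)


observed = DataPoint(101.2)                     # generation 0
imputed = derive(101.5, [observed])             # generation 1
forecast = derive(102.0, [observed, imputed])   # generation 2
scenario = derive(99.0, [forecast])             # generation 3
```

If the counter is dropped at any storage step, `scenario` becomes indistinguishable from `observed`, which is the loss the paragraph above describes.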

How It Materializes

A large bank uses agents to impute missing market data in its risk data warehouse. For certain emerging market currencies, historical price data is sparse (trading is infrequent, data vendors have gaps). The bank's risk models require daily price data for capital calculations. An agent is deployed to impute missing data points by reasoning about economic fundamentals, related currency pairs, and historical volatility patterns. The agent generates synthetic prices for missing days.

The synthetic prices are stored in the risk warehouse with timestamps and source identifiers, but without clear provenance markers indicating they are synthetic. Risk analysts use the warehouse to populate risk models. The risk models feed into capital requirement calculations. The capital calculations are reported to regulators. Regulators do not know that some prices in the warehouse are synthetic. The bank does not explicitly disclose synthetic data in its regulatory report.

Three years later, an auditor reviews the capital calculation methodology. The auditor discovers that prices for 15% of trading days in certain currency pairs are synthetic (agent-imputed). Those synthetic prices were used in risk models. Those models drove capital calculations. The bank's capital ratios are based partly on synthetic data. Regulators are alarmed. The bank's regulatory reporting is now suspect. The regulator demands recalculation of all capital measures using only system-of-record prices. The bank discovers it cannot easily reconstruct which data in the warehouse is synthetic because it did not maintain clear provenance. The audit takes months. The bank's credibility with regulators is damaged.
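Had each stored price carried a provenance flag, the recalculation the regulator demanded would reduce to a filter. The sketch below assumes a hypothetical `is_synthetic` field and a placeholder currency pair `USD/XXX`; the price series is invented so that 3 of 20 days are imputed, mirroring the 15% figure in the scenario.

```python
from datetime import date
from typing import List, NamedTuple


class Price(NamedTuple):
    day: date
    pair: str
    value: float
    is_synthetic: bool  # hypothetical provenance flag set at write time


def synthetic_share(prices: List[Price]) -> float:
    """Fraction of stored prices that were agent-imputed."""
    if not prices:
        return 0.0
    return sum(p.is_synthetic for p in prices) / len(prices)


def system_of_record_only(prices: List[Price]) -> List[Price]:
    """Inputs permitted for a recalculation restricted to observed prices."""
    return [p for p in prices if not p.is_synthetic]


# Illustrative data: 20 trading days, 3 of them imputed by the agent.
IMPUTED_DAYS = {5, 12, 19}
prices = [
    Price(date(2024, 1, d), "USD/XXX", 10.0 + 0.01 * d, d in IMPUTED_DAYS)
    for d in range(1, 21)
]

share = synthetic_share(prices)
clean = system_of_record_only(prices)
```

Without the flag, neither function can be written, and the institution is left reconstructing provenance from logs and vendor files.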

DAMAGE Score Breakdown

Dimension | Score | Rationale
--- | --- | ---
D - Detectability | 4 | Synthetic data provenance loss is difficult to detect because synthetic data looks identical to system-of-record data. Discovery occurs through audit or regulatory review.
A - Autonomy Sensitivity | 5 | Autonomous agents generate and store synthetic data without human verification. Provenance is not established during generation.
M - Multiplicative Potential | 5 | Synthetic data used as input to future agents generates multi-generational synthetic data. Provenance is exponentially obscured.
A - Attack Surface | 2 | Primarily a structural issue; not easily weaponized externally. Occurs naturally through normal agent synthetic data generation.
G - Governance Gap | 5 | BCBS 239 and regulatory reporting frameworks assume all data is traceable to system-of-record. Provenance loss breaks this assumption fundamentally.
E - Enterprise Impact | 5 | Regulatory reporting integrity compromised, potential regulator enforcement, restatement of risk metrics, reputational damage.
Composite DAMAGE Score | 3.8 | High. Requires priority attention with dedicated controls and monitoring.

Agent Impact Profile

How severity changes across the agent architecture spectrum.

Agent Type | Impact | How This Risk Manifests
--- | --- | ---
Digital Assistant | High | Even with human review, the reviewer cannot know whether the synthetic data will later remain distinguishable from system-of-record data.
Digital Apprentice | Critical | Progressive autonomy means synthetic data generation increases. Provenance markers get lost over time.
Autonomous Agent | Critical | Fully autonomous synthetic data generation without human awareness of provenance implications.
Delegating Agent | Critical | Agent determines which data sources to invoke and synthesize. May delegate synthesis to tools, losing provenance tracking.
Agent Crew / Pipeline | Critical | Multiple agents generate synthetic data in sequence. Provenance is lost at each step. Final outputs are multi-generational synthetic.
Agent Mesh / Swarm | Critical | Peer-to-peer agent network with continuous synthetic data generation across mesh. Provenance is completely opaque.

Regulatory Framework Mapping

Framework | Coverage | Citation | What It Addresses | What It Misses
--- | --- | --- | --- | ---
BCBS 239 | Complete | Principles 3, 5, 6, 8 | Requires system-of-record data, lineage, and accuracy. | Does not address synthetic data provenance in agent environments.
Basel III | Partial | Risk Data Aggregation | Requires accurate risk data for capital calculations. | Does not address synthetic data in risk data warehouses.
NIST AI RMF 1.0 | Partial | GOVERN 2.1 | Recommends data provenance and quality frameworks. | Does not address synthetic data provenance loss.
EU AI Act | Partial | Article 24 (Documentation) | Requires documentation of data sources and quality. | Does not explicitly address synthetic data provenance.
MAS AIRG | Moderate | Section 6.1 | Requires data governance and regulatory reporting integrity. | Does not address synthetic data provenance.
Gramm-Leach-Bliley Act | Partial | Regulation Y/H | Requires accurate financial information and records. | Does not address synthetic data in regulatory reports.
FDIC Guidance | Partial | Guidance on SR 11-7 | Addresses model validation and data governance for capital models. | Does not explicitly address synthetic data in risk data warehouses.
Sarbanes-Oxley 404 | Partial | IT Controls | Requires control over financial reporting systems. | Does not address synthetic data provenance.

Why This Matters in Regulated Industries

Capital requirements and risk measures are the foundation of regulatory oversight in banking. If those measures are based on synthetic data that is not clearly identified, regulators lose confidence in the institution's risk reporting. BCBS 239 enforcement has become a focus for regulators globally because accurate risk data is fundamental to prudential supervision. An institution that commingles synthetic and system-of-record data without clear provenance markers risks regulatory enforcement.

In insurance, loss reserves and capital calculations depend on accurate loss data. Synthetic loss scenarios can be valuable for stress testing, but they must be clearly distinguished from actual losses. If synthetic data is used to compute reserves without disclosure, the reserve integrity is compromised. In securities trading, synthetic price data can affect pricing models. If pricing models use synthetic data without disclosure, price discovery mechanisms are compromised. The regulatory reporting impact is particularly severe because regulators use reported data to make prudential decisions.

Controls & Mitigations

Design-Time Controls

  • Implement a "synthetic data quarantine" architecture: all agent-generated synthetic data is stored in separate databases, schemas, or logical partitions that are never commingled with system-of-record data.
  • Require all agents that generate synthetic data to attach immutable provenance markers using Component 2 (Cryptographic Identity): every synthetic data point includes a cryptographic signature, generation timestamp, methodology hash, and clear "SYNTHETIC" label.
  • Establish a synthetic data registry: document all agents authorized to generate synthetic data, the types they may generate, and the data categories they may not synthesize.
  • Prohibit any synthetic data from being used in regulatory reporting, capital calculations, or system-of-record data feeds without explicit governance review and approval.
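One way to picture the provenance marker from the design-time controls is a signed record. This is a minimal sketch under stated assumptions: it uses an HMAC-SHA256 tag from the Python standard library as a stand-in for whatever signing mechanism Component 2 (Cryptographic Identity) actually specifies, and the field names, `attach_provenance`, `verify_provenance`, and the hard-coded demo key are all hypothetical.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

# Assumption: a real deployment would fetch the key from a key management service.
SIGNING_KEY = b"demo-key-not-for-production"


def _canonical(record: dict) -> bytes:
    """Serialize deterministically so the same record always yields the same tag."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def attach_provenance(value: float, agent_id: str, methodology: str,
                      key: bytes = SIGNING_KEY) -> dict:
    """Wrap a synthetic value with a label, timestamp, methodology hash, and HMAC tag."""
    record = {
        "value": value,
        "data_type": "SYNTHETIC",
        "agent_id": agent_id,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "methodology_hash": hashlib.sha256(methodology.encode("utf-8")).hexdigest(),
    }
    tag = hmac.new(key, _canonical(record), hashlib.sha256).hexdigest()
    return {**record, "signature": tag}


def verify_provenance(signed: dict, key: bytes = SIGNING_KEY) -> bool:
    """Recompute the tag over every field except the signature and compare."""
    record = {k: v for k, v in signed.items() if k != "signature"}
    expected = hmac.new(key, _canonical(record), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed.get("signature", ""))


signed = attach_provenance(4.21, "agent-7", "fx-imputation-v1")
relabeled = {**signed, "data_type": "SYSTEM_OF_RECORD"}  # tampering attempt
```

Here `verify_provenance(signed)` succeeds while `verify_provenance(relabeled)` fails, so a stripped or altered "SYNTHETIC" label is detectable by anyone holding the key. An HMAC is a shared-key tag rather than a public-key signature; an asymmetric scheme would be needed if verifiers must not be able to forge markers.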

Runtime Controls

  • Implement immutable labeling for synthetic data: every synthetic record includes a "data type: SYNTHETIC" field that cannot be modified or removed. Propagate this label through all downstream systems.
  • Require data lineage tracking for all synthetic data: log the agent that generated it, the system-of-record sources used (if any), the methodology, and the timestamp.
  • Implement synthetic data watermarking: embed cryptographic watermarks in synthetic data that identify it as synthetic even if metadata is stripped. Use Component 2 (Cryptographic Identity).
  • Use Component 10 (Kill Switch) to automatically halt any agent whose synthetic data is accessed by regulatory reporting systems or capital model systems.
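The labeling and kill-switch controls above could be combined on the read path used by regulatory reporting. The sketch below is illustrative only: `Record`, `KillSwitch`, and `read_for_regulatory_report` are hypothetical names, and the in-memory `KillSwitch` merely stands in for Component 10, whose real interface is not described here.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set


class SyntheticDataAccessError(Exception):
    """Raised when a regulatory read path encounters non-system-of-record data."""


@dataclass(frozen=True)  # frozen: the label cannot be reassigned on an existing object
class Record:
    value: float
    data_type: str                  # "SYNTHETIC" or "SYSTEM_OF_RECORD"
    agent_id: Optional[str] = None  # generating agent, if any


@dataclass
class KillSwitch:
    halted: Set[str] = field(default_factory=set)

    def halt(self, agent_id: str) -> None:
        self.halted.add(agent_id)


def read_for_regulatory_report(records: Iterable[Record],
                               kill_switch: KillSwitch) -> List[Record]:
    """Return the records only if every one is explicitly system-of-record."""
    records = list(records)
    # Fail closed: anything not explicitly labeled SYSTEM_OF_RECORD is blocked.
    blocked = [r for r in records if r.data_type != "SYSTEM_OF_RECORD"]
    if blocked:
        for agent_id in {r.agent_id for r in blocked if r.agent_id}:
            kill_switch.halt(agent_id)
        raise SyntheticDataAccessError(
            f"{len(blocked)} non-system-of-record record(s) blocked"
        )
    return records
```

A frozen dataclass only prevents reassignment inside one Python process; immutability in a warehouse would need storage-level enforcement such as append-only tables or write-once columns.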

Detection & Response

  • Conduct quarterly synthetic data audits: query data warehouses to identify all synthetic data, verify it is clearly labeled and segregated, and verify it is not being used in regulatory reporting or capital calculations.
  • Implement synthetic data provenance validation: sample synthetic data, reconstruct its generation lineage, verify provenance metadata is complete and accurate.
  • Monitor data lineage for regulatory models: for every model used in capital calculations or regulatory reporting, audit the data sources. Detect any synthetic data sources.
  • Establish synthetic data incident response: if synthetic data provenance loss is discovered or synthetic data is found in regulatory reports, immediately audit the extent, restate affected reports, and notify regulators.
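The quarterly audit described above might start from two queries: one for rows with a missing or unrecognized label, and one for synthetic rows that reached a regulatory feed. The following sketch runs against an in-memory SQLite database; the `risk_prices` table, its columns, the `used_in_reg_report` flag, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE risk_prices (
    id INTEGER PRIMARY KEY,
    pair TEXT,
    price REAL,
    data_type TEXT,              -- expected: 'SYNTHETIC' or 'SYSTEM_OF_RECORD'
    agent_id TEXT,
    used_in_reg_report INTEGER   -- 1 if the row fed a regulatory report
);
INSERT INTO risk_prices VALUES
    (1, 'USD/XXX', 10.10, 'SYSTEM_OF_RECORD', NULL,      1),
    (2, 'USD/XXX', 10.12, 'SYNTHETIC',        'agent-7', 0),
    (3, 'USD/XXX', 10.15, 'SYNTHETIC',        'agent-7', 1),
    (4, 'USD/XXX', 10.11, NULL,               NULL,      1);
""")


def audit(connection: sqlite3.Connection) -> dict:
    """Return ids of unlabeled rows and of synthetic rows used in regulatory reports."""
    unlabeled = connection.execute(
        "SELECT id FROM risk_prices "
        "WHERE data_type IS NULL "
        "OR data_type NOT IN ('SYNTHETIC', 'SYSTEM_OF_RECORD') "
        "ORDER BY id"
    ).fetchall()
    leaked = connection.execute(
        "SELECT id FROM risk_prices "
        "WHERE data_type = 'SYNTHETIC' AND used_in_reg_report = 1 "
        "ORDER BY id"
    ).fetchall()
    return {
        "unlabeled": [row[0] for row in unlabeled],
        "synthetic_in_reg_report": [row[0] for row in leaked],
    }


findings = audit(conn)
```

With the sample rows, row 4 is reported as unlabeled and row 3 as synthetic data in a regulatory report. The explicit `IS NULL` check matters because `NOT IN` alone does not match NULL labels in SQL.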
