R-MP-02 Model & Pipeline Interaction DAMAGE 3.3 / High

Model Version Conflict

Agents invoke "the credit model" without specifying a version; some calls are served by v1, others by v2, so a single decision can mix incompatible model versions.

The Risk

Model versioning is fundamental to model governance. When a model is updated (retrained on new data, extended with new features, or recalibrated), the new version is validated and deployed incrementally, often with A/B testing to confirm that it performs as expected. During the transition period, both versions may be in production (v1 for some customers, v2 for others). Once v2 is validated, v1 is deprecated.

When agents request "the credit model" without specifying a version, the system returns the default version. But if the system is in transition (v2 is being deployed but v1 is still active), different requests at different times may hit v1 or v2. An agent that makes sequential calls to the credit model (score the customer, retrieve historical scores, compare) might get v1 for the first call and v2 for the second call. The agent's logic expects both scores to be comparable (e.g., "if historical average is 650 and current score is 700, there is an improvement"). But if historical scores are from v1 and the current score is from v2, and the models have different score distributions, the comparison is invalid.

Additionally, model version conflicts can occur when agents collaborate with other agents or systems. Agent A scores the customer and passes the score to Agent B, which feeds it to a downstream model trained to expect v1 scores. But the score Agent A obtained is actually from v2 (because the system defaulted to v2 for that call), and the downstream model's performance degrades because it is receiving out-of-distribution inputs.
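The failure mode can be sketched in a few lines of Python. The endpoint stub and score values below are illustrative only (using the 0-100 vs. 0-1000 scales from the fraud scenario in this entry); no real model API is implied.

```python
from itertools import cycle

# Illustrative stub of an unpinned model API during a v1 -> v2 rollout:
# successive calls may be served by different versions. Scales are made up
# (v1 scores 0-100, v2 scores 0-1000) to show why raw comparison fails.
_served_by = cycle(["v1", "v2"])

def call_fraud_model(features: dict) -> dict:
    version = next(_served_by)
    # The same low-risk transaction scores 25 on v1 but 250 on v2.
    score = 25 if version == "v1" else 250
    return {"score": score, "model_version": version}

first = call_fraud_model({"txn_id": "T-1"})
second = call_fraud_model({"txn_id": "T-1"})

# An agent comparing raw values sees a 10x "increase" in risk even though
# both calls describe the same transaction at the same risk level.
print(first, second)
```

The point of the sketch: nothing in the agent's two calls signals that the serving version changed between them, so any downstream comparison silently inherits the mismatch.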

How It Materializes

A major bank's fraud detection system includes a neural network model that identifies suspicious transactions. The model has been in production for 18 months (version 1). The fraud detection team develops version 2, which incorporates a new data source (geolocation enrichment) and uses a different architecture (transformer instead of LSTM). Version 2 has slightly higher AUC on the test set.

The bank begins deploying version 2 in a canary deployment: 5% of traffic gets v2, 95% gets v1. The model API returns v2 for 5% of requests (based on a hash of the transaction ID) and v1 for the rest.
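A deterministic hash-based canary split of this kind can be sketched as follows (the function name and ID format are hypothetical):

```python
import hashlib

# Illustrative canary router: a stable hash of the transaction ID sends
# roughly 5% of traffic to v2 and the remainder to v1. The hash is
# deterministic, so a given transaction always sees the same version, but
# two different transactions in one review session may see different ones.
def route_model_version(transaction_id: str, canary_pct: int = 5) -> str:
    bucket = int(hashlib.sha256(transaction_id.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < canary_pct else "v1"

versions = [route_model_version(f"TXN-{i:06d}") for i in range(10_000)]
print(versions.count("v2"), "of 10000 requests routed to v2")
```

Note that the routing key is the transaction ID, not the customer ID: the same customer's transactions can land on both sides of the split, which is exactly what breaks the agent's historical comparison later in the scenario.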

The bank deploys an agentic fraud review system to expedite the manual review of flagged transactions. The agent is instructed to: (1) score the transaction with the fraud model, (2) retrieve the customer's recent fraud score history from the data warehouse, (3) compare the current score to the historical average, and (4) if the current score is significantly higher than average, recommend escalation to human review.

For most transactions, the agent hits v1 of the fraud model and scores the transaction. But for some transactions (the 5% canary sample), the agent hits v2 and gets a different score. The issue arises in steps 2 and 3: the agent retrieves the customer's fraud score history from the data warehouse. This history contains scores from v1 (the model that was in production for the past 18 months). The agent compares the current score (which may be from v2) to the historical average (which is entirely from v1).

The models have different score distributions: v1 returns fraud scores in the 0-100 range; v2 returns scores in the 0-1000 range (the transformer model uses a different output scaling). A transaction that gets a v2 score of 250 is comparable to a v1 score of 25 (both indicate low fraud risk). But the agent's comparison logic is naive: it compares the numeric values directly. A v2 score of 250 is compared to a v1 average of 40, and the agent concludes that the transaction's fraud risk has increased 6-fold.
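Any version-aware comparison needs an explicit mapping between score scales before the numeric check. The linear 10x mapping below is purely illustrative; a real conversion would require a calibrated (e.g. quantile-based) mapping validated by the model team, and a linear rescale may not exist at all.

```python
# Hypothetical scale map: converts scores to the v1 scale before comparing.
TO_V1_SCALE = {"v1": 1.0, "v2": 0.1}  # illustrative linear mapping only

def normalize_to_v1(score: float, version: str) -> float:
    return score * TO_V1_SCALE[version]

historical_avg_v1 = 40.0                      # 18 months of v1 scores
current_score, current_version = 250.0, "v2"  # served by the canary

naive_ratio = current_score / historical_avg_v1               # 6.25x "jump"
normalized = normalize_to_v1(current_score, current_version)  # ~25 on v1 scale

# On a common scale the apparent 6-fold jump disappears: on the v1 scale
# the current score is actually below the historical average.
assert normalized < historical_avg_v1
```

The guard that matters here is not the arithmetic but the precondition: the agent must know which version produced each score before any comparison is allowed.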

The agent escalates the transaction for manual review. The fraud analyst sees a current score of 250 (from v2) against a historical average of 40 (from v1), agrees with the agent's assessment, and classifies the transaction as high-risk. The bank declines the transaction, blocking the customer from making a legitimate purchase.

Under consumer protection regulations (the Fair Credit Reporting Act, the Electronic Fund Transfer Act), the bank must ensure that automated decisions affecting customers are accurate and fair. A transaction decline based on a model version mismatch is neither accurate nor fair.

DAMAGE Score Breakdown

Dimension Score Rationale
D - Detectability 3 Model version conflicts are detectable by logging the version used for each call, but many systems do not do this. If both versions produce similar distributions, conflicts go undetected.
A - Autonomy Sensitivity 4 The risk manifests in agents that make decisions based on model versions without explicit version awareness.
M - Multiplicative Potential 3 Version conflicts affect individual decisions, not cascades. But each conflicted decision is incorrect.
A - Attack Surface 5 Any agent that calls models without specifying version is exposed. Most agent-model integrations do not mandate version specification.
G - Governance Gap 5 Model governance typically manages versions at the model level. But agent governance does not mandate that agents specify model versions. The agent team and model team do not coordinate on version transitions.
E - Enterprise Impact 4 Model version conflicts result in incorrect decisions that affect customers (transaction declines, credit decisions, etc.). Regulatory and reputational impact.
Composite DAMAGE Score 3.3 High. Requires version pinning controls and model-agent coordination protocols.

Agent Impact Profile

How severity changes across the agent architecture spectrum.

Agent Type Impact How This Risk Manifests
Digital Assistant Low Human review can catch obvious score incompatibilities.
Digital Apprentice Medium Limited model consumption; version conflicts affect a narrow scope.
Autonomous Agent High Autonomous comparison of current and historical scores without version awareness.
Delegating Agent High Multiple API calls to the model without explicit version pinning.
Agent Crew / Pipeline Critical Agents in sequence, each potentially hitting different versions.
Agent Mesh / Swarm Critical Peer-to-peer model consumption across versions.

Regulatory Framework Mapping

Framework Coverage Citation What It Addresses What It Misses
FCRA Partial Accuracy and fairness of automated credit decisions Credit decision accuracy; dispute resolution. Model version mismatch in credit decisions.
EFTA Partial Accuracy of electronic funds transfers and transaction authorization Transaction authorization accuracy; consumer protection. Model version mismatch in fraud detection decisions.
SR 11-7 Partial Model governance, validation, and version control Model versioning; deployment procedures. Agent awareness of model versions.
MAS AIRG Partial Domain 6: Model and Data Risk Management Model governance; versioning. Agent-model version coordination.
GDPR Article 22 Partial Accuracy and fairness of automated decision-making Automated decision accuracy. Model version mismatch and decision inaccuracy.

Why This Matters in Regulated Industries

In financial services, every automated decision creates a record and a potential regulatory exposure. When a decision is based on a model version mismatch (agent compares scores from different models), the decision is potentially inaccurate. Regulators investigating the decision will ask: "Did the institution ensure that all inputs to the decision came from compatible models?" If the answer is no, regulators cite inadequate governance.

Additionally, consumer protection laws like FCRA and EFTA give consumers rights to dispute automated decisions. If a consumer disputes a transaction decline or credit decision based on a model version mismatch, the bank must be able to demonstrate that the decision was based on accurate information. A model version mismatch is not "accurate information."

Controls & Mitigations

Design-Time Controls

  • Implement a model pinning policy: any agent that calls a model must explicitly specify the model version. The agent's code is not allowed to call "the credit model"; it must call "credit_model:v2".
  • Establish a model version compatibility matrix: for each model, document which versions are compatible with which other versions. Validate that the agent's logic only compares compatible versions.
  • Implement a model version transition protocol: when a new version is deployed, establish a transition period with clear rules about which version should be used for new transactions vs. historical comparisons.
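The pinning policy in the first bullet can be enforced mechanically at the client layer. A minimal sketch, assuming a hypothetical in-house model-serving client (the `call_model` function and the `name:vN` convention are illustrative, not a real API):

```python
import re

# Enforce the pinning policy at the client layer: reject any model
# reference that does not carry an explicit version tag.
_PINNED_ID = re.compile(r"^[a-z][a-z0-9_]*:v\d+$")

def call_model(model_id: str, features: dict) -> dict:
    if not _PINNED_ID.match(model_id):
        raise ValueError(
            f"Unpinned model reference {model_id!r}: agents must specify an "
            f"explicit version, e.g. 'credit_model:v2'"
        )
    # ... dispatch to the model-serving layer here (stubbed for the sketch) ...
    return {"model_id": model_id, "score": None}

call_model("credit_model:v2", {"customer_id": "C-123"})  # accepted
# call_model("credit_model", {...})  # rejected: raises ValueError
```

Putting the check in the shared client rather than in each agent means the policy cannot be skipped by an agent that was written before the transition protocol existed.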

Runtime Controls

  • Implement model versioning in the model API response: every response includes the model version used. The agent can then check the version and handle mismatches.
  • Deploy a version mismatch detector: if an agent attempts to compare scores from different model versions, the system detects this and either converts the scores to a common scale or blocks the comparison and escalates to human review.
  • Implement model version monitoring: track which model versions are being used by which agents. If an agent is using a deprecated version, alert the agent governance team and force the agent to upgrade.
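The first two runtime controls above can be combined into a small guard: every score carries the version reported by the model API, and a cross-version comparison is blocked rather than silently computed. Class and exception names below are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VersionedScore:
    value: float
    model_version: str  # taken from the model API response, never assumed

class VersionMismatchError(Exception):
    """Raised when an agent tries to compare scores from different versions."""

def score_delta(current: VersionedScore, baseline: VersionedScore) -> float:
    if current.model_version != baseline.model_version:
        # In production: block the comparison, log it, escalate to review.
        raise VersionMismatchError(
            f"cannot compare {current.model_version} score to "
            f"{baseline.model_version} baseline"
        )
    return current.value - baseline.value

print(score_delta(VersionedScore(70, "v1"), VersionedScore(40, "v1")))  # 30
```

Blocking is the conservative default; converting to a common scale is only safe where a validated cross-version mapping exists (see the compatibility matrix under design-time controls).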

Detection & Response

  • Maintain a model version usage dashboard: show which agents are using which model versions. Identify agents using deprecated versions and force them to upgrade.
  • Log all model version comparisons: when an agent compares scores from different models, log the versions used and the comparison result. Use this to detect invalid comparisons.
  • Implement a decision audit process: when a customer disputes a decision, check whether the decision involved a model version mismatch. If so, reverify the decision using the current production model version.

Address This Risk in Your Institution

Model Version Conflict requires version pinning controls and coordination protocols between model and agent teams. Our advisory engagements are purpose-built for banks, insurers, and financial institutions subject to prudential oversight.

Schedule a Briefing