R-FM-02 Foundation Model & LLM DAMAGE 4.1 / Critical

Model Provider Dependency and Concentration Risk

All agents fail simultaneously if a single provider has an outage. DORA concentration risk requirements apply to critical AI service providers.

The Risk

Most institutions using agentic AI rely on a small number of large model providers (OpenAI, Anthropic, Google, Meta, etc.). Many have a strong preference for a single provider for cost, performance, integration, or contractual reasons. This creates concentration risk: if that provider experiences an outage or failure, all agents relying on it fail simultaneously.

This differs from traditional software concentration risk. Traditional systems are usually built with fallback: if a database server fails, traffic routes to a replica; if an API fails, traffic routes to a backup. A hosted large language model API, by contrast, is a single point of failure unless the institution builds an alternative itself. There is no automatic fallback and no instant replica: if the OpenAI API is unavailable, every agent using it fails. Nor can the institution migrate to a different provider on the spot, because switching models requires prompt retuning, changes to inference patterns, and testing.

The concentration risk is amplified by provider outage severity. Large model providers have experienced multi-hour outages affecting thousands of organizations. During such outages, all dependent agents fail. An institution relying on a single provider for critical functions (fraud detection, credit decision support, AML analysis) experiences critical system failure when the provider has an outage.
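
The effect of a fallback on expected downtime can be illustrated with simple arithmetic. The sketch below is a rough illustration only: the availability figures (99.9% and 99.5%) are hypothetical, and the calculation assumes provider outages are statistically independent, which is optimistic when providers share cloud regions or upstream dependencies. It also ignores the time needed to switch over.

```python
HOURS_PER_YEAR = 24 * 365


def expected_downtime_hours(availabilities):
    """Expected hours per year in which every listed provider is down at once.

    Assumes independent outages (an illustrative simplification).
    """
    unavailable = 1.0
    for availability in availabilities:
        unavailable *= 1.0 - availability
    return unavailable * HOURS_PER_YEAR


# Hypothetical availability figures, not measured values for any provider.
single = expected_downtime_hours([0.999])                # primary only
with_fallback = expected_downtime_hours([0.999, 0.995])  # primary plus fallback

print(f"single provider: {single:.2f} hours/year")
print(f"with fallback:   {with_fallback:.4f} hours/year")
```

Under these assumptions, the single-provider setup is expected to be down for almost nine hours a year, while the setup with an independent fallback is down for under three minutes. Correlated failures would narrow that gap.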

How It Materializes

A bank deploys agents for multiple critical functions: credit underwriting, fraud detection, AML transaction monitoring, and customer service. All agents use the GPT-4 API from OpenAI. The bank standardized on OpenAI because of cost efficiency, performance, and internal familiarity, and it does not maintain alternative models or providers.

OpenAI experiences a severe outage (e.g., a region goes down, causing cascading failures). All APIs are unavailable for 4 hours. All bank agents fail simultaneously. The bank's credit underwriting system cannot process new applications. Fraud detection agents are offline; suspicious transactions are not detected during the outage. AML monitoring agents are offline; SAR generation is paused. Customer service agents cannot respond to customer inquiries.

The bank's risk management team discovers the outage is a provider issue, not a local system issue. Recovery depends on OpenAI restoring service. The bank can do nothing but wait. After 4 hours, service is restored. The bank's agents come back online. But during the 4-hour period, the institution was unable to perform critical functions.

A regulator later reviews the bank's operational resilience controls and discovers the single-provider dependency. The regulator issues a finding that the bank has inadequate concentration management. The regulator requires the bank to implement a multi-provider strategy or maintain fallback models. The bank must now invest in alternative model integration and testing, which is expensive.

DAMAGE Score Breakdown

Dimension | Score | Rationale
D - Detectability | 1 | Concentration risk is detectable through design review. Not hidden; clearly visible in architecture.
A - Autonomy Sensitivity | 1 | Not related to autonomy; structural to model provider choice.
M - Multiplicative Potential | 5 | Single provider outage affects all agents simultaneously. Scope is maximum.
A - Attack Surface | 2 | Not weaponizable by external actors directly; provider outages are not typically caused by attacks (though possible).
G - Governance Gap | 5 | DORA and operational resilience frameworks explicitly require concentration risk management. Current practice violates these requirements.
E - Enterprise Impact | 5 | Critical system outage affecting multiple business functions simultaneously. Operational impact is severe.
Composite DAMAGE Score | 4.1 | Critical. Requires immediate architectural controls. Cannot be accepted.

Agent Impact Profile

How severity changes across the agent architecture spectrum.

Agent Type | Impact | How This Risk Manifests
Digital Assistant | High | User cannot use assistant during provider outage.
Digital Apprentice | High | Agent is unavailable during provider outage.
Autonomous Agent | Critical | Fully autonomous agent fails, causing systemic impact on dependent processes.
Delegating Agent | Critical | Agent cannot delegate to provider model during outage. Entire delegation pipeline fails.
Agent Crew / Pipeline | Critical | All agents in crew fail simultaneously. Entire pipeline unavailable.
Agent Mesh / Swarm | Critical | Entire mesh fails simultaneously. Systemic outage.

Regulatory Framework Mapping

Framework | Coverage | Citation | What It Addresses | What It Misses
DORA | Addressed | Article 6, Article 15 | Explicitly requires management of concentration risk with critical service providers. Requires alternatives or fallback strategies. | Does not specifically mention AI/LLM providers yet.
Basel III | Partial | Third-Party Risk Principle | Addresses third-party concentration. | Does not specifically address LLM provider concentration.
MAS AIRG | Partial | Section 4 (Third-Party Risk) | Requires management of AI vendor concentration. | Does not specify technical requirements for multi-provider strategies.
NIST AI RMF 1.0 | Partial | GOVERN 2.3 | Recommends third-party management. | Does not specifically address provider concentration or fallback strategies.
SOX 404 | Partial | IT Controls | Addresses critical system controls. | Does not address provider concentration.

Why This Matters in Regulated Industries

Operational resilience is a fundamental requirement in financial services. Regulators expect institutions to remain operational even when third-party providers fail. An institution that cannot process credit applications, detect fraud, or perform AML monitoring during a provider outage is not operationally resilient. Regulators will issue findings and require remediation.

Additionally, concentration risk is a prudential concern. If all major banks rely on the same AI provider, and that provider fails, the entire financial system could be disrupted. Regulators increasingly view provider concentration as a systemic risk and are mandating multi-provider or fallback strategies.

Controls & Mitigations

Design-Time Controls

  • Implement a multi-provider strategy: deploy critical agents on multiple model providers (e.g., OpenAI, Anthropic, Google). Test agents against all providers. Maintain interchangeability.
  • Maintain on-premises or self-hosted models as fallback: for critical functions, deploy smaller, fine-tuned models that can be run on the institution's own infrastructure.
  • Document model fallback strategy in Component 1 (Agent Registry): specify for each critical agent which fallback model is available and how fallback is triggered.
  • Establish failover procedures: document how to switch from primary provider to fallback model, how to validate fallback outputs, and what SLAs apply during fallback.
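
The registry and multi-provider controls above can be checked mechanically at design review. The following is a minimal sketch, assuming a hypothetical registry record: the `AgentRegistryEntry` fields, the `provider:model` reference format, and the sample agents are illustrative and not taken from any specific Agent Registry schema. It reports each primary provider's share of agents and lists critical agents that have no fallback on a different provider.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentRegistryEntry:
    """Hypothetical registry record; field names are illustrative."""
    agent: str
    critical: bool
    primary: str                        # "provider:model" reference (assumed format)
    fallbacks: List[str] = field(default_factory=list)
    failover_trigger: str = ""          # how fallback is triggered


def provider_of(model_ref):
    """Extract the provider part of a 'provider:model' reference."""
    return model_ref.split(":", 1)[0]


def concentration_report(entries):
    """Return (share of agents per primary provider, critical agents whose
    fallbacks, if any, all sit on the same provider as the primary)."""
    if not entries:
        return {}, []
    counts = Counter(provider_of(e.primary) for e in entries)
    shares = {p: n / len(entries) for p, n in counts.items()}
    gaps = [
        e.agent
        for e in entries
        if e.critical
        and not any(provider_of(f) != provider_of(e.primary) for f in e.fallbacks)
    ]
    return shares, gaps


registry = [
    AgentRegistryEntry("credit-underwriting", True, "openai:gpt-4",
                       ["self-hosted:small-finetuned"], "error rate or latency threshold"),
    AgentRegistryEntry("fraud-detection", True, "openai:gpt-4"),
    AgentRegistryEntry("aml-monitoring", True, "openai:gpt-4", ["openai:other-model"]),
    AgentRegistryEntry("customer-service", False, "openai:gpt-4"),
]
shares, gaps = concentration_report(registry)
print(shares)  # share of agents per primary provider
print(gaps)    # critical agents without a cross-provider fallback
```

Note that a fallback on the same provider as the primary is treated as a gap, since it would fail in the same outage.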

Runtime Controls

  • Implement provider health monitoring: continuously monitor primary provider API health, error rates, and latency. Detect degradation or outages.
  • Configure automatic failover: when primary provider fails or latency exceeds threshold, automatically route agent requests to fallback model.
  • Maintain provider status dashboards: display real-time status of all primary and fallback models. Alert teams if multiple providers are degraded simultaneously.
  • Use Component 10 (Kill Switch) to gracefully degrade service during provider outage: if fallback model is unavailable, halt new agent requests rather than queueing indefinitely.
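
The runtime controls above can be combined into a small routing layer. The sketch below is one possible shape, not a reference implementation: `CircuitBreaker`, `FailoverRouter`, and `NoProviderAvailable` are hypothetical names, the thresholds are placeholders, and the provider callables stand in for real client wrappers. It tracks failures and slow responses per provider, skips a provider whose breaker is open, and halts with an error when no provider is available rather than queueing indefinitely.

```python
import time


class NoProviderAvailable(RuntimeError):
    """Raised to halt new requests instead of queueing them indefinitely."""


class CircuitBreaker:
    """Opens after repeated failures or slow calls; allows a probe after a cooldown."""

    def __init__(self, max_failures=3, cooldown_s=30.0, max_latency_s=5.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.max_latency_s = max_latency_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allows_traffic(self):
        if self.opened_at is None:
            return True
        # After the cooldown, allow a probe; a failed probe reopens the breaker.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, ok, latency_s):
        if ok and latency_s <= self.max_latency_s:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()


class FailoverRouter:
    """Tries providers in order (primary first), skipping any with an open breaker."""

    def __init__(self, providers, clock=time.monotonic):
        # providers: ordered list of (name, callable(prompt) -> str)
        self.providers = providers
        self.clock = clock
        self.breakers = {name: CircuitBreaker(clock=clock) for name, _ in providers}

    def complete(self, prompt):
        for name, call in self.providers:
            breaker = self.breakers[name]
            if not breaker.allows_traffic():
                continue
            start = self.clock()
            try:
                result = call(prompt)
            except Exception:
                breaker.record(False, self.clock() - start)
                continue
            # A slow success is still returned but counts against the provider.
            breaker.record(True, self.clock() - start)
            return name, result
        raise NoProviderAvailable("all providers unavailable; halting new requests")


# Stub providers for illustration only.
def primary(prompt):
    raise TimeoutError("simulated provider outage")


def fallback(prompt):
    return f"fallback answer to: {prompt}"


router = FailoverRouter([("primary", primary), ("fallback", fallback)])
served_by, text = router.complete("score this transaction")
print(served_by, "->", text)
```

In this design the caller decides what to do with `NoProviderAvailable`, for example invoking the kill switch or returning a degraded-service message.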

Detection & Response

  • Conduct quarterly provider outage drills: intentionally disable access to primary provider, verify fallback is activated automatically, verify agents produce acceptable outputs on fallback models.
  • Monitor fallback performance: track quality and latency of outputs from fallback models. Identify performance gaps between primary and fallback.
  • Maintain provider outage runbook: document procedures for responding to provider outages, switching to fallback, communicating status to stakeholders.
  • Establish incident response for provider outages: when outages occur, activate runbook procedures, monitor impact on business functions, track recovery time.
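
An outage drill of the kind described above can be partly automated. The following is a minimal sketch under stated assumptions: `route` is a simplified stand-in for the institution's real routing layer, `run_outage_drill` and its report keys are hypothetical, and `is_acceptable` represents whatever output validation the institution defines. The drill replaces the primary provider with a stub that always fails, then measures how often the fallback served the request and how often its output was acceptable.

```python
def route(providers, prompt):
    """Minimal ordered failover used here as the system under test."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception:
            continue
    return None, None


def run_outage_drill(providers, prompts, is_acceptable, router=route):
    """Disable the first (primary) provider and report how the fallback path behaves.

    providers: ordered list of (name, callable(prompt) -> str), primary first.
    is_acceptable: callable(prompt, output) -> bool, supplied by the institution.
    """
    def disabled(prompt):
        raise ConnectionError("drill: primary provider access disabled")

    primary_name = providers[0][0]
    drill_providers = [(primary_name, disabled)] + list(providers[1:])

    served = acceptable = 0
    for prompt in prompts:
        name, output = router(drill_providers, prompt)
        if name is not None and name != primary_name:
            served += 1
            if is_acceptable(prompt, output):
                acceptable += 1

    total = len(prompts)
    return {
        "failover_rate": served / total if total else 0.0,
        "acceptable_rate": acceptable / total if total else 0.0,
        "passed": total > 0 and served == total and acceptable == total,
    }


# Stub providers and validator for illustration only.
providers = [
    ("primary", lambda p: f"primary: {p}"),
    ("fallback", lambda p: f"fallback: {p}"),
]
report = run_outage_drill(
    providers,
    prompts=["summarize alert 1", "summarize alert 2"],
    is_acceptable=lambda prompt, output: prompt in output,
)
print(report)
```

A real drill would also record latency on the fallback path and compare output quality against the primary, as the monitoring bullet above describes.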
