R-OR-06 dns Operational Resilience DAMAGE 4.2 / Critical

Cascading Infrastructure Failure

Agent delegation creates undocumented dependency chains; failure cascades through systems with no documented dependency relationship; blast radius exceeds disaster recovery planning.

The Risk

In traditional infrastructure, dependencies are documented and modeled. System A depends on System B; System B depends on Database C. Disaster recovery planning is based on this dependency map. When agents dynamically invoke APIs and delegate to other systems through function calling or message queues, they create dependencies that may not be explicitly documented in the infrastructure's dependency map.

When one system in the chain fails, the agent may be unaware of the failure (if it is not immediately returned as an error code). The agent continues operating, queuing requests that accumulate. Dependent systems waiting for responses stall. The failure cascades beyond what the infrastructure team anticipated. Disaster recovery planning, which is based on documented dependencies, does not account for the undocumented chains created by agents. The organization discovers, during a failure event, that the actual blast radius is much larger than the documented blast radius.

How It Materializes

A major insurance company operates a claims processing platform built on microservices. The infrastructure team has documented these dependencies and created a disaster recovery plan. The company deploys an agentic workflow manager to accelerate claims throughput. The agent is authorized to invoke the claims entry API, then the claims validation API, then conditionally the medical review API, then the payment API. Additionally, the agent invokes a third-party data enrichment API and writes decision logs to an external compliance archive.

One morning, the third-party data enrichment API experiences an outage. The agent retries 3 times, consuming 15 seconds. After the third retry, the agent proceeds to the validation step, which requires enrichment data. The validation fails. The agent retries the entire sequence, creating a loop. Meanwhile, the agent's decision log writes are queuing up to the compliance archive, which receives thousands of logs per minute and becomes unresponsive.

The archive failure cascades to the customer communication service, which depends on the archive for audit trail verification. The communication service stops sending notifications. The infrastructure team's disaster recovery plan does not mention the enrichment API or the communication service as dependencies of the claims platform. The incident requires emergency escalation to identify and stop the agent, rebuild the compliance archive, and manually notify customers. The incident duration is 2 hours, far exceeding the organization's RTO of 15 minutes.

DAMAGE Score Breakdown

Dimension	Score	Rationale
D - Detectability	5	Cascading failures are difficult to detect because the root cause (agent-induced load on an external system) is indirect and not monitored. Standard infrastructure monitoring may not correlate the agent's actions with the cascade.
A - Autonomy Sensitivity	5	The risk manifests in autonomous agents that create dynamic dependencies through function calling or delegation.
M - Multiplicative Potential	5	A single agent error (retrying a degraded API) can cascade across multiple systems exponentially.
A - Attack Surface	5	Any agent with access to multiple APIs or systems can create the undocumented dependency chain.
G - Governance Gap	5	Standard infrastructure dependency mapping does not account for agent-dynamic dependencies. DR planning is based on static dependencies and does not account for agent-induced chains.
E - Enterprise Impact	5	Cascading failure can affect all customer-facing services and require extended outages to remediate.
Composite DAMAGE Score	4.2	Critical. Requires immediate architectural controls. Cannot be accepted.

Agent Impact Profile

How severity changes across the agent architecture spectrum.

Agent Type	Impact	How This Risk Manifests
manage_accounts Digital Assistant	Low	Human operator sequences tool invocations and may avoid cascade-prone patterns.
school Digital Apprentice	Medium	Limited autonomy; cascade chains are shorter.
smart_toy Autonomous Agent	High	Autonomous tool invocation creates dynamic dependency chains.
share Delegating Agent	Critical	Function calling and dynamic API invocation are core to the design; cascades are likely.
groups Agent Crew / Pipeline	Critical	Multiple agents in sequence can create compound dependency chains.
account_tree Agent Mesh / Swarm	Critical	Peer-to-peer delegation creates unpredictable dependency topologies.

Regulatory Framework Mapping

Framework	Coverage	Citation	What It Addresses	What It Misses
DORA Article 17	Relevant	Operational Resilience	Scenario analysis; testing of cascading failures.	Agent-induced cascades distinct from infrastructure failures.
FFIEC Business Continuity	Relevant	Dependency Mapping	Dependency mapping; recovery planning.	Agent-dynamic dependencies not captured in static maps.
MAS TRM Guidelines	Relevant	Technology Risk Management	Resilience testing; failure scenario analysis.	Agent-centric cascading failures.
GLBA Section 501	Relevant	Safeguarding Rule	Security controls; operational continuity.	Agent-induced operational failures.
ISO 42001	Minimal	Section 8.1	AI system governance; planning.	Agent-induced dependency chains and cascading failures.
NIST CSF 2.0	Partial	Govern and Protect Functions	Dependency identification; failure recovery.	Agent-dynamic dependency chains.

Why This Matters in Regulated Industries

Regulators expect institutions to understand their operational infrastructure and to plan for failure scenarios. When a cascading failure occurs that was not anticipated in the DR plan, regulators ask whether the organization understood the true dependencies in its systems. If an agent created an undocumented dependency chain that the organization did not know about, regulators cite this as a governance failure.

The regulatory consequence is that the organization must revise its dependency mapping, revise its DR plan, and possibly revise its operational risk capital allocation. If the cascade caused customer impact or service disruption beyond the documented RTO, regulators may impose remedial action orders or increased capital requirements.

In financial services, cascading failures can affect settlement, clearing, and payment systems, which are critical to the broader financial system. Regulators (Federal Reserve, OCC) take infrastructure resilience seriously and expect institutions to be able to articulate the recovery scenario for every system in their operational footprint.

Controls & Mitigations

architectureDesign-Time Controls

Before deploying any agent with function calling or multi-API access, conduct a comprehensive dependency analysis. Map all APIs the agent can invoke, trace dependencies, and identify potential cascades. Any cascade scenarios must be explicitly designed out or subject to human confirmation.
Implement a "dependency register" for agent-created dynamic dependencies. Every API and its transitive dependencies must be registered. Before the agent is authorized to call an API, the cascading impact is assessed.
Establish a hard limit on the depth of dependency chains the agent can create. Recommend max depth of 3. Any deeper chains must be explicitly approved by the infrastructure team.

play_circleRuntime Controls

Deploy a cascading failure detector at the infrastructure boundary. Monitor the correlation between agent API invocations and downstream system loads. If a spike follows an agent's actions, trigger an alert.
Implement adaptive rate limiting that responds to downstream system health. If a downstream system's latency exceeds a threshold, automatically reduce the rate of agent requests to the upstream API.
Maintain a "system health" API that agents query before invoking downstream systems. If any dependent system is degraded, the agent is instructed to back off and retry later.

monitoringDetection & Response

Implement infrastructure monitoring that explicitly tracks cascading failures. When System A degrades, immediately check which other systems depend on it. Escalate if the cascade is unexpected.
Maintain an incident log of all cascades detected. Use this to identify gaps in the dependency model and update the DR plan accordingly.
Conduct quarterly "dependency audits" where the infrastructure team validates that all documented dependencies are accurate. Use agent behavior logs to discover undocumented dependencies and update the register.

Related Risks

Address This Risk in Your Institution

Cascading Infrastructure Failure requires architectural controls that go beyond what existing frameworks provide. Our advisory engagements are purpose-built for banks, insurers, and financial institutions subject to prudential oversight.

Schedule a Briefing

Agentic AI Risk & Controls Workshop Our Methodology Regulatory Landscape