R-MP-05 Model & Pipeline Interaction DAMAGE 3.2 / High

Inference Pipeline Disruption

Agent query patterns (bursts, recursive calls, parallel invocations) overwhelm inference infrastructure calibrated for human-scale patterns.

The Risk

ML inference infrastructure (GPUs, model serving systems, feature store APIs) is dimensioned for expected query patterns. Humans interacting with systems generate queries at bounded, predictable rates: a human credit underwriter might score 20 loan applications per hour, with modest variation. The infrastructure is sized to handle peak human demand.

Agents, however, can generate query patterns that differ sharply from human ones. An agent might score 1,000 loan applications per hour (50x the human rate). It might invoke the model in parallel, submitting 100 requests simultaneously instead of sequentially. It might query variations of the same request (score with feature X=100, X=101, X=102, and so on) to probe the model's sensitivity, generating a burst of near-identical queries.

When agent query patterns overwhelm inference infrastructure, it becomes unresponsive, and other systems (humans, other agents) that depend on model inference are starved of resources. This is distinct from resource exhaustion in transaction systems; here, the bottleneck is model inference capacity.

How It Materializes

An insurance company's underwriting system includes a neural network model that predicts claim risk based on claim details. The model runs on a GPU inference cluster that is dimensioned to serve approximately 500 insurance adjusters, each submitting 10 claim risk scores per hour (5,000 total requests per hour).

The company deploys an agentic claims triage system. The agent is designed to process a claim in 10 seconds, which includes: retrieving claim details, scoring the claim with the risk model, and making a recommendation. The agent is deployed to process the daily backlog of 10,000 unreviewed claims.

The agent begins processing claims from the queue. Each claim requires a single inference request to the risk model. If the agent processes all 10,000 claims in an 8-hour shift, that is 1,250 requests per hour, about 25% of the designed capacity. This should be manageable.

However, the agent also includes logic to estimate model uncertainty. For each claim, it submits 5 additional scoring requests with perturbed inputs (feature values shifted by +/- 5%) to gauge the model's confidence in its score. The agent is now generating 60,000 requests over 8 hours (7,500 per hour), or 150% of the designed capacity.

Additionally, the agent's queue-processing logic is parallelized: it processes 100 claims in parallel. Each parallel invocation generates 6 scoring requests (1 baseline + 5 perturbed). This creates a burst of 600 concurrent requests to the inference system every 10 seconds.
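
The arithmetic above can be sketched as a quick capacity check. All figures come from the scenario as narrated (500 adjusters at 10 scores per hour, a 10,000-claim backlog, an 8-hour shift); the script is only a back-of-envelope illustration, not a measurement.

```python
# Back-of-envelope check: designed capacity versus the load the
# agent actually generates. All numbers are from the scenario text.
designed_capacity_per_hour = 500 * 10        # 500 adjusters x 10 scores/hour
claims, shift_hours = 10_000, 8

baseline_per_hour = claims / shift_hours     # one request per claim
print(baseline_per_hour / designed_capacity_per_hour)   # 0.25 -> manageable

requests_per_claim = 1 + 5                   # baseline + 5 perturbed inputs
perturbed_per_hour = claims * requests_per_claim / shift_hours
print(perturbed_per_hour / designed_capacity_per_hour)  # 1.5 -> oversubscribed

parallel_claims = 100
burst = parallel_claims * requests_per_claim # 600 concurrent requests / ~10 s
```

The lesson is that each seemingly reasonable design choice (uncertainty probing, parallelism) multiplies the load; the product of the factors, not any single one, is what breaks the infrastructure.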

The inference GPU cluster becomes oversubscribed. Individual inference requests start queuing. Requests that were expected to return in 100ms now take 10 seconds. The human adjusters, who rely on the same inference system, experience slow response times. When a human adjuster clicks "score this claim," they wait 30 seconds for the response instead of the usual 1 second. Productivity plummets.

Meanwhile, the model serving system's memory footprint grows as queued requests accumulate. The system runs out of memory and crashes, taking the entire inference service offline. All claims, both agent-processed and human-processed, go unscored.

Under regulatory reporting requirements, the company must report all claims decisions within defined timeframes. The inference outage delays claims decisions by several hours, policyholders experience delayed payouts, and the company may be cited for failing to process claims in a timely manner.

DAMAGE Score Breakdown

Dimension | Score | Rationale
D - Detectability | 3 | High inference load is detectable through infrastructure monitoring, but distinguishing agent load from legitimate human load requires understanding query patterns.
A - Autonomy Sensitivity | 4 | The risk manifests in agents that can generate high-volume or high-concurrency queries.
M - Multiplicative Potential | 4 | A single agent generating excessive load can disrupt inference for all other systems.
A - Attack Surface | 4 | Any agent that submits queries to inference infrastructure is exposed, especially if it is designed for high throughput or parallel processing.
G - Governance Gap | 4 | Infrastructure capacity planning typically does not account for agent-specific query patterns. Agent governance does not mandate that agents respect shared resource limits.
E - Enterprise Impact | 4 | Inference disruption affects all systems that depend on model inference, including business-critical decision systems. Service degradation and outages are likely.
Composite DAMAGE Score | 3.2 | High. Requires per-agent inference quotas and load testing before deployment.

Agent Impact Profile

How severity changes across the agent architecture spectrum.

Agent Type | Impact | How This Risk Manifests
Digital Assistant | Low | Human paces queries; human-scale load.
Digital Apprentice | Medium | Limited autonomy; query rate is bounded.
Autonomous Agent | High | Autonomous query generation; burst risk.
Delegating Agent | High | Function calling enables high-concurrency inference requests.
Agent Crew / Pipeline | Critical | Multiple agents in parallel, each generating inference load.
Agent Mesh / Swarm | Critical | Peer-to-peer inference consumption.

Regulatory Framework Mapping

Framework | Coverage | What It Addresses | What It Misses
DORA Article 17 | Partial | Operational resilience and system capacity; capacity planning and peak load handling. | Agent-specific query patterns and their capacity impact.
FFIEC Business Continuity | Partial | System availability and capacity planning. | Agent-induced capacity constraints.
Insurance Regulations | Partial | Claims processing timeliness (varies by jurisdiction). | Inference disruption affecting claims processing speed.
GLBA Section 501 | Partial | Safeguards including system security and availability. | Inference disruption and its availability impact.

Why This Matters in Regulated Industries

Operational continuity is a regulatory requirement in financial services and insurance. Regulators expect institutions to maintain service availability and to process claims/transactions within defined timeframes. When inference infrastructure becomes unavailable or degraded due to agent-induced load, the institution fails to meet its operational requirements.

Additionally, inference disruption often affects multiple business functions simultaneously. If the inference system serves both credit decisions and fraud detection, an inference disruption affects both. Regulators investigating such incidents will assess whether the institution had proper capacity planning and load management controls.

Controls & Mitigations

Design-Time Controls

  • Before deploying an agent that will generate inference load at scale, conduct load testing. Simulate the agent's query patterns on the inference infrastructure and measure the impact. If the agent would consume more than 30% of capacity, require explicit approval.
  • Implement per-agent inference request quotas: each agent has a maximum number of inference requests per unit time. Requests exceeding the quota are queued with exponential backoff or rejected.
  • For agents that need to understand model uncertainty through perturbation, use a separate, smaller inference model trained to predict uncertainty rather than generating 5+ model invocations per request.
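
The per-agent quota control above can be sketched as a token bucket. Everything here (class name, rates, the in-process enforcement) is illustrative; a real deployment would enforce the quota at the inference API gateway rather than inside agent code.

```python
import threading
import time

class InferenceQuota:
    """Per-agent token bucket: at most `rate` requests/second, with
    bursts capped at `burst`. A minimal sketch of the quota control."""

    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens = float(burst)          # start with a full bucket
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # refill tokens in proportion to elapsed time, up to the cap
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False   # caller queues with backoff or rejects

quota = InferenceQuota(rate=2.0, burst=5)
allowed = sum(quota.try_acquire() for _ in range(20))
# only the first `burst` requests pass immediately; the rest are throttled
```

Requests denied by `try_acquire` are the ones the control says to queue with exponential backoff or reject outright, rather than letting them reach the GPU cluster.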

Runtime Controls

  • Deploy a query rate limiter at the inference API boundary. Each agent is rate-limited to a maximum number of concurrent requests and a maximum number of requests per second.
  • Implement dynamic scaling: when inference load from an agent exceeds expected levels, automatically slow down the agent (add delay between requests) or reduce the parallelism. This prevents sudden spikes from overwhelming the infrastructure.
  • Monitor the inference queue depth and latency. If queue depth exceeds a threshold, automatically identify the agent(s) generating the excess load and backpressure them.
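
The queue-depth backpressure step might look like the following sketch. The source-naming convention (an `agent:` prefix distinguishing agents from humans) and the depth threshold are assumptions for illustration, not part of any particular serving stack.

```python
QUEUE_DEPTH_LIMIT = 200   # hypothetical threshold for a healthy queue

def backpressure_targets(in_flight_by_source: dict[str, int],
                         queue_depth: int) -> list[str]:
    """If the inference queue is too deep, return the agent sources
    generating the most in-flight load, heaviest first. Human sources
    are never throttled; agents absorb the backpressure."""
    if queue_depth <= QUEUE_DEPTH_LIMIT:
        return []
    agents = {s: n for s, n in in_flight_by_source.items()
              if s.startswith("agent:")}
    return sorted(agents, key=agents.get, reverse=True)

targets = backpressure_targets(
    {"human:adjusters": 40, "agent:claims-triage": 600, "agent:fraud": 30},
    queue_depth=670,
)
# -> ["agent:claims-triage", "agent:fraud"]
```

The design choice worth noting is asymmetry: when capacity is contested, agents are slowed before humans, because agents can tolerate added latency while human adjusters cannot.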

Detection & Response

  • Maintain an inference load dashboard that shows load by source (human, agent, system). Escalate if an agent's load exceeds expected levels or if the agent's load is causing degradation for other systems.
  • When inference latency increases, investigate the load patterns to identify if an agent is generating unusual query patterns (high concurrency, high frequency, or unusual parameter variations).
  • Establish an inference disruption incident response procedure: if inference becomes unavailable, identify the root cause, stop the offending agent immediately, and roll back recent deployment changes.
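
A minimal version of the load-by-source escalation check from the dashboard bullet above, assuming per-source request rates are already being collected; the source names and the 2x threshold are illustrative.

```python
def load_alerts(observed_rph: dict[str, float],
                expected_rph: dict[str, float],
                factor: float = 2.0) -> list[str]:
    """Flag any source whose observed requests/hour exceeds its
    expected baseline by `factor`. Sources with no baseline at all
    are always flagged (rate > factor * 0)."""
    return [src for src, rate in observed_rph.items()
            if rate > factor * expected_rph.get(src, 0.0)]

alerts = load_alerts(
    observed_rph={"human": 4_800, "agent:claims-triage": 7_500},
    expected_rph={"human": 5_000, "agent:claims-triage": 1_250},
)
# -> ["agent:claims-triage"]
```

In the insurance scenario, this check fires hours before the outage: the agent's 7,500 requests per hour is six times its expected 1,250, while human load stays within baseline.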


Address This Risk in Your Institution

Inference Pipeline Disruption requires per-agent quotas, load testing, and capacity planning that accounts for agent query patterns. Our advisory engagements are purpose-built for banks, insurers, and financial institutions subject to prudential oversight.
