Small prompt changes cause disproportionate output changes. The input space is too large for exhaustive testing.
Language models are extremely sensitive to prompt formulation. Small changes to prompts produce disproportionate changes in outputs. A prompt asking "What is the risk level?" might produce conservative estimates. The same question phrased "What are the upside opportunities?" might produce optimistic estimates. A request with detailed instructions produces different outputs than the same request with abbreviated instructions.
This prompt sensitivity is a fundamental property of LLMs. The input space (all possible prompts) is effectively infinite. An institution cannot exhaustively test all prompts to understand model behavior. Instead, institutions test a limited set of prompts and assume behavior generalizes to untested prompts. But the assumption is often wrong; untested prompts produce different outputs.
This creates brittleness: the agent's behavior is brittle to prompt changes. Small operational changes (different wording from operators, different examples included, different instruction formatting) cause output changes. The institution cannot predict which prompt changes will cause significant output changes and which will not.
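This sensitivity can at least be measured before deployment. The sketch below is a minimal prompt-perturbation harness: it runs several semantically equivalent prompt variants through a scoring function and reports the spread of outputs. The `score_risk` function is a deterministic stub standing in for a real model call; its biases (toward "conservative" wording, against abbreviated prompts) are illustrative assumptions, not measured behavior.

```python
# Illustrative prompt-perturbation harness. score_risk is a stub;
# a real harness would call the production model here.

def score_risk(prompt: str, customer: str) -> float:
    # Hypothetical stand-in for an LLM call, with hand-coded biases
    # that mimic the kind of drift described in the text.
    base = 4.5
    if "conservative" in prompt.lower():
        base += 1.7   # instruction-driven upward bias
    if len(prompt) < 80:
        base -= 0.7   # abbreviated prompts elicit shallower analysis
    return base

# Semantically equivalent formulations of the same assessment request.
VARIANTS = [
    "Using a scale of 1-10, assess the financial crime risk of this "
    "customer. Consider transaction patterns, counterparty relationships, "
    "and jurisdictional factors. Provide a score and brief explanation.",
    "Using a scale of 1-10, assess the financial crime risk of this "
    "customer. The score should be conservative to avoid missing risk.",
    "Quickly assess financial crime risk 1-10.",
]

def prompt_sensitivity(customer: str) -> float:
    """Spread of scores across equivalent prompt variants: a crude
    brittleness metric (0 would mean fully prompt-stable)."""
    scores = [score_risk(v, customer) for v in VARIANTS]
    return max(scores) - min(scores)

print(f"score spread across variants: {prompt_sensitivity('ACME Ltd'):.1f}")
```

A spread near zero indicates prompt-stable behavior on the tested variants; a large spread flags the brittleness described above before operators encounter it in production.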
A bank trains teams to use an agent for compliance risk assessment. The original training prompt is: "Using a scale of 1-10, assess the financial crime risk of this customer. Consider transaction patterns, counterparty relationships, and jurisdictional factors. Provide a score and brief explanation."
Operators use this prompt and get consistent, reasonable risk scores. Over time, different operators make small modifications. Operator A adds "The score should be conservative to avoid missing risk." Operator B abbreviates the prompt to "Quickly assess financial crime risk 1-10. Transaction patterns, counterparty, jurisdiction. Score and why."
The original prompt produces average scores of 4.5 across customers. Operator A's version produces average scores of 6.2 (the "avoid missing risk" instruction biases the model toward caution). Operator B's version produces average scores of 3.8 (the abbreviated prompt elicits less thorough analysis).
The bank's risk committee notices that some operators' customer risk assessments are systematically higher than others. Investigation reveals the actual cause: different operators are using different prompts. The agent's outputs vary significantly based on prompt formulation. The bank must standardize prompts and retrain operators. But the incident reveals the system is brittle: output quality depends critically on prompt formulation, and this formulation is difficult to control across many operators.
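The risk committee's finding suggests a simple monitoring control: group logged assessments by operator and flag operators whose mean score deviates from the population mean beyond a tolerance. A minimal sketch, assuming a log of `(operator_id, risk_score)` pairs (field names and scores are illustrative):

```python
from collections import defaultdict
from statistics import mean

# Illustrative assessment log: (operator_id, risk_score) pairs.
LOG = [
    ("op_original", 4.4), ("op_original", 4.6), ("op_original", 4.5),
    ("op_a", 6.1), ("op_a", 6.3), ("op_a", 6.2),
    ("op_b", 3.7), ("op_b", 3.9), ("op_b", 3.8),
]

def flag_drifting_operators(log, tolerance=1.0):
    """Return operators whose mean score deviates from the overall
    mean by more than `tolerance` -- a crude prompt-drift signal."""
    by_op = defaultdict(list)
    for operator, score in log:
        by_op[operator].append(score)
    overall = mean(score for _, score in log)
    return sorted(
        op for op, scores in by_op.items()
        if abs(mean(scores) - overall) > tolerance
    )

print(flag_drifting_operators(LOG))  # → ['op_a', 'op_b']
```

A flagged operator does not prove a prompt change caused the drift, but it tells the committee where to look; in the scenario above it would have surfaced Operators A and B before the variance reached the risk committee.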
| Dimension | Score | Rationale |
|---|---|---|
| D - Detectability | 3 | Prompt sensitivity is difficult to detect unless outputs are explicitly monitored for prompt-change-driven variance. |
| A - Autonomy Sensitivity | 2 | Occurs at all autonomy levels; structural to LLM properties. |
| M - Multiplicative Potential | 4 | Each operator or application change that modifies prompts can cause output changes. |
| A - Attack Surface | 3 | Operators or insiders could intentionally modify prompts to skew outputs. Adversary could engineer prompts to produce desired outputs. |
| G - Governance Gap | 4 | Risk governance assumes agent outputs are stable. Prompt sensitivity breaks this assumption. |
| E - Enterprise Impact | 2 | Output quality degradation, consistency issues, but impact is typically addressable through prompt standardization. |
| Composite DAMAGE Score | 3.4 | High. Requires priority attention and dedicated controls. |
How severity changes across the agent architecture spectrum.
| Agent Type | Impact | How This Risk Manifests |
|---|---|---|
| Digital Assistant | Moderate | Human user may use different phrasings with assistant, getting different outputs. |
| Digital Apprentice | High | As agent autonomy increases, human phrasings matter less, but agent's internal prompts may vary. |
| Autonomous Agent | High | Agent determines how to formulate internal prompts for reasoning. Prompt variations compound across reasoning steps. |
| Delegating Agent | High | Agent formulates requests to delegated models. Different formulations produce different recommendations. |
| Agent Crew / Pipeline | Critical | Multiple agents formulate prompts differently. Inconsistencies compound through pipeline. |
| Agent Mesh / Swarm | Critical | Peer-to-peer agents with different prompt formulation strategies. Systemic inconsistency. |
| Framework | Coverage | Citation | What It Addresses | What It Misses |
|---|---|---|---|---|
| NIST AI RMF 1.0 | Partial | MAP 1.1, MAP 2.1 | Recommends model testing and validation. | Does not address prompt sensitivity testing. |
| EU AI Act | Partial | Article 24, Article 29 | Requires testing of high-risk AI systems. | Does not specifically address prompt sensitivity. |
| MAS AIRG | Partial | Section 6.1 (Governance) | General governance requirements. | Does not address prompt sensitivity. |
| OWASP LLM Top 10 | Partial | LLM04 (Data and Model Poisoning) | Addresses input poisoning. | Does not address legitimate prompt variation. |
| BCBS 239 | Minimal | Data governance principles | General data governance. | Does not address prompt sensitivity. |
In risk assessment and compliance, consistency is critical. If risk scores vary depending on operator phrasing, the risk assessment framework is not reliable. Regulators expect risk assessments to be consistent and defensible. An institution where risk outputs vary based on prompt formulation cannot justify its risk assessments to regulators.
In customer-facing contexts, inconsistent outputs damage customer trust. If one customer receives one recommendation and another customer receives a different recommendation (due to prompt variation rather than objective differences), the institution is not providing equitable treatment.
Prompt Sensitivity and Brittleness requires architectural controls that go beyond what existing frameworks provide. Our advisory engagements are purpose-built for banks, insurers, and financial institutions subject to prudential oversight.
Schedule a Briefing