R-RE-12 | Reasoning & Epistemic | DAMAGE 4.0 / Critical

World Model Misalignment

The agent constructs an internal representation of how the institution operates that reflects a generic financial institution from its training data rather than this specific institution.

The Risk

An agent's internal model of how the world works (what the organization does, how it makes decisions, what it cares about) is learned from training data and shaped by prompts and examples. However, this learned world model may not match actual institutional reality. The organization may have unique processes, cultural norms, regulatory commitments, or risk appetites that do not match the generic patterns in the training data.

When the agent encounters a situation not covered by its training or examples, it falls back on generic patterns learned from public data. These generic patterns may be completely misaligned with the institution's actual approach.

This risk is fundamentally agentic: agents are trained on general data yet must act within a specific institutional context. The wider the gap between the agent's training data and the institution's actual operations, the greater the risk.

How It Materializes

A bank trains an agent on public news, articles, and regulatory documents to assist in understanding regulatory changes. The agent is trained to recognize regulatory patterns and to assess risk implications.

A new regulation is issued that is not covered by the agent's training data and is somewhat ambiguous in its application to the bank's specific business model. Lacking explicit guidance, the agent falls back on generic patterns: "regulators typically interpret ambiguous rules in favor of consumer protection; therefore this regulation probably requires the most consumer-protective interpretation."

However, the bank's world model is different. Based on the bank's experience with this regulator and prior regulatory engagement, the bank's interpretation is more nuanced: "the regulator is willing to accept reasonable interpretations aligned with the bank's business model; overly conservative interpretation would undermine the bank's competitiveness without providing additional safety."

The agent's recommendation (an overly conservative interpretation) conflicts with the bank's world model (a balanced interpretation aligned with its business model). If the agent is deployed where human override is not possible, it may produce decisions that are misaligned with the bank's actual priorities.

DAMAGE Score Breakdown

Dimension	Score	Rationale
D - Detectability	3	World model misalignment is invisible until the agent behaves in a misaligned way.
A - Autonomy Sensitivity	4	The agent acts autonomously on its learned world model.
M - Multiplicative Potential	4	Impact scales with how often the agent operates in scenarios not covered by its training.
A - Attack Surface	5	Learning the world model from general training data creates the vector.
G - Governance Gap	5	No standard framework requires agents to model the institution's world model.
E - Enterprise Impact	2	Operational decisions are misaligned with risk appetite; the institution must implement institutional context training.
Composite DAMAGE Score	4.0	Critical. Requires immediate architectural controls. Cannot be accepted.
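
The page does not state how the composite is derived from the six dimension scores. As a rough sketch only, the following assumes an unweighted mean rounded to the nearest half point. That assumption happens to reproduce the published 4.0, but the actual DAMAGE methodology may weight dimensions differently. The function name `composite` and the `step` parameter are hypothetical.

```python
from statistics import mean

# Dimension scores copied from the table above.
scores = {
    "Detectability": 3,
    "Autonomy Sensitivity": 4,
    "Multiplicative Potential": 4,
    "Attack Surface": 5,
    "Governance Gap": 5,
    "Enterprise Impact": 2,
}


def composite(dimension_scores, step=0.5):
    """ASSUMPTION: unweighted mean, rounded to the nearest `step`.

    The source does not document the real aggregation rule.
    """
    raw = mean(dimension_scores.values())
    return round(raw / step) * step


raw_mean = mean(scores.values())
composite_score = composite(scores)
print(f"raw mean = {raw_mean:.2f}, composite = {composite_score}")
```

Under this assumption the raw mean is 23/6, which rounds to the 4.0 shown in the table.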

Agent Impact Profile

How severity changes across the agent architecture spectrum.

Agent Type	Impact	How This Risk Manifests
Digital Assistant	Low	Human understands institutional world model.
Digital Apprentice	Medium	Apprentice governance includes institutional context training.
Autonomous Agent	High	Agent operates from generic training data world model.
Delegating Agent	High	Agent invokes tools from misaligned world model.
Agent Crew / Pipeline	Critical	Multiple agents trained on generic data; institutional world model is not shared.
Agent Mesh / Swarm	Critical	Agents trained independently; institutional world model is fragmented across agents.

Regulatory Framework Mapping

Framework	Coverage	Citation	What It Addresses	What It Misses
NIST AI RMF 1.0	Partial	MAP.2	Recommends understanding AI system context and limitations.	Does not address institutional world model alignment.
SR 11-7 / MRM	Partial	Model Risk Management (Section 2)	Expects models to be validated in organizational context.	Does not address world model alignment.

Why This Matters in Regulated Industries

In banking and financial services, institutional context matters enormously. A bank's world model includes its risk tolerance, its relationships with regulators, its competitive strategy, and its cultural values. Agents that operate from a misaligned world model will make decisions that sound reasonable on the surface but are out of step with institutional reality.

Controls & Mitigations

Design-Time Controls

Implement institutional context training: train agents on institutional data (internal policies, past decisions, regulatory engagement history) to learn the institutional world model.
Implement explicit world model documentation: document the institution's world model, risk appetite, regulatory engagement approach, and decision-making priorities.
Implement world model alignment testing: test the agent against representative scenarios to verify that its recommendations align with the institutional world model.
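
As one possible shape for alignment testing, the sketch below runs an agent over scenarios whose expected stance comes from the documented institutional world model and reports the scenarios where the agent disagrees. Everything here is hypothetical: the `Scenario` type, the stance labels, and `generic_agent`, which stands in for a real agent call and always chooses the most conservative reading.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    name: str
    prompt: str
    expected_stance: str  # taken from the documented institutional position


def run_alignment_tests(
    agent: Callable[[str], str], scenarios: list[Scenario]
) -> list[str]:
    """Return the names of scenarios where the agent's stance differs
    from the documented institutional stance."""
    return [s.name for s in scenarios if agent(s.prompt) != s.expected_stance]


def generic_agent(prompt: str) -> str:
    # Stand-in for a real agent: always picks the most conservative reading.
    return "most_conservative"


scenarios = [
    Scenario("ambiguous_new_rule", "How should rule X apply to product Y?", "balanced"),
    Scenario("clear_prohibition", "May we skip a mandated disclosure?", "most_conservative"),
]

misaligned = run_alignment_tests(generic_agent, scenarios)
print(misaligned)
```

Exact string comparison keeps the sketch small; a real harness would likely need a richer comparison of recommendations and human review of borderline cases.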

Runtime Controls

Monitor for world model misalignment: detect when agent recommendations diverge significantly from institutional norms or past decisions.
Implement override tracking: when humans override agent recommendations due to world model misalignment, log these overrides and use them to improve institutional context training.
Flag novel scenarios: when the agent encounters situations not covered by institutional context, escalate for human decision rather than falling back on generic patterns.
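
The runtime controls above could be combined into a single routing step. This is a minimal sketch, assuming institutional context can be summarized as a mapping from topic to the stances seen in past decisions; the names `Recommendation`, `route`, and the topic and stance labels are illustrative, not part of any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    topic: str
    stance: str


# Hypothetical institutional context: topic -> stances seen in past decisions.
precedent = {
    "disclosure_rules": {"most_conservative"},
    "fee_structure": {"balanced"},
}


def route(rec: Recommendation, precedent: dict[str, set[str]]) -> str:
    """Decide how to handle a recommendation before it is acted on."""
    known_stances = precedent.get(rec.topic)
    if known_stances is None:
        # Topic not covered by institutional context: escalate to a human
        # instead of letting the agent fall back on generic patterns.
        return "escalate_novel"
    if rec.stance not in known_stances:
        # Topic is covered, but the stance departs from past decisions.
        return "flag_divergence"
    return "proceed"


print(route(Recommendation("new_regulation", "most_conservative"), precedent))
print(route(Recommendation("fee_structure", "most_conservative"), precedent))
print(route(Recommendation("fee_structure", "balanced"), precedent))
```

Checking for novelty before divergence means an uncovered topic always reaches a human, even when the agent's stance looks plausible.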

Detection & Response

Audit world model alignment: periodically review agent decisions and verify that they align with the institutional world model.
Retrain agents on institutional context: if world model misalignment is detected, retrain the agent on updated institutional context data.
Track override patterns: if agents are frequently overridden for world model misalignment, investigate and fix the underlying context gap.
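
Override pattern tracking can be sketched as a simple count over an override log. The log schema (`topic`, `reason`), the reason label `world_model_misalignment`, and the threshold of three are all assumptions made for illustration.

```python
from collections import Counter


def frequent_override_topics(override_log, min_count=3):
    """Return topics overridden for world model misalignment at least
    `min_count` times, sorted alphabetically."""
    counts = Counter(
        entry["topic"]
        for entry in override_log
        if entry["reason"] == "world_model_misalignment"
    )
    return sorted(topic for topic, n in counts.items() if n >= min_count)


# Hypothetical override log entries.
override_log = [
    {"topic": "regulatory_interpretation", "reason": "world_model_misalignment"},
    {"topic": "regulatory_interpretation", "reason": "world_model_misalignment"},
    {"topic": "regulatory_interpretation", "reason": "world_model_misalignment"},
    {"topic": "regulatory_interpretation", "reason": "data_error"},
    {"topic": "fee_structure", "reason": "world_model_misalignment"},
]

context_gaps = frequent_override_topics(override_log)
print(context_gaps)
```

Topics returned here are candidates for investigation and for updated institutional context training.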

