Trust boundaries
The design names four trust boundaries and keeps the most sensitive assets inside the innermost one.
Customer boundary
In dedicated, on-premise, hybrid, and marketplace topologies, the platform, data, and audit trail stay inside the customer's boundary. Nothing identifying or sensitive leaves it except a privacy-scrubbed usage export that the customer can inspect.
Tenant boundary
In shared SaaS, tenants are isolated by an owner-tenant scope on every store access; cross-tenant access happens only by explicit grant.
Service boundary
Each runtime has its own least-privilege service account. The warrant signing key is reachable only through one internal service.
External boundary
The research-and-monitor side reaches the open internet; the governed core ingests only through the security scan and veracity gate. In the hybrid split, only a recipe goes out and only staged files come back.
Assets are ranked by sensitivity: customer knowledge content and PII; the warrant signing key; secrets; the audit trail and warrants; recipes and configuration; usage data; and platform availability.
Threat model
The top threats and their mitigations, drawn from the threat table, are:
| Threat | Mitigation |
|---|---|
| Cross-tenant or cross-base access | Tenant-then-kb scoping, relationship-based access, persona gating on every route |
| Credential or key compromise | Secrets in Secret Manager, signing key in KMS or HSM never exported, Workload Identity (no static keys), least privilege |
| Prompt injection | Injection scan and neutralise at ingestion; agents act only through governed, persona-checked tools with budgets and breakers |
| PII leakage through answers | PII detection with per-role redact, block, or allow; the answer gate; output scrubbing before logs and exports |
| Tampering with the audit record | Signed, timestamped, chained warrants; a transparency log; offline verification |
| Agent privilege escalation | Autonomy levels, an approval queue, circuit breakers, every action audited |
| Supply-chain compromise | Permissive-licence-only policy, an SBOM and licence gate in CI, pinned and scanned dependencies |
| Insider or operator overreach | The administrator runs the platform but cannot read content; steward actions are preview-then-commit and warranted |
| Denial of service or runaway cost | Quotas per tenant and persona, agent budgets, circuit breakers |
| Man-in-the-middle | TLS everywhere, private connectivity to the data store, no public store endpoints in production |
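The "chained warrants" mitigation in the table can be illustrated with a plain hash chain. This is a minimal sketch, not the platform's warrant format: the field names, the `GENESIS` value, and the helper functions are all hypothetical, and signing and timestamping are left out so that only the chaining is shown.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical anchor value for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes identically.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    # Each entry commits to the hash of the entry before it.
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev, "payload": payload,
                  "hash": entry_hash(prev, payload)})


def verify(chain: list) -> bool:
    # Recompute every link; any edited, removed, or reordered entry breaks it.
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

Because verification needs only the chain itself, a check of this shape can run offline, which is the property the table relies on.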
Encryption & keys
Identity, secrets & network
People authenticate by OIDC single sign-on; agents by a scoped, revocable JWT or API key. No shared accounts. IAM is least-privilege per workload, and the administrator role excludes content access: the administrator runs the platform but cannot read knowledge content, enforced both in IAM and at the API. Workload Identity removes static keys. The network uses a custom VPC, default-deny ingress, private connectivity to the data stores, stable egress for allowlisting, and private DNS; the warrant service is internal-ingress only, and there are no public store endpoints in production. All secrets live in Secret Manager, referenced by name, never in code, logs, or infrastructure state.
AI-specific controls
Prompt injection
Untrusted content is scanned and neutralised at ingestion, before extraction, so it cannot carry instructions into the extractor or downstream agents.
PII & exfiltration
PII detection with per-role redact, block, or allow; an answer gate; output scrubbing before logs and exports.
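A per-role policy of this kind might be sketched as follows. The role `analyst`, the `POLICY` mapping, the `answer_gate` name, and the email-only detector are assumptions made for illustration; a real detector would cover far more than email addresses.

```python
import re
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"


class Blocked(Exception):
    """Raised when the gate refuses to return an answer."""


# Toy detector: email addresses only (assumption, for illustration).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical role-to-action mapping; unknown roles fall back to BLOCK.
POLICY = {"steward": Action.ALLOW, "analyst": Action.REDACT}


def answer_gate(text: str, role: str) -> str:
    action = POLICY.get(role, Action.BLOCK)
    if not EMAIL.search(text) or action is Action.ALLOW:
        return text  # nothing detected, or the role may see PII
    if action is Action.REDACT:
        return EMAIL.sub("[REDACTED]", text)
    raise Blocked("answer contains PII not permitted for role " + role)
```

Defaulting unknown roles to block keeps the gate fail-closed: a role missing from the mapping cannot see PII by accident.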
Grounding
Answers are grounded only in retrieved context and carry a premise chain and citations; the veracity gate refuses high-stakes grounding on a single low-tier source.
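The veracity rule can be sketched as a small predicate. The tier numbering (1 is most trusted), the `LOW_TIER_THRESHOLD` value, and the `Source` type are assumptions; the source text fixes only the behaviour of refusing high-stakes grounding on a single low-tier source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    source_id: str
    tier: int  # assumption: 1 = most trusted, larger = less trusted


LOW_TIER_THRESHOLD = 3  # hypothetical: tier 3 and above counts as low-tier


def veracity_gate(sources: list, high_stakes: bool) -> bool:
    """Return True if an answer may be grounded on these sources."""
    if not sources:
        return False  # nothing retrieved, nothing to ground on
    if not high_stakes:
        return True
    distinct = {s.source_id for s in sources}
    only_low = all(s.tier >= LOW_TIER_THRESHOLD for s in sources)
    # Refuse exactly the case named in the text: one source, and it is low-tier.
    return not (len(distinct) == 1 and only_low)
```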
Agent containment
Agents act only through governed tools within autonomy levels, budgets, and circuit breakers; every action is audited and reversible, with an approval queue for escalation.
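As an illustration of budgets and circuit breakers around a governed tool, consider the wrapper below. The class name, the call-count budget, and the consecutive-failure breaker are assumptions; autonomy levels, the approval queue, and reversibility are not modelled.

```python
class BudgetExceeded(Exception):
    pass


class BreakerOpen(Exception):
    pass


class GovernedTool:
    """Wraps a callable with a call budget, a breaker, and an audit list."""

    def __init__(self, func, budget: int, max_failures: int):
        self.func = func
        self.remaining = budget
        self.max_failures = max_failures
        self.failures = 0  # consecutive failures so far
        self.audit = []    # every attempt is recorded, including refusals

    def call(self, *args):
        if self.failures >= self.max_failures:
            self.audit.append(("refused", "breaker_open"))
            raise BreakerOpen("too many consecutive failures")
        if self.remaining <= 0:
            self.audit.append(("refused", "budget_exceeded"))
            raise BudgetExceeded("call budget spent")
        self.remaining -= 1
        try:
            result = self.func(*args)
        except Exception as exc:
            self.failures += 1
            self.audit.append(("failed", repr(exc)))
            raise
        self.failures = 0
        self.audit.append(("ok", args))
        return result
```

In this sketch an open breaker stays open; reopening it would be a deliberate step, for example through the approval queue.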
Data lifecycle & erasure
Each data class has its own retention, and retention windows are versioned configuration so tuning is auditable.
| Class | Default retention | Erasable |
|---|---|---|
| Raw sources | Per recipe and residency window | Yes, with cascade |
| Derived knowledge (chunks, items, embeddings) | Life of the base | Yes, via retraction or erasure |
| Snapshots | Per snapshot policy | Yes, oldest first |
| Agentic memory | Per policy (decisions kept longer) | Yes, with care for warranted decisions |
| Warrants | Long, for the evidence window | Hashes and commitments kept; raw content erasable |
| Usage data | Local window; central is anonymous | Local erasable; central carries no identity |
Right-to-erasure
A subject-data-erasure request resolves to items, sources, memories, and spans, then erases across every store in one governed, provenanced operation.
Locate
Find the personal data: PII spans, grounded items, source documents, memory items, exports.
Preview
Show the cascade as an impact diff, exactly like a retraction, so the effect is visible before commit.
Erase per store
Delete all versions in the raw bucket; delete or redact in the graph with a retraction cascade; delete or redact rows in the operational store; erase memory payloads.
Preserve the proof
The warrant chain holds because erased premises and sources remain as salted commitments and content hashes, never raw personal data, so the signature still verifies but the content cannot be revealed.
Record
The erasure writes provenance and, where decision-bearing, a warrant of the erasure itself.
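The "preserve the proof" step can be illustrated with a salted commitment. This is a sketch under stated assumptions: the warrant layout, the `store` dictionary, and the helper names are hypothetical, and an HMAC with a demo key stands in for the real signature, whose key stays in KMS or HSM.

```python
import hashlib
import hmac
import json
import secrets

SIGNING_KEY = b"demo-only-key"  # stand-in only; not how the real key is held


def commit(content: bytes, salt: bytes) -> str:
    # Salted so the commitment cannot be matched against guessed content
    # without also knowing the salt.
    return hashlib.sha256(salt + content).hexdigest()


def sign(warrant: dict) -> str:
    body = json.dumps(warrant, sort_keys=True).encode("utf-8")
    return hmac.new(SIGNING_KEY, body, hashlib.sha256).hexdigest()


def verify(warrant: dict, signature: str) -> bool:
    return hmac.compare_digest(sign(warrant), signature)


# The warrant is issued over commitments, never over raw premises.
store = {"premise-1": b"example personal data"}
salt = secrets.token_bytes(16)
warrant = {"decision": "approve",
           "premises": [commit(store["premise-1"], salt)]}
signature = sign(warrant)

# Erasure removes the raw content; the warrant itself is untouched.
del store["premise-1"]
still_valid = verify(warrant, signature)
```

Since the signature covers only the commitment, deleting the raw premise leaves verification unaffected, and the commitment alone does not reveal the erased content.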
Tenancy & residency
Tenancy is a first-class, always-present concept: every knowledge base and principal belongs to a tenant, and single-tenant is simply one tenant. A small control plane, scoped to the platform operator and excluded from MCP, provisions and configures tenants but never reads knowledge content; the data plane serves and governs knowledge, resolving the tenant before the knowledge base. Shared deployments use row-level isolation by owner tenant, enforced in the data-access layer rather than left to callers; dedicated tenants get their own stores and optionally their own transparency log and governance key, with an identical schema so code does not branch. Data residency is pinned per tenant: all of a tenant's data and AI calls stay in its region, and a residency change is an explicit migrate-and-repoint, not a flag flip. Per-tenant quotas cover knowledge bases, storage, daily tokens, and seats, and exceeding one throttles or refuses with a clear error rather than degrading a neighbour.
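Row-level isolation enforced in the data-access layer might look like the sketch below, which uses an in-memory SQLite database. The table layout, the `TenantScopedStore` class, and the column names are assumptions; the point is only that the tenant filter is bound once, resolved before the knowledge base, and never supplied by the caller per query.

```python
import sqlite3


def make_conn() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        "id INTEGER PRIMARY KEY, owner_tenant TEXT NOT NULL, "
        "kb_id TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return conn


class TenantScopedStore:
    """Every query carries the owner-tenant filter; callers cannot omit it."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str):
        self._conn = conn
        self._tenant_id = tenant_id  # resolved before any knowledge base

    def add_item(self, kb_id: str, body: str) -> None:
        self._conn.execute(
            "INSERT INTO items (owner_tenant, kb_id, body) VALUES (?, ?, ?)",
            (self._tenant_id, kb_id, body),
        )

    def items(self, kb_id: str) -> list:
        rows = self._conn.execute(
            "SELECT body FROM items "
            "WHERE owner_tenant = ? AND kb_id = ? ORDER BY id",
            (self._tenant_id, kb_id),
        )
        return [row[0] for row in rows]
```

Two tenants that both use a knowledge base named `kb1` see only their own rows, because the store object is bound to one tenant at construction.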
Resilience & DR
Recovery targets are set per environment and per tenant plan, and confirmed against measured restore times.
| Class | RPO (data loss) | RTO (downtime) |
|---|---|---|
| Operational store | Minutes (point-in-time recovery) | Low single-digit hours |
| Knowledge store | Hours (scheduled, tighter with continuous backup) | Low single-digit hours |
| Object storage | Near zero (versioned, durable) | Minutes |
| Warrants & chain | Zero tolerated loss | Restored with the operational store |
A restore brings back a consistent point across stores, replays the erasure log before serving so that data erased after the backup was taken does not reappear, and restores into the tenant's residency region so residency holds through DR. Upgrades are zero-downtime: Cloud Run revisions shift traffic, and a rollback is a traffic shift back to the previous revision. PostgreSQL schema changes are additive and backward-compatible (expand-then-contract), so old and new revisions can run concurrently. Each release records its schema version and refuses to start against an incompatible one; a pack declares the substrate range it targets, which is checked before the pack is applied. Because the OKF bundle store is the system of record, DR can also restore the bundles and replay the loader to rebuild the graph and indexes.
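The two compatibility checks described above (a release against the schema version, a pack against the substrate range) can be sketched as one range test. The dotted-integer version format and all function names are assumptions; the actual versioning scheme is not specified here.

```python
class IncompatibleSchema(RuntimeError):
    pass


def parse(version: str) -> tuple:
    # "1.10" -> (1, 10), so comparison is numeric rather than lexical.
    return tuple(int(part) for part in version.split("."))


def in_range(version: str, lowest: str, highest: str) -> bool:
    return parse(lowest) <= parse(version) <= parse(highest)


def check_startup(db_schema: str, supported: tuple) -> None:
    """Refuse to start when the database schema is outside the release's range."""
    lowest, highest = supported
    if not in_range(db_schema, lowest, highest):
        raise IncompatibleSchema(
            "schema " + db_schema + " not in " + lowest + ".." + highest
        )


def pack_applicable(substrate_version: str, pack_range: tuple) -> bool:
    """Checked before a pack is applied, never after."""
    return in_range(substrate_version, *pack_range)
```

Parsing into integer tuples matters: compared as strings, "1.10" would sort before "1.9".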
Supply chain & testing
A permissive-licence-only policy fails the build on a GPL or AGPL transitive dependency; dependencies are pinned and scanned in CI, and images are built reproducibly and tagged by commit. Because the MCP cover and the API are generated from one contract, the agent surface cannot silently widen. Verification runs in three modes: static (IAM and policy checks, secret-reference linting, the SBOM and licence gate), dynamic (dependency and container scans, a periodic third-party penetration test against a deployed environment, tamper tests on the warrant), and continuous (a governance dashboard and alerts on breaker trips, warrant-verification failures, quarantine spikes, and auth failures).
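A licence gate of this kind could be sketched as an allowlist check over SBOM components. The `ALLOWED` set, the component dictionary shape, and the use of SPDX identifiers are assumptions for illustration; the real gate's inputs and allowlist are not given here.

```python
# Hypothetical allowlist of permissive SPDX identifiers.
ALLOWED = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC"}


def licence_violations(components: list) -> list:
    """Return the names of components whose licence is not on the allowlist.

    A component with no declared licence counts as a violation, so the
    gate fails closed rather than assuming the dependency is permissive.
    """
    return [
        component["name"]
        for component in components
        if component.get("licence") not in ALLOWED
    ]


def gate(components: list) -> int:
    """Exit code for CI: 0 passes, 1 fails the build."""
    return 1 if licence_violations(components) else 0
```

An allowlist, rather than a list of denied licences, rejects GPL and AGPL dependencies along with anything unrecognised, which matches a permissive-only policy more closely than naming the excluded licences one by one.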
Compliance posture
The substrate is built to support a customer's compliance obligations rather than to substitute for them. Concretely, it provides a tamper-evident, signed, timestamped, and chained audit record that verifies offline; a right-to-erasure workflow that operates across every store and survives backups while keeping the evidence chain intact; provenance on every governed action; data residency pinned per tenant and region; and a privacy-scrubbed usage export that carries no customer identity or content. The design supports a per-deployment data-protection impact assessment, which remains a customer-side activity.