Component model
The architecture reads as five layers. A request enters at the access layer, is served by the application layer, draws on the knowledge and AI layers, and is governed throughout. The API is the contract; the user interface is a thin client over it.
Access layer
A thin UI, the REST API as the governed contract, a set of lean MCP servers (explore, recipe, curate, agents, packaging, catalog, sync), a WebSocket channel for people, and HMAC-signed webhooks for agents.
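As an illustration of the HMAC-signed webhooks, the sketch below signs and verifies a delivery using only the Python standard library. The source does not specify the hash function or the message layout, so HMAC-SHA256 over `timestamp.body` and the function names `sign_webhook` and `verify_webhook` are assumptions made for this example.

```python
import hashlib
import hmac


def sign_webhook(secret: bytes, timestamp: str, body: bytes) -> str:
    """Return a hex HMAC-SHA256 over the timestamp and the raw body.

    Assumption: the signed message is "<timestamp>.<body>"; the real
    layout is not given in the source.
    """
    message = timestamp.encode("utf-8") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, timestamp: str, body: bytes, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_webhook(secret, timestamp, body)
    return hmac.compare_digest(expected, signature)
```

Binding the timestamp into the signed message lets a receiver reject stale deliveries as well as tampered ones; a production receiver would also check that the timestamp is recent.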
Application layer
Services organised around the knowledge lifecycle: recipe and manifest, collect and crawl, the catalog pipeline, substrate and scores, query and warrant, curate and snapshots, manipulate, agents, admin, packaging, and the tenant control plane.
Agent layer
Governed agents (discover, ingest, process, review gaps, review inconsistency) with an orchestrator, a coordinator, an approval queue, per-agent controls, and circuit breakers. Every agent writes provenance to memory.
Knowledge & data layer
The two graphs and their indexes in the graph store, the operational relational store, object storage for raw sources, and the warrant log.
A fifth, the AI layer, supplies embeddings, document-type classification, reranking, and generation, plus web search and currency monitoring. Governance is not a layer on top; trust scoring, provenance, the veracity gate, and the warrant are part of the data path, so the substrate is governed by construction.
The two graphs
The substrate runs two distinct graphs in a single graph store. They are kept separate because they answer different questions, and co-located so that one query can use both.
| | Retrieval graph | Causal Substrate |
|---|---|---|
| Purpose | Find and ground passages | Govern trust and prove reasoning |
| Core nodes | Document, Chunk, Entity | Typed knowledge items, in families |
| Edges | Has-chunk, mentions, related-to | Causal and dependency edges |
| Indexes | HNSW vector, full-text | Scores, tiers, dependency traces |
| Used for | Hybrid recall and graph expand | Premise chains, cascades, coverage, the warrant |
The retrieval graph grounds the words of an answer; the Causal Substrate governs its trust and produces its premise chain. A query touches both: retrieval finds and grounds the passages, the substrate scores the premises and assembles the warrant. The same source feeds both during ingestion. Co-location keeps a single connection seam, one backup and restore path, and one region for residency.
The Corvair Ingestion Pipeline
Bringing a source in, and keeping it current
Triggered on new objects or on a schedule, the Corvair Ingestion Pipeline runs eight stages as an idempotent batch. It is safe to re-run because writes key off stable ids.
Collect
Crawl web and intranet against a manifest, accept uploads, or import; raw content lands in object storage.
Parse
Document AI turns each source into clean text and layout, with OCR fallback for scans.
Classify
A classifier assigns a document type (policy, regulation, contract, report, news, and so on), which selects chunking and extraction and seeds default tier and family hints.
Chunk
Split into passages of roughly 300 to 800 tokens with overlap, respecting layout boundaries.
Type & extract
Classify content into the typed-knowledge families and extract entities and relations, so the same source feeds both graphs.
Embed
Embed each chunk at 1536 dimensions; the model id is stored on the chunk so a model change is a deliberate re-embedding, not silent drift.
Write
Merge Document, Chunk, and Entity nodes and the typed items and edges into the graph store.
Catalog & score
Build the topic map, compute trust tiers and scores, and run the veracity gate.
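The Chunk stage and the idempotent writes can be sketched together. This is a minimal sketch under stated assumptions: whitespace-separated words stand in for model tokens, layout boundaries are ignored, a plain dictionary stands in for the graph store, and the id scheme (a SHA-256 over document id, offset, and text) is hypothetical rather than the pipeline's actual one.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    doc_id: str
    start: int
    text: str


def chunk_tokens(doc_id: str, text: str, size: int = 500, overlap: int = 50) -> list[Chunk]:
    """Split text into overlapping windows of at most `size` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    tokens = text.split()  # assumption: whitespace words approximate tokens
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        body = " ".join(tokens[start:start + size])
        # Stable id: the same document, offset, and text always hash the same.
        digest = hashlib.sha256(f"{doc_id}:{start}:{body}".encode("utf-8")).hexdigest()
        chunks.append(Chunk(digest[:32], doc_id, start, body))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def upsert(store: dict[str, Chunk], chunks: list[Chunk]) -> None:
    """Write keyed by chunk id, so a re-run overwrites instead of duplicating."""
    for chunk in chunks:
        store[chunk.chunk_id] = chunk
```

Running `upsert` twice with the same input leaves the store unchanged, which is the property that makes a batch re-run safe.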
Currency is continuous: watchers detect change at sources; web and intranet discovery find new material; a trickle update applies incremental change without a full rebuild; a retraction invalidates a source and cascades through the dependency graph. Snapshots and a blue/green swap keep the served base stable.
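The retraction cascade can be pictured as a traversal of the dependency graph. The sketch below assumes the graph is available as a mapping from each item to the items that depend on it; the function name and the in-memory representation are illustrative, not the product's API.

```python
from collections import deque


def cascade_retraction(dependents: dict[str, list[str]], retracted: str) -> list[str]:
    """Return every item that depends, directly or transitively, on the
    retracted source, in breadth-first order."""
    affected = []
    seen = {retracted}
    queue = deque([retracted])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:  # guards against cycles and repeat visits
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected
```

Each affected item would then be invalidated or rescored; that policy is outside this sketch.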
Query and answer
How a question becomes a warranted answer, online and low-latency, over the base the ingestion pipeline has built.
Embed the question
Same model and dimension as ingestion.
Hybrid recall & graph expand
Vector plus full-text recall, then a short traversal over entity relations for connected context (recall roughly the top 20).
Rerank
A ranking model keeps roughly the top 5.
Generate
The model answers from the reranked passages plus their entities and sources as grounded context.
Assemble
The answer with its premise chain, per-item validity and gravity, contextual confidence, causal edges, and source citations down to passage level, accumulated into a running source list.
Warrant
Produce a Validity Warrant for the answer, signed in production; it can cover one answer or a whole conversation.
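The first four steps above can be sketched as one function with the model calls injected. The types and the function `answer_question` are hypothetical names for this example; the defaults of 20 and 5 follow the approximate figures given above, and the Assemble and Warrant steps are reduced here to a simple citation list.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Passage:
    passage_id: str
    text: str
    source: str


@dataclass(frozen=True)
class Answer:
    text: str
    citations: list[str]


def answer_question(
    question: str,
    embed: Callable[[str], Sequence[float]],
    recall: Callable[[str, Sequence[float], int], list[Passage]],
    rerank: Callable[[str, list[Passage]], list[Passage]],
    generate: Callable[[str, list[Passage]], str],
    recall_k: int = 20,
    keep_k: int = 5,
) -> Answer:
    """Embed, recall a wide candidate set, rerank to a short one, then generate."""
    vector = embed(question)
    candidates = recall(question, vector, recall_k)  # hybrid recall and graph expand
    ranked = rerank(question, candidates)[:keep_k]   # keep only the best few
    text = generate(question, ranked)                # answer from grounded context
    citations = [f"{p.source}#{p.passage_id}" for p in ranked]  # passage-level
    return Answer(text, citations)
```

Passing the calls in as parameters keeps the flow testable with stubs and independent of any one model provider.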
Data stores
Three data tiers plus the warrant log. An export of the graph backup plus the operational store is a complete, movable substrate, and data residency follows the region across all tiers.
Snapshots and versioning are first-class: a base can be sealed into an immutable snapshot, snapshots can be diffed, and a blue/green swap promotes a new snapshot atomically.
The hybrid split
The governed core and the research-and-monitor side are loosely coupled by design, not just as an on-premise special case. The core pushes out a recipe describing what to look for; research-and-monitor agents perform discovery and monitoring in the cloud and stage the files they find; the core pulls those staged files and runs them through its own security scan, veracity gate, and governance before admitting anything.
Out of the core
A compiled recipe fragment, a discovery brief: topics, families, sites, URL and exclusion lists, scope. No knowledge content, no PII, no premises.
Into the core
Staged candidate files plus a signed result manifest, written to a staging area the core pulls from. Nothing is admitted on arrival; everything passes the security scan and veracity gate.
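The pull-side admission rule can be sketched as follows. The manifest format is not described in the source, so this example assumes a mapping of file name to SHA-256 hex digest and assumes the manifest signature has already been verified; `security_scan` and `veracity_gate` are placeholder callables standing in for the core's own checks.

```python
import hashlib
from typing import Callable


def admit_staged(
    manifest: dict[str, str],
    staged: dict[str, bytes],
    security_scan: Callable[[bytes], bool],
    veracity_gate: Callable[[bytes], bool],
) -> tuple[list[str], dict[str, str]]:
    """Admit a staged file only if it is listed in the manifest, matches its
    recorded hash, and passes both checks. Returns (admitted, rejected)."""
    admitted = []
    rejected = {}
    for name, content in sorted(staged.items()):
        expected = manifest.get(name)
        if expected is None:
            rejected[name] = "not in manifest"
        elif hashlib.sha256(content).hexdigest() != expected:
            rejected[name] = "hash mismatch"
        elif not security_scan(content):
            rejected[name] = "failed security scan"
        elif not veracity_gate(content):
            rejected[name] = "failed veracity gate"
        else:
            admitted.append(name)
    return admitted, rejected
```

The default outcome is rejection: a file reaches `admitted` only by clearing every check in order.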
Deployment topologies
Tenancy is first-class and always present, so the same code base serves one tenant or many. An installation hosts multiple knowledge bases; every route and store is scoped by tenant and knowledge base.
| Topology | For |
|---|---|
| Single-tenant / dedicated | Most customers, in their own cloud, on-premise, or hybrid, with dedicated stores and full isolation. |
| Shared SaaS | Smaller customers, several tenants on shared infrastructure with row-level isolation by owner tenant. |
| Cloud marketplace | Auto-installs into the customer's own cloud. |
| Managed service | Run on the customer's behalf. |
The reference deployment co-locates the graph store, both Cloud Run workloads, the operational store, and the AI calls in a single region, to keep retrieval latency low and residency clear. Where serverless is unavailable, the same containers run on a hosted container platform or Kubernetes equivalent, so on-premise is never blocked. Latency and residency are the trade-offs to manage in a split: co-location minimises latency, separation adds sovereignty.
The Validity Warrant
The warrant is a signed, replayable record of what the substrate knew and how it reached an answer or a decision. One pipeline signs both the knowledge-time warrant (an answer) and the decision-time warrant (an action against a snapshot). The signed payload has five layers: the answer with its confidence; the premise chain with per-item validity and gravity; the causal edges; the source lineage with tier and passage reference; and metadata with recipe version and timestamp. Binding fields make it replayable: recipe version, snapshot id, previous-warrant hash, and a content hash for every cited source.
Canonicalize & hash
Deterministic serialisation (RFC 8785 JSON canonicalization), SHA-384.
Sign
With the governance authority key, as a W3C Verifiable Credential Data Integrity proof; ECDSA P-384 with SHA-384 for v1.
Timestamp
An RFC 3161 trusted timestamp from a third party, not the issuer clock.
Chain & log
Link each warrant to the previous by hash and append to a Merkle transparency log with an inclusion proof.
Verify
Offline and mechanical: re-hash, check the signature and key validity, verify the timestamp and Merkle inclusion, and re-fetch sources to flag drift (drift is reported, not treated as a verification failure).
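The canonicalize, hash, and chain steps can be sketched with the standard library. Note the assumptions: `json.dumps` with sorted keys and compact separators only approximates RFC 8785 (it agrees for ASCII keys with string and integer values, but not for every number or key-ordering case), the field name `previous_warrant_hash` is illustrative, and signing, timestamping, and the Merkle log are omitted.

```python
import hashlib
import json


def canonical_bytes(payload: dict) -> bytes:
    """Deterministic serialisation: sorted keys, no insignificant whitespace.
    An approximation of RFC 8785, not a full implementation."""
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def warrant_hash(payload: dict) -> str:
    """SHA-384 over the canonical form, as a hex string."""
    return hashlib.sha384(canonical_bytes(payload)).hexdigest()


def append_warrant(chain: list[dict], body: dict) -> dict:
    """Link a new warrant to the hash of the previous one and append it."""
    previous = warrant_hash(chain[-1]) if chain else None
    warrant = {**body, "previous_warrant_hash": previous}
    chain.append(warrant)
    return warrant


def verify_chain(chain: list[dict]) -> bool:
    """Re-hash each warrant and check that the next one points at it."""
    previous = None
    for warrant in chain:
        if warrant.get("previous_warrant_hash") != previous:
            return False
        previous = warrant_hash(warrant)
    return True
```

Because each warrant embeds the hash of its predecessor, altering any earlier warrant changes its hash and breaks every link after it.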
The signing key is generated in and never leaves Cloud KMS or an HSM; a separate governance-authority identity signs warrants; keys rotate on schedule with old public keys retained so historical warrants still verify. Algorithm agility keeps the path open to post-quantum signatures. The warrant survives right-to-erasure because it holds hashes and salted commitments, not content.
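A salted commitment of the kind mentioned above can be sketched in a few lines. The construction is an assumption, since the source does not specify it: SHA-384 over a random 32-byte salt followed by the content, with the salt stored next to the content rather than inside the warrant.

```python
import hashlib
import hmac
import secrets


def commit(content: bytes) -> tuple[str, bytes]:
    """Return (commitment, salt). Only the commitment goes into the warrant."""
    salt = secrets.token_bytes(32)
    digest = hashlib.sha384(salt + content).hexdigest()
    return digest, salt


def opens(commitment: str, salt: bytes, content: bytes) -> bool:
    """Check that the salt and content reproduce the commitment."""
    candidate = hashlib.sha384(salt + content).hexdigest()
    return hmac.compare_digest(candidate, commitment)
```

Under this assumption, erasing the content and its salt leaves a commitment that still verifies as part of the signed payload but can no longer be matched against guessed content.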
Technology stack
The reference build is Google Cloud native, using the Gemini Enterprise Agent Platform for the AI calls. Serverless-first, with a hosted or self-managed fallback for every serverless choice.
| Responsibility | Service |
|---|---|
| Knowledge store | Neo4j AuraDB on GCP (graphs, HNSW vector, full-text; Bolt and Cypher) |
| Operational store | Cloud SQL for PostgreSQL (with LISTEN/NOTIFY) |
| Raw landing zone | Cloud Storage |
| Parsing & OCR | Document AI |
| Ingestion / serving compute | Cloud Run Job (batch) / Cloud Run Service (API and MCP) |
| Embeddings | gemini-embedding-001 (1536 dimensions) |
| Reranking | Ranking API (semantic-ranker-default-004) |
| Generation | Gemini (interactive and batch roles) |
| Web search & monitoring | Exa, Google grounded web search, crawl4ai (intranet) |
| Secrets / keys | Secret Manager / Cloud KMS or HSM |
| Identity | IAM plus Workload Identity |
| Observability | Cloud Logging, Monitoring, and Trace |
Deliberately not used in v1: a separate vector database (AuraDB is the vector store), a managed retrieval engine that keeps vectors outside the graph, and GKE or a streaming cluster (Cloud Run covers both ingestion and serving). The portability path is real: other clouds via multi-cloud AuraDB or self-managed Neo4j Enterprise; on-premise or air-gapped via self-managed Neo4j; and the AI layer reduces to three swappable calls (embed, rerank, generate) behind one retrieval interface.
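The idea of three swappable calls behind one interface can be sketched as a structural type. The names `AIProvider`, `EchoProvider`, and `retrieve_and_answer`, and all method signatures, are hypothetical; the stand-in provider exists only to show that any object with the three methods can be substituted.

```python
from typing import Protocol, Sequence


class AIProvider(Protocol):
    """The three calls a deployment has to supply (signatures assumed)."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...
    def rerank(self, query: str, passages: Sequence[str], top_n: int) -> list[int]: ...
    def generate(self, prompt: str, context: Sequence[str]) -> str: ...


class EchoProvider:
    """Deterministic stand-in, e.g. for tests or an air-gapped dry run."""

    def embed(self, texts):
        return [[float(len(t))] for t in texts]

    def rerank(self, query, passages, top_n):
        words = set(query.split())
        # Order passage indexes by word overlap with the query, best first.
        order = sorted(range(len(passages)),
                       key=lambda i: -len(words & set(passages[i].split())))
        return order[:top_n]

    def generate(self, prompt, context):
        return f"{prompt} [{len(context)} passages]"


def retrieve_and_answer(provider: AIProvider, question: str,
                        passages: list[str], top_n: int = 5) -> str:
    """Caller code depends only on the interface, not on a vendor SDK."""
    keep = provider.rerank(question, passages, top_n)
    return provider.generate(question, [passages[i] for i in keep])
```

Swapping clouds or moving on-premise then means writing one new class that satisfies `AIProvider`, with no change to the calling code.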
Where OKF fits
The OKF bundle store in object storage is the durable system of record for ingested knowledge; the two graphs and the search indexes are rebuildable projections. Ingestion drafts OKF, one loader populates the graph from OKF, and scoring and warranting finalise the bundle. Backup, disaster recovery, and migration therefore operate on the bundles: restore them and replay the loader to rebuild the graph, with no separate graph backup to keep authoritative. See the knowledge-format page for the full design.