Architecture | Corvair Knowledge Substrate

Component model

The architecture reads as five layers: access, application, agent, knowledge and data, and AI. A request enters at the access layer, is served by the application layer, draws on the knowledge and AI layers, and is governed throughout. The API is the contract; the user interface is a thin client over it.

Access layer

A thin UI, the REST API as the governed contract, a set of lean MCP servers (explore, recipe, curate, agents, packaging, catalog, sync), a WebSocket channel for people, and HMAC-signed webhooks for agents.
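
As an illustration of the HMAC-signed webhooks mentioned above, here is a minimal sketch of how a receiver might check a signature. The choice of SHA-256, the hex encoding, and the function names are assumptions for the example; the source does not specify the signing scheme.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Return a hex HMAC-SHA256 digest of the raw request body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the digest and compare it to the received one in constant time."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, signature)
```

The comparison uses `hmac.compare_digest` rather than `==` so that the check does not leak timing information about where the two strings differ.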

Application layer

Services organised around the knowledge lifecycle: recipe and manifest, collect and crawl, the catalog pipeline, substrate and scores, query and warrant, curate and snapshots, manipulate, agents, admin, packaging, and the tenant control plane.

Agent layer

Governed agents (discover, ingest, process, review gaps, review inconsistency) with an orchestrator, a coordinator, an approval queue, per-agent controls, and circuit breakers. Every agent writes provenance to memory.

Knowledge & data layer

The two graphs and their indexes in the graph store, the operational relational store, object storage for raw sources, and the warrant log.

A fifth, the AI layer, supplies embeddings, document-type classification, reranking, and generation, plus web search and currency monitoring. Governance is not a layer on top; trust scoring, provenance, the veracity gate, and the warrant are part of the data path, so the substrate is governed by construction.

The two graphs

The substrate runs two distinct graphs in the same graph store. They are kept separate because they answer different questions, and co-located so that one query can use both.

	Retrieval graph	Causal Substrate
Purpose	Find and ground passages	Govern trust and prove reasoning
Core nodes	Document, Chunk, Entity	Typed knowledge items, in families
Edges	Has-chunk, mentions, related-to	Causal and dependency edges
Indexes	HNSW vector, full-text	Scores, tiers, dependency traces
Used for	Hybrid recall and graph expand	Premise chains, cascades, coverage, the warrant

The retrieval graph grounds the words of an answer; the Causal Substrate governs its trust and produces its premise chain. A query touches both: retrieval finds and grounds the passages, the substrate scores the premises and assembles the warrant. The same source feeds both during ingestion. Co-location keeps a single connection seam, one backup and restore path, and one region for residency.

The Corvair Ingestion Pipeline

Bringing a source in, and keeping it current

Triggered on new objects or on a schedule, the Corvair Ingestion Pipeline runs eight stages as an idempotent batch. It is safe to re-run because writes key off stable ids.

Collect

Crawl web and intranet against a manifest, accept uploads, or import; raw content lands in object storage.

Parse

Document AI turns each source into clean text and layout, with OCR fallback for scans.

Classify

A classifier assigns a document type (policy, regulation, contract, report, news, and so on), which selects chunking and extraction and seeds default tier and family hints.

Chunk

Split into passages of roughly 300 to 800 tokens with overlap, respecting layout boundaries.

Type & extract

Classify content into the typed-knowledge families and extract entities and relations, so the same source feeds both graphs.

Embed

Embed each chunk at 1536 dimensions; the model id is stored on the chunk so a model change is a deliberate re-embedding, not silent drift.

Write

Merge Document, Chunk, and Entity nodes and the typed items and edges into the graph store.

Catalog & score

Build the topic map, compute trust tiers and scores, and run the veracity gate.
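
Two of the ideas in the stages above can be sketched briefly: overlapping chunks, and stable ids that make a re-run idempotent. This is a simplified illustration, not the pipeline's implementation. The window size, the overlap, the id recipe, and the function names are assumptions, and the sketch ignores layout boundaries, which the Chunk stage respects.

```python
import hashlib


def chunk_tokens(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def stable_chunk_id(source_uri: str, content_hash: str, index: int) -> str:
    """Derive a deterministic id (hypothetical recipe) so a re-run yields the same keys."""
    key = f"{source_uri}|{content_hash}|{index}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:32]
```

Because the id depends only on the source, its content hash, and the chunk position, writing the same chunk twice overwrites one record instead of creating a duplicate.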

Currency is continuous: watchers detect change at sources; web and intranet discovery find new material; a trickle update applies incremental change without a full rebuild; a retraction invalidates a source and cascades through the dependency graph. Snapshots and a blue/green swap keep the served base stable.
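
The retraction cascade can be pictured as a graph traversal. The sketch below assumes a plain adjacency map from each item to the items that depend on it. The map, the id format, and the function name are hypothetical; the real dependency graph lives in the graph store.

```python
from collections import deque


def cascade_retraction(dependents: dict[str, list[str]], retracted: str) -> set[str]:
    """Return the retracted item plus everything that depends on it, directly or transitively."""
    affected = {retracted}
    queue = deque([retracted])
    while queue:
        current = queue.popleft()
        for item in dependents.get(current, []):
            if item not in affected:  # guards against cycles and repeated visits
                affected.add(item)
                queue.append(item)
    return affected
```

The returned set is the candidate list for invalidation or re-scoring; what happens to each affected item is a governance decision outside this sketch.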

Query and answer

How a question becomes a warranted answer, online and low-latency, over the base the ingestion pipeline has built.

Embed the question

Same model and dimension as ingestion.

Hybrid recall & graph expand

Vector plus full-text recall, then a short traversal over entity relations for connected context (recall roughly the top 20).

Rerank

A ranking model keeps roughly the top 5.

Generate

The model answers from the reranked passages plus their entities and sources as grounded context.

Assemble

The answer with its premise chain, per-item validity and gravity, contextual confidence, causal edges, and source citations down to passage level, accumulated into a running source list.

Warrant

Produce a Validity Warrant for the answer, signed in production; it can cover one answer or a whole conversation.
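
The recall and rerank steps above narrow roughly 20 candidates down to roughly 5. The sketch below shows one way that funnel could look. Reciprocal rank fusion is used here as a stand-in for combining the vector and full-text result lists; the source does not say how the two are merged, so treat the fusion method, the constant `k`, and the function names as assumptions. The `score` callable stands in for the ranking model.

```python
from collections.abc import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, limit: int = 20) -> list[str]:
    """Merge several ranked id lists; ids that rank high in more lists score higher."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for a deterministic order.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:limit]


def rerank(candidates: list[str], score: Callable[[str], float], keep: int = 5) -> list[str]:
    """Keep the `keep` best candidates according to a relevance scorer."""
    return sorted(candidates, key=score, reverse=True)[:keep]
```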

Data stores

Three data tiers plus the warrant log. An export of the graph backup plus the operational store is a complete, movable substrate, and data residency follows the region across all tiers.

Knowledge store

Neo4j AuraDB on Google Cloud: both graphs, the HNSW vector index on chunk embeddings (1536-dim, cosine), and the full-text index. Reached only over Bolt and Cypher.

Operational store

Cloud SQL for PostgreSQL: recipes and manifests, jobs, personas and relationship-based access, subscriptions, agent controls and budgets, the change-event bus over LISTEN/NOTIFY, and the phase-1 warrant chain.

Object storage

Cloud Storage: raw sources as collected, the immutable landing zone that ingestion triggers off, so a re-parse or re-embed runs from source. Also the OKF bundle store.

Warrant log

Phase 1, a previous-hash chain in PostgreSQL; later, a Merkle transparency log with inclusion proofs, run per installation.
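
A previous-hash chain of the phase-1 kind can be sketched in a few lines. This is an in-memory illustration only: the entry layout, the genesis value, and the function names are assumptions, and the real chain is stored in PostgreSQL.

```python
import hashlib
import json

GENESIS = "0" * 96  # assumed placeholder for the first entry's previous hash


def entry_hash(payload: dict, prev_hash: str) -> str:
    """Hash the payload together with the previous entry's hash."""
    body = json.dumps({"payload": payload, "prev": prev_hash},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha384(body.encode("utf-8")).hexdigest()


def append(chain: list[dict], payload: dict) -> dict:
    """Append a new entry linked to the current tail of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"payload": payload, "prev": prev_hash,
             "hash": entry_hash(payload, prev_hash)}
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    """Walk the chain and check every link and every recomputed hash."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(entry["payload"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True
```

Changing any earlier payload changes its hash, which breaks the link stored in every later entry, so tampering is detectable by replaying the chain.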

Snapshots and versioning are first-class: a base can be sealed into an immutable snapshot, snapshots can be diffed, and a blue/green swap promotes a new snapshot atomically.

The hybrid split

The split between the governed core and the research-and-monitor side is loosely coupled by design, not just as an on-premise special case. The core pushes out a recipe describing what to look for; research-and-monitor agents perform discovery and monitoring in the cloud and stage the files they find; the core pulls those staged files and runs them through its own security scan, veracity gate, and governance before admitting anything.

Out of the core

A compiled recipe fragment, a discovery brief: topics, families, sites, URL and exclusion lists, scope. No knowledge content, no PII, no premises.

Into the core

Staged candidate files plus a signed result manifest, written to a staging area the core pulls from. Nothing is admitted on arrival; everything passes the security scan and veracity gate.

In the separated case only a recipe goes out and only staged candidate files come back, so the institution's knowledge never leaves its environment while it still benefits from cloud-scale research and monitoring. The research side is treated as untrusted input; results are signed so the core can verify origin and detect tampering.
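
One part of treating the research side as untrusted input is checking that each staged file is the file the result manifest describes. The sketch below assumes a manifest that maps file names to SHA-256 digests and assumes the manifest's signature has already been verified. The manifest shape and the function name are hypothetical.

```python
import hashlib


def check_staged_files(manifest: dict[str, str],
                       staged: dict[str, bytes]) -> tuple[list[str], list[str]]:
    """Split staged files into (matching, rejected) by comparing SHA-256 digests."""
    matching, rejected = [], []
    for name, content in sorted(staged.items()):
        digest = hashlib.sha256(content).hexdigest()
        if manifest.get(name) == digest:
            matching.append(name)
        else:
            # Either the file is not listed in the manifest or its content changed.
            rejected.append(name)
    return matching, rejected
```

A matching digest only establishes that the file is the one the manifest names. Matching files would still go through the security scan and the veracity gate before anything is admitted.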

Deployment topologies

Tenancy is first-class and always present, so the same code base serves one tenant or many. An installation hosts multiple knowledge bases; every route and store is scoped by tenant and knowledge base.

Topology	For
Single-tenant / dedicated	Most customers, in their own cloud, on-premise, or hybrid, with dedicated stores and full isolation.
Shared SaaS	Smaller customers, several tenants on shared infrastructure with row-level isolation by owner tenant.
Cloud marketplace	Auto-installs into the customer's own cloud.
Managed service	Run on the customer's behalf.

The reference deployment co-locates the graph store, both Cloud Run workloads, the operational store, and the AI calls in a single region, to keep retrieval latency low and residency clear. Where serverless is unavailable, the same containers run on a hosted container platform or Kubernetes equivalent, so on-premise is never blocked. Latency and residency are the trade-offs to manage in a split: co-location minimises latency, separation adds sovereignty.

The Validity Warrant

The warrant is a signed, replayable record of what the substrate knew and how it reached an answer or a decision. One pipeline signs both the knowledge-time warrant (an answer) and the decision-time warrant (an action against a snapshot). The signed payload has five layers: the answer with its confidence; the premise chain with per-item validity and gravity; the causal edges; the source lineage with tier and passage reference; and metadata with recipe version and timestamp. Binding fields make it replayable: recipe version, snapshot id, previous-warrant hash, and a content hash for every cited source.
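
The five layers and the binding fields can be pictured as a data structure. The sketch below is illustrative only. The field names and types are assumptions (for example, validity and gravity are shown as floats), and the actual payload schema is not given here.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Premise:
    item_id: str
    validity: float   # per-item validity (type assumed)
    gravity: float    # per-item gravity (type assumed)


@dataclass(frozen=True)
class SourceRef:
    source_id: str
    tier: str
    passage_ref: str
    content_hash: str  # binding field: hash of the cited source


@dataclass(frozen=True)
class WarrantPayload:
    # Layer 1: the answer with its confidence
    answer: str
    confidence: float
    # Layer 2: the premise chain
    premise_chain: tuple[Premise, ...]
    # Layer 3: causal edges as (from_item, to_item) pairs
    causal_edges: tuple[tuple[str, str], ...]
    # Layer 4: source lineage
    sources: tuple[SourceRef, ...]
    # Layer 5: metadata
    recipe_version: str  # also a binding field
    timestamp: str
    # Remaining binding fields that make the warrant replayable
    snapshot_id: str
    previous_warrant_hash: str
```

Freezing the dataclasses keeps a payload from being reassigned field by field after it is built, and `asdict` turns it into plain dictionaries ready for serialisation.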

Canonicalize & hash

Deterministic serialisation (RFC 8785 JSON canonicalization), SHA-384.

Sign

With the governance authority key, as a W3C Verifiable Credential Data Integrity proof; ECDSA P-384 with SHA-384 for v1.

Timestamp

An RFC 3161 trusted timestamp from a third party, not the issuer clock.

Chain & log

Link each warrant to the previous by hash. In phase 1 the chain lives in PostgreSQL; later, each warrant is appended to a Merkle transparency log with an inclusion proof.

Verify

Offline and mechanical: re-hash, check the signature and key validity, verify the timestamp and Merkle inclusion, and re-fetch sources to flag drift (drift is reported, not failed).
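
The canonicalize-and-hash step, and the re-hash during verification, can be sketched with the standard library. One caveat: `json.dumps` with sorted keys and compact separators only approximates RFC 8785, which also fixes number formatting and key ordering by UTF-16 code units, so a production path would use a conforming canonicalizer. The function names are hypothetical.

```python
import hashlib
import json


def canonical_bytes(payload: dict) -> bytes:
    """Serialise deterministically: sorted keys, no extra whitespace, UTF-8.

    An approximation of RFC 8785 that is adequate for simple payloads only.
    """
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False, allow_nan=False)
    return text.encode("utf-8")


def warrant_digest(payload: dict) -> str:
    """SHA-384 over the canonical form, as a hex string."""
    return hashlib.sha384(canonical_bytes(payload)).hexdigest()


def rehash_matches(payload: dict, recorded_digest: str) -> bool:
    """Verification step: recompute the digest and compare with the recorded one."""
    return warrant_digest(payload) == recorded_digest
```

Deterministic serialisation is what makes the check mechanical: two parties holding the same payload compute the same bytes, and therefore the same digest, regardless of key order in memory.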

The signing key is generated in and never leaves Cloud KMS or an HSM; a separate governance-authority identity signs warrants; keys rotate on schedule with old public keys retained so historical warrants still verify. Algorithm agility keeps the path open to post-quantum signatures. The warrant survives right-to-erasure because it holds hashes and salted commitments, not content.

Technology stack

The reference build is Google Cloud native, using the Gemini Enterprise Agent Platform for the AI calls. It is serverless-first, with a hosted or self-managed fallback for every serverless choice.

Responsibility	Service
Knowledge store	Neo4j AuraDB on GCP (graphs, HNSW vector, full-text; Bolt and Cypher)
Operational store	Cloud SQL for PostgreSQL (with LISTEN/NOTIFY)
Raw landing zone	Cloud Storage
Parsing & OCR	Document AI
Ingestion / serving compute	Cloud Run Job (batch) / Cloud Run Service (API and MCP)
Embeddings	gemini-embedding-001 (1536 dimensions)
Reranking	Ranking API (semantic-ranker-default-004)
Generation	Gemini (interactive and batch roles)
Web search & monitoring	Exa, Google grounded web search, crawl4ai (intranet)
Secrets / keys	Secret Manager / Cloud KMS or HSM
Identity	IAM plus Workload Identity
Observability	Cloud Logging, Monitoring, and Trace

Deliberately not used in v1: a separate vector database (AuraDB is the vector store), a managed retrieval engine that keeps vectors outside the graph, and GKE or a streaming cluster (Cloud Run covers both ingestion and serving). The portability path is real: other clouds via multi-cloud AuraDB or self-managed Neo4j Enterprise; on-premise or air-gapped via self-managed Neo4j; and the AI layer reduces to three swappable calls (embed, rerank, generate) behind one retrieval interface.

Where OKF fits

The OKF bundle store in object storage is the durable system of record for ingested knowledge; the two graphs and the search indexes are rebuildable projections. Ingestion drafts OKF, one loader populates the graph from OKF, and scoring and warranting finalise the bundle. Backup, disaster recovery, and migration therefore operate on the bundles: restore them and replay the loader to rebuild the graph, with no separate graph backup to keep authoritative. See the knowledge-format page for the full design.