V1.5 Catalog Integration — Architecture Deep-Dive
This page is the technical companion to the catalogs index. The user-facing walkthrough lives there. This page documents why the architecture looks the way it does, what invariants it maintains across all seven catalogs, and where the extension points live for contributors.
Audience: contributors adding a new catalog adapter, operators debugging an unexpected forge result, security teams auditing the read-only contract.
What V1.5 ships
Seven catalog adapters, all behind one ABC, all callable through two surfaces:
                          ┌─ CLI: fluid forge data-model from-source
                          │
┌─────────────────────────┤
│                         │
│                         └─ MCP: forge_from_source tool
│
▼
┌──────────────────────────────────────────────────────────────────┐
│                         Catalog dispatch                         │
│            _build_catalog_adapter / _SOURCE_ADAPTERS             │
└────────────────────────────────┬─────────────────────────────────┘
                                 │
       ┌─────────┬───────────┬───┴───────┬──────────┬────────┬──────┐
       ▼         ▼           ▼           ▼          ▼        ▼      ▼
   Snowflake   Unity     BigQuery    Dataplex     Glue    DataHub  DMM
    adapter   adapter     adapter     adapter    adapter  adapter adapter
       │         │           │           │          │        │      │
       └────┬────┴───────────┴───┬───────┴──────────┴────────┴──────┘
            ▼                    ▼
   CatalogAdapter ABC     CatalogTable, CatalogColumn,
  (4 abstract methods)    CatalogLineage, GlossaryTerm,
                          CatalogScope (Pydantic)
            │
            ▼
LogicalAgent.from_catalog
(translates CatalogTable[] → TableDefinition[] → LogicalDraft)
            │
            ▼
Staged pipeline (Logical → Builder → Readme → Transformation → Validator)
            │
            ▼
Fluid contract + .model.json sidecar + OSI v0.1.1 standalone
The four invariants
Every catalog adapter — current and future — must hold these:
1. Read-only, metadata only
No SELECT * against any data table. Adapters request metadata privileges (INFORMATION_SCHEMA-equivalent on each catalog) and nothing more. Tests pin this: every adapter's list_tables / get_table invocations on the stubbed SDK use only metadata APIs.
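A test that pins this invariant could look roughly like the sketch below. The `RecordingCursor` stub, the standalone `list_tables` function, and `assert_metadata_only` are hypothetical stand-ins, not the project's actual test helpers; the point is only that the stub records every statement so the test can inspect what the adapter asked for.

```python
class RecordingCursor:
    """Stubbed SDK cursor that records SQL instead of executing it."""

    def __init__(self, rows):
        self.statements = []
        self._rows = rows

    def execute(self, sql, params=None):
        self.statements.append(sql)
        return self

    def fetchall(self):
        return list(self._rows)


def list_tables(cursor, database, schema):
    # Hypothetical adapter method: reads table names from metadata views only.
    # `database` is assumed to be validated and quoted upstream.
    cursor.execute(
        f"SELECT table_name FROM {database}.INFORMATION_SCHEMA.TABLES "
        "WHERE table_schema = %s",
        (schema,),
    )
    return [row[0] for row in cursor.fetchall()]


def assert_metadata_only(statements):
    """Fail if any recorded statement reads from outside INFORMATION_SCHEMA."""
    for sql in statements:
        normalized = " ".join(sql.upper().split())
        assert "INFORMATION_SCHEMA." in normalized, sql
        assert "SELECT *" not in normalized, sql


cursor = RecordingCursor(rows=[("ORDERS",), ("CUSTOMERS",)])
tables = list_tables(cursor, "BIZ_LAB", "SEEDED")
assert_metadata_only(cursor.statements)
```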
2. Lazy SDK import
The adapter module imports cleanly even when its underlying SDK isn't installed. The actual SDK import happens on first method call, raising CatalogConfigError with the exact pip install suggestion if the dep is missing. This keeps fluid --help fast (no 500ms boto3 import on cold start) and lets users pip install data-product-forge without forcing every catalog SDK.
def _client(self):
    try:
        import snowflake.connector  # type: ignore[import-not-found]
    except ImportError as exc:
        raise CatalogConfigError(
            "snowflake-connector-python missing. Install with: "
            'pip install "data-product-forge[snowflake]"',
            suggestions=['pip install "data-product-forge[snowflake]"'],
        ) from exc
    ...
3. Per-call client lifecycle
The adapter constructs and closes its SDK client per logical operation. No long-lived connection pool spans tool calls inside fluid mcp serve. This defends against an upstream agent escalating "list one schema" into "dump everything" by replaying old credentials in a long-running session.
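One way to get this shape is a context manager that owns the client for exactly one operation. The sketch below is illustrative only: `FakeClient`, `PerCallAdapter`, and the factory are hypothetical names, and a real adapter would build the vendor client from freshly resolved credentials.

```python
from contextlib import contextmanager


class FakeClient:
    """Stand-in for a vendor SDK client; tracks whether it was closed."""

    def __init__(self):
        self.closed = False

    def query_metadata(self, what):
        if self.closed:
            raise RuntimeError("client already closed")
        return [f"{what}-1", f"{what}-2"]

    def close(self):
        self.closed = True


class PerCallAdapter:
    def __init__(self, client_factory):
        # Store only the factory, never a live client.
        self._client_factory = client_factory

    @contextmanager
    def _client(self):
        client = self._client_factory()
        try:
            yield client
        finally:
            client.close()  # runs even if the operation raised

    def list_tables(self):
        with self._client() as client:
            return client.query_metadata("table")


created = []


def factory():
    client = FakeClient()
    created.append(client)
    return client


adapter = PerCallAdapter(factory)
first = adapter.list_tables()
second = adapter.list_tables()
```

Two calls produce two distinct clients, and neither survives the call that created it.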
4. Soft-fail on optional reads
Lineage / glossary / sensitivity-tag reads return empty results when the user lacks the privilege OR the API isn't enabled, instead of erroring. The forge isn't blocked by missing optional metadata — you get a working contract, just without the optional signal.
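A soft-fail wrapper in this spirit might look like the following. `safe_metadata_call` is the helper name used on this page, but the signature and the `CatalogPermissionError` type shown here are assumptions made for the sketch.

```python
import logging

log = logging.getLogger("catalog.sketch")


class CatalogPermissionError(Exception):
    """Hypothetical typed error for 'privilege missing or API disabled'."""


def safe_metadata_call(fn, default, what):
    # Assumed signature. Only the typed optional-read failure is swallowed;
    # any other exception still propagates to the caller.
    try:
        return fn()
    except CatalogPermissionError as exc:
        log.info("optional %s read skipped: %s", what, exc)
        return default


def read_lineage_denied():
    raise CatalogPermissionError("lineage API not enabled")


lineage = safe_metadata_call(read_lineage_denied, default=[], what="lineage")
glossary = safe_metadata_call(lambda: ["party"], default=[], what="glossary")
```

Catching a narrow, typed error rather than bare `Exception` keeps genuine bugs visible while still letting the forge proceed without the optional signal.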
These four invariants are documented in fluid_build/copilot/catalog/_patterns.py and reused across all seven adapters.
The nine reusable patterns
_patterns.py exports three helpers and documents nine patterns:
| Pattern | Helper | Used by |
|---|---|---|
| 1. Soft-fail on optional reads | safe_metadata_call | Snowflake (PK/FK), Unity (lineage), Dataplex (glossary), DataHub (lineage), all |
| 2. Identifier validation + quoting | validate_and_quote_identifier | Snowflake, BigQuery, all SQL-issuing adapters |
| 3. Per-call client lifecycle | (built into adapter shape) | DMM (httpx), all |
| 4. Lazy SDK import + typed config error | (built into adapter shape) | All adapters |
| 5. from_resolver classmethod | (built into adapter shape) | All adapters |
| 6. Audit context excludes secrets | audit_context() Pydantic excludes SecretStr | All adapters |
| 7. Vendor error → typed error translation | translate_permission_or_connection_error | All adapters |
| 8. Two-pass fetching (list → inspect) | (architectural pattern) | All adapters |
| 9. No data values ever | (test-pinned across all adapters) | All adapters |
A new adapter follows the same nine patterns. CONTRIBUTING.md → Adding a Catalog Adapter covers each one in turn.
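As an illustration of pattern 2, the sketch below allow-lists identifier characters before quoting. The helper name comes from the table above; the regular expression, the length limit, and the double-quote style are assumptions for this example (a BigQuery variant would use backticks instead).

```python
import re

# Assumed allow-list: a letter or underscore, then letters, digits, _ or $.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def validate_and_quote_identifier(name, max_length=255):
    # Reject anything outside the allow-list instead of trying to escape it.
    if len(name) > max_length or not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return f'"{name}"'


quoted = validate_and_quote_identifier("BIZ_LAB")
```

Validating first means the quoting step never sees a quote character, so there is nothing to escape and no injection path through the identifier.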
Three-stage catalog metadata flow
Catalog signal isn't just an input to the Logical stage; it shapes every stage of the pipeline:
Stage 1 — Logical (Conceptual + IR + OSI)
| Catalog field | Logical IR field |
|---|---|
| Table name + description | Conceptual.entities[].name + .description |
| Column name + description | OSIDataset.fields[].name + .expression.description |
| Primary key | OSIDataset.primary_key[] |
| Foreign key + lineage | OSIRelationship[] (deterministic) |
| Tags (domain: party) | matches IndustryPack skeleton tag-hints |
| Classifications (PII/PHI/PCI) | OSI.custom_extensions[] |
| Glossary terms / synonyms | OSI.ai_context.synonyms + examples |
Entry point: LogicalAgent.from_catalog(...).
Stage 2 — Contract (BuilderAgent emits Fluid 0.7.2)
| Catalog field | Fluid contract field |
|---|---|
| Table owner + steward | metadata.owner.team (system roles excluded), metadata.steward |
| Business domain | metadata.domain |
| Sensitivity classifications | agentPolicy.sensitiveData[] |
| Data residency | metadata.sovereignty.jurisdiction |
| Quality SLAs | exposes[].qos |
| Lineage chain | metadata.lineage.upstream[] |
| Certified marker | metadata.certification |
The BuilderAgent prompt directive: every metadata field the catalog supplied must appear verbatim in the contract; do not re-invent.
Stage 3 — Transformation (TransformationAgent emits dbt builds[])
| Catalog field | Transformation output |
|---|---|
| Partition keys | dbt partition_by config |
| Clustering keys | dbt cluster_by config |
| Quality rules | dbt tests: (not_null, unique, accepted_values) |
| Freshness SLA | dbt freshness: config |
| Lineage upstream | dbt ref() / source() references |
| Sensitivity classifications | dbt meta: block + column-level tagging |
System roles never become owners
A specific design decision worth calling out: catalog adapters that surface a "creating role" or "table owner" field (Snowflake OWNER, Glue Owner, etc.) do not promote Snowflake-style system roles into metadata.owner.team.
_SYSTEM_ROLE_NAMES: frozenset[str] = frozenset({
    "ACCOUNTADMIN", "SYSADMIN", "SECURITYADMIN", "USERADMIN",
    "ORGADMIN", "PUBLIC", "ADMIN", "ADMINS", "ROOT", "DBO",
    "POSTGRES",
})
These names land in labels.catalogCreatingRoles (audit only) instead. The contract's owner field reflects the business team, not the role that ran the DDL — preventing every Snowflake table from being attributed to ACCOUNTADMIN.
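The routing rule can be sketched as a small function. `classify_owner` is a hypothetical name, and the case-insensitive comparison and the returned dict shape are assumptions; only the role set and the two destination fields come from this page.

```python
_SYSTEM_ROLE_NAMES = frozenset({
    "ACCOUNTADMIN", "SYSADMIN", "SECURITYADMIN", "USERADMIN",
    "ORGADMIN", "PUBLIC", "ADMIN", "ADMINS", "ROOT", "DBO",
    "POSTGRES",
})


def classify_owner(raw_owner):
    """Return contract fragments for a catalog 'owner' value."""
    owner = (raw_owner or "").strip()
    if not owner:
        return {"metadata": {}, "labels": {}}
    if owner.upper() in _SYSTEM_ROLE_NAMES:
        # Audit-only: record the role, but never promote it to an owner.
        return {"metadata": {}, "labels": {"catalogCreatingRoles": [owner]}}
    return {"metadata": {"owner": {"team": owner}}, "labels": {}}


system = classify_owner("ACCOUNTADMIN")
business = classify_owner("payments-team")
```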
Industry auto-detection
When a catalog scope is forged, the adapter's catalog tags are matched against INDUSTRY_DOMAIN_HINTS to auto-pick an industry pack. Today's hints:
| Domain tag | Industry pack |
|---|---|
| telco, telecommunications, cdr, network | telecommunications |
| healthcare, health, clinical, patient, phi | healthcare |
| finance, banking, fraud, transaction, pci | finance |
| retail, commerce, pos, merchandise, customer | retail |
Each table is matched individually; the industry with the most table-level hits across the scope is chosen. Operators can always override with --industry.
The mapping is part of the public API (pinned in tests/test_public_api_stability.py) so external scripts can predict the chosen industry before the actual forge runs.
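A script that wants to predict the outcome could reproduce the vote along these lines. The tag-to-pack pairs are taken from the table above, but the dict shape of `INDUSTRY_DOMAIN_HINTS`, the `detect_industry` function, the "first matching tag wins per table" rule, and the tie-breaking behavior are all assumptions of this sketch.

```python
from collections import Counter

# Assumed shape: one flat tag -> pack mapping built from the table above.
INDUSTRY_DOMAIN_HINTS = {
    tag: pack
    for pack, tags in {
        "telecommunications": ("telco", "telecommunications", "cdr", "network"),
        "healthcare": ("healthcare", "health", "clinical", "patient", "phi"),
        "finance": ("finance", "banking", "fraud", "transaction", "pci"),
        "retail": ("retail", "commerce", "pos", "merchandise", "customer"),
    }.items()
    for tag in tags
}


def detect_industry(tables, override=None):
    """tables: one iterable of tag strings per table in the scope."""
    if override:
        return override  # mirrors --industry taking precedence
    votes = Counter()
    for tags in tables:
        # One vote per table: the first tag that matches a hint.
        for tag in tags:
            pack = INDUSTRY_DOMAIN_HINTS.get(tag.lower())
            if pack:
                votes[pack] += 1
                break
    # Ties resolve to whichever pack was seen first (Counter ordering).
    return votes.most_common(1)[0][0] if votes else None


chosen = detect_industry([["phi", "gold"], ["Patient"], ["pci"]])
```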
Three-layer security model
V1.5 inherits the existing MCP server's three-layer access control and adds catalog-specific defenses:
Layer 1: Tool allow/deny list
The --allowed-tools / --denied-tools flags on fluid mcp serve control which tools an MCP client can call. Catalog tools are opt-out: denying forge_from_source (the only mutating catalog tool) leaves the read-only catalog tools listed under Layer 2 available with no write path exposed.
Layer 2: Read-only mode
fluid mcp serve --read-only blocks every tool that mutates files or writes store namespaces. forge_from_source is blocked; list_source_tables / inspect_source_table / list_source_lineage / list_source_glossary / list_source_adapters all run.
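Layers 1 and 2 compose into a single gate per tool call. The sketch below is a hypothetical model of that decision: the tool names come from this page, while `is_tool_callable`, its parameters, and the "deny wins over allow" precedence are assumptions.

```python
MUTATING_TOOLS = frozenset({"forge_from_source"})
READ_ONLY_CATALOG_TOOLS = frozenset({
    "list_source_tables",
    "inspect_source_table",
    "list_source_lineage",
    "list_source_glossary",
    "list_source_adapters",
})


def is_tool_callable(tool, allowed=None, denied=frozenset(), read_only=False):
    # Layer 1: explicit deny wins, then the allow list (None means "all").
    if tool in denied:
        return False
    if allowed is not None and tool not in allowed:
        return False
    # Layer 2: read-only mode blocks every mutating tool.
    if read_only and tool in MUTATING_TOOLS:
        return False
    return True


read_only_view = {
    tool: is_tool_callable(tool, read_only=True)
    for tool in sorted(MUTATING_TOOLS | READ_ONLY_CATALOG_TOOLS)
}
```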
Layer 3: Sandboxes
--writable-paths pins the filesystem roots forge_from_source may write to; --writable-namespaces pins the conceptual store namespaces. Defaults are safe: writable_paths=cwd, writable_namespaces={history, audit}.
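A containment check for the filesystem sandbox can be sketched with `pathlib`. `is_write_allowed` is a hypothetical name and this is not necessarily how the server implements it; the sketch only shows why the comparison should be done on resolved path components rather than on string prefixes.

```python
from pathlib import Path


def is_write_allowed(target, writable_paths):
    # Resolve first so ".." segments and symlinks cannot escape a root.
    resolved = Path(target).resolve()
    for root in writable_paths:
        root_resolved = Path(root).resolve()
        if resolved == root_resolved or root_resolved in resolved.parents:
            return True
    return False


inside = is_write_allowed("/tmp/project/out/contract.yaml", ["/tmp/project"])
escaped = is_write_allowed("/tmp/project/../other/contract.yaml", ["/tmp/project"])
lookalike = is_write_allowed("/tmp/project-evil/contract.yaml", ["/tmp/project"])
```

A naive `startswith` check would accept the look-alike sibling directory; comparing against `resolved.parents` does not.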
Catalog-specific defense: per-call credential lookup
The MCP server itself never holds catalog credentials. Each tool call passes credentials.credential_id — a string pointer into ~/.fluid/sources.yaml. The server resolves the pointer at call time:
- Inline credentials (only via direct CLI; never over MCP wire).
- OS keyring (macOS Keychain / Windows Credential Manager / Linux secret-service).
- ~/.fluid/sources.yaml (non-sensitive fields only; secrets live in keyring).
- Environment variables.
- Cloud metadata service (opt-in via allow_metadata_service=true).
- Fail-closed.
Resolving at call time defends against an upstream agent escalating "list tables" into "dump every secret" by replaying old credentials in a long-running MCP session.
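Assuming the list above is a precedence order, the lookup can be modeled as a chain of resolvers that ends by raising. Everything named here (`resolve_credential`, `CredentialNotFound`, the `FLUID_SECRET_` variable prefix, the dict standing in for the keyring) is hypothetical and only illustrates the fail-closed shape.

```python
class CredentialNotFound(Exception):
    """Raised when no source in the chain can supply the secret."""


def env_resolver(environ):
    def resolve(credential_id):
        # Hypothetical naming convention for the environment variable.
        key = "FLUID_SECRET_" + credential_id.upper().replace("-", "_")
        return environ.get(key)
    return resolve


def resolve_credential(credential_id, chain):
    """chain: ordered (source_name, resolver) pairs; first hit wins."""
    for source_name, resolver in chain:
        secret = resolver(credential_id)
        if secret:
            return source_name, secret
    # Fail closed: no default, no anonymous fallback.
    raise CredentialNotFound(credential_id)


fake_keyring = {"snowflake-prod": "s3cret-from-keyring"}
fake_environ = {"FLUID_SECRET_BQ_DEV": "s3cret-from-env"}
chain = [
    ("keyring", fake_keyring.get),
    ("environment", env_resolver(fake_environ)),
]

source, _secret = resolve_credential("snowflake-prod", chain)
```

Because the chain is walked on every call, nothing resolved for one tool call is available to the next.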
Audit trail
Every catalog tool call writes an audit event via copilot/store/audit_trail.py:
{
  "timestamp": "2026-04-25T14:33:21.123Z",
  "event": "catalog.list_source_tables",
  "catalog_name": "snowflake",
  "credential_id": "snowflake-prod",
  "scope": {"database": "BIZ_LAB", "schema": "SEEDED"},
  "result_summary": {"table_count": 23},
  "duration_ms": 8421
}
Credentials are scrubbed before write — audit_context() returns non-sensitive fields only. Query the trail with:
fluid memory show audit
fluid memory show audit --filter catalog.list_source_tables --window 24h
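The scrubbing step can be illustrated with a config object that marks its secret fields and an `audit_context()` that skips them. This sketch uses stdlib dataclasses with field metadata in place of the Pydantic `SecretStr` mechanism mentioned above; `SnowflakeConfig` and `build_audit_event` are hypothetical names.

```python
import json
from dataclasses import dataclass, field, fields
from datetime import datetime, timezone


@dataclass
class SnowflakeConfig:
    credential_id: str
    account: str
    # Marked secret: excluded from audit_context() and from repr().
    password: str = field(metadata={"secret": True}, repr=False)

    def audit_context(self):
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if not f.metadata.get("secret")
        }


def build_audit_event(event, catalog_name, config, scope, result_summary, duration_ms):
    context = config.audit_context()  # non-sensitive fields only
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    return {
        "timestamp": now.replace("+00:00", "Z"),
        "event": event,
        "catalog_name": catalog_name,
        "credential_id": context["credential_id"],
        "scope": scope,
        "result_summary": result_summary,
        "duration_ms": duration_ms,
    }


config = SnowflakeConfig("snowflake-prod", "acme-xy12345", "hunter2")
audit_event = build_audit_event(
    "catalog.list_source_tables", "snowflake", config,
    {"database": "BIZ_LAB", "schema": "SEEDED"}, {"table_count": 23}, 8421,
)
serialized = json.dumps(audit_event)
```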
Determinism guarantees
Catalog reads are deterministic — same catalog state, same scope, same JSON shape. The forge as a whole is near-deterministic:
- Cache hit: byte-identical output (the cache key includes capability_matrix, so flipping a flag like extended_thinking invalidates cleanly).
- Cache miss, temp=0: near-identical entity names, relationships, hash keys. Minor wording in description / ai_context.instructions may vary by 1-2 tokens.
- --deterministic flag: forces temp=0, seed=42 (where the provider supports it), turns the cache off, and emits a proof-of-determinism report.
The capability_matrix segment in the cache key (added in V2 polish, Gap 7.3) is what makes "flipping a capability flag invalidates the cache cleanly" actually true:
generate_cache_key(
    model="claude-sonnet-4-6",
    prompt=prompt_blob,
    params=stage_params,
    capability_matrix={"extended_thinking": True},  # 4th hash segment
)
Without this, two runs with identical model/prompt/params but different capability matrices would collide on the same cached response, leaking a "no-thinking" answer into a "with-thinking" run. The extra segment gives the two runs distinct cache keys.
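A four-segment key of this kind can be sketched with `hashlib`. The call signature matches the snippet above, but the body is an assumption: canonical JSON per segment, SHA-256, and a final hash over the joined segment digests are choices made for this example, not a description of the real function.

```python
import hashlib
import json


def _segment(value):
    # Canonical JSON so dict key order never changes the digest.
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def generate_cache_key(model, prompt, params, capability_matrix=None):
    segments = [
        _segment(model),
        _segment(prompt),
        _segment(params),
        _segment(capability_matrix or {}),  # 4th hash segment
    ]
    return hashlib.sha256("|".join(segments).encode("utf-8")).hexdigest()


base = dict(model="claude-sonnet-4-6", prompt="forge...", params={"temperature": 0})
key_plain = generate_cache_key(**base)
key_thinking = generate_cache_key(**base, capability_matrix={"extended_thinking": True})
```

Hashing each segment separately before joining keeps segment boundaries unambiguous, so content cannot shift from one segment into another and produce the same key.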
See also
- Catalogs index — user-facing walkthrough
- Credential resolver — security model details
- Cost tracking — V2 polish details
- MCP server — full MCP walkthrough