V1.5 Catalog Integration — Architecture Deep-Dive

This page is the technical companion to the catalogs index. The user-facing walkthrough lives there. This page documents why the architecture looks the way it does, what invariants it maintains across all seven catalogs, and where the extension points live for contributors.

Audience: contributors adding a new catalog adapter, operators debugging an unexpected forge result, security teams auditing the read-only contract.

What V1.5 ships

Seven catalog adapters, all behind one ABC, all callable through two surfaces:

                                ┌─ CLI: fluid forge data-model from-source
                                │
      ┌─────────────────────────┤
      │                         │
      │                         └─ MCP: forge_from_source tool
      │
      ▼
┌──────────────────────────────────────────────────────────────────┐
│   Catalog dispatch                                               │
│   _build_catalog_adapter / _SOURCE_ADAPTERS                      │
└────────────────────────────────┬─────────────────────────────────┘
                                 │
       ┌─────────┬───────────┬───┴───────┬──────────┬────────┬──────┐
       ▼         ▼           ▼           ▼          ▼        ▼      ▼
   Snowflake  Unity      BigQuery    Dataplex    Glue     DataHub  DMM
   adapter   adapter      adapter     adapter   adapter   adapter  adapter
       │         │           │           │          │        │      │
       └────┬────┴───────────┴───┬───────┴──────────┴────────┴──────┘
            ▼                    ▼
   CatalogAdapter ABC     CatalogTable, CatalogColumn,
   (4 abstract methods)   CatalogLineage, GlossaryTerm,
                          CatalogScope (Pydantic)
            │
            ▼
   LogicalAgent.from_catalog
   (translates CatalogTable[] → TableDefinition[] → LogicalDraft)
            │
            ▼
   Staged pipeline (Logical → Builder → Readme → Transformation → Validator)
            │
            ▼
   Fluid contract + .model.json sidecar + OSI v0.1.1 standalone

The four invariants

Every catalog adapter — current and future — must hold these:

1. Read-only, metadata only

No SELECT * against any data table. Adapters request metadata privileges (INFORMATION_SCHEMA-equivalent on each catalog) and nothing more. Tests pin this: every adapter's list_tables / get_table invocations on the stubbed SDK use only metadata APIs.
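One way such a test pin could look is sketched below. It is an illustration only: `RecordingCursor`, the toy `list_tables` function, and the regex allow-list are hypothetical stand-ins, not the project's actual test helpers, and a real check would be stricter than a single pattern match.

```python
import re

# Hypothetical allow-list: statements must reference a metadata view.
_METADATA_ONLY = re.compile(r"\bINFORMATION_SCHEMA\b", re.IGNORECASE)


class RecordingCursor:
    """Stub SDK cursor that records every statement instead of running it."""

    def __init__(self):
        self.statements = []

    def execute(self, sql, params=None):
        self.statements.append(sql)
        return []


def list_tables(cursor, schema):
    """Toy adapter method: reads table names from a metadata view only."""
    cursor.execute(
        "SELECT table_name FROM INFORMATION_SCHEMA.TABLES "
        "WHERE table_schema = %s",
        (schema,),
    )


def assert_metadata_only(statements):
    """Raise if any recorded statement does not target a metadata view."""
    offenders = [s for s in statements if not _METADATA_ONLY.search(s)]
    if offenders:
        raise AssertionError(f"non-metadata statements issued: {offenders}")
```

A test would run the adapter method against the recording stub and then call `assert_metadata_only(cursor.statements)`.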

2. Lazy SDK import

The adapter module imports cleanly even when its underlying SDK isn't installed. The actual SDK import happens on first method call and raises CatalogConfigError with the exact pip install suggestion if the dependency is missing. This keeps fluid --help fast (no 500ms boto3 import on cold start) and lets users pip install data-product-forge without pulling in every catalog SDK.

def _client(self):
    # Deferred import: the module itself stays importable without the SDK.
    try:
        import snowflake.connector  # type: ignore[import-not-found]
    except ImportError as exc:
        # Surface a typed config error carrying the exact install command.
        raise CatalogConfigError(
            "snowflake-connector-python missing. Install with: "
            'pip install "data-product-forge[snowflake]"',
            suggestions=['pip install "data-product-forge[snowflake]"'],
        ) from exc
    ...

3. Per-call client lifecycle

The adapter constructs and closes its SDK client once per logical operation. No long-lived connection pool spans tool calls inside fluid mcp serve. This defends against an upstream agent escalating "list one schema" into "dump everything" by replaying old credentials in a long-running session.
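The per-call lifecycle can be sketched with a context manager. This is a minimal illustration under assumptions: `FakeClient`, `per_call_client`, and the `resolve_credentials` callable are hypothetical names, and a real adapter would open the vendor SDK client here instead.

```python
from contextlib import contextmanager


class FakeClient:
    """Stand-in for a vendor SDK client; tracks whether it is still open."""

    def __init__(self, credentials):
        self.credentials = credentials
        self.closed = False

    def list_schemas(self):
        if self.closed:
            raise RuntimeError("client already closed")
        return ["SEEDED"]

    def close(self):
        self.closed = True


@contextmanager
def per_call_client(resolve_credentials):
    """Open a fresh client for one logical operation and always close it."""
    client = FakeClient(resolve_credentials())  # credentials resolved per call
    try:
        yield client
    finally:
        client.close()


def list_schemas(resolve_credentials):
    """One logical operation: nothing outlives the with-block."""
    with per_call_client(resolve_credentials) as client:
        return client.list_schemas()
```

Because credentials are resolved inside the context manager, two consecutive operations never share a client or a cached credential.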

4. Soft-fail on optional reads

Lineage / glossary / sensitivity-tag reads return empty results when the user lacks the privilege OR the API isn't enabled, instead of erroring. The forge isn't blocked by missing optional metadata — you get a working contract, just without the optional signal.
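A soft-fail wrapper in this spirit might look like the sketch below. The helper name `safe_metadata_call` comes from this page, but the signature, the `OptionalMetadataUnavailable` exception, and the logging behaviour are assumptions made for illustration.

```python
import logging

logger = logging.getLogger("catalog.softfail")


class OptionalMetadataUnavailable(Exception):
    """Hypothetical error: privilege missing or the API is not enabled."""


def safe_metadata_call(fetch, default, what="optional metadata"):
    """Run an optional read; return `default` when it is unavailable.

    Only "unavailable" errors are swallowed. Anything else still
    propagates, so genuine bugs are not hidden.
    """
    try:
        return fetch()
    except (OptionalMetadataUnavailable, PermissionError) as exc:
        logger.info("skipping %s: %s", what, exc)
        return default
```

A lineage read would then be wrapped as `safe_metadata_call(lambda: client.lineage(table), default=[])`, where `client.lineage` is again a placeholder.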

These four invariants are documented in fluid_build/copilot/catalog/_patterns.py and reused across all seven adapters.

The nine reusable patterns

_patterns.py exports three helpers and documents nine patterns:

Pattern	Helper	Used by
1. Soft-fail on optional reads	`safe_metadata_call`	All adapters (e.g. Snowflake PK/FK, Unity lineage, Dataplex glossary, DataHub lineage)
2. Identifier validation + quoting	`validate_and_quote_identifier`	Snowflake, BigQuery, all SQL-issuing adapters
3. Per-call client lifecycle	(built into adapter shape)	All adapters (e.g. DMM via httpx)
4. Lazy SDK import + typed config error	(built into adapter shape)	All adapters
5. `from_resolver` classmethod	(built into adapter shape)	All adapters
6. Audit context excludes secrets	`audit_context()` Pydantic excludes `SecretStr`	All adapters
7. Vendor error → typed error translation	`translate_permission_or_connection_error`	All adapters
8. Two-pass fetching (list → inspect)	(architectural pattern)	All adapters
9. No data values ever	(test-pinned across all adapters)	All adapters

A new adapter follows the same nine patterns. The Adding a Catalog Adapter section of CONTRIBUTING.md covers each in turn.
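As an illustration of pattern 2, identifier validation plus quoting could be sketched as follows. The helper name appears in the table above, but the accepted character set, the `quote` parameter, and the error type are assumptions, not the real helper's contract.

```python
import re

# Assumed grammar: letters, digits, underscore, dollar; no leading digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def validate_and_quote_identifier(name, quote='"'):
    """Reject anything that is not a plain identifier, then quote it.

    Validating first means the quote character can never appear inside
    the name, so the quoted result is safe to splice into metadata SQL.
    """
    if not isinstance(name, str) or not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return f"{quote}{name}{quote}"
```

A Snowflake-style caller would use the default double quote, while a BigQuery-style caller could pass a backtick as `quote`.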

Three-stage catalog metadata flow

Catalog signal isn't just an input to the Logical stage; it shapes every stage of the pipeline:

Stage 1 — Logical (Conceptual + IR + OSI)

Catalog field	Logical IR field
Table name + description	`Conceptual.entities[].name + .description`
Column name + description	`OSIDataset.fields[].name + .expression.description`
Primary key	`OSIDataset.primary_key[]`
Foreign key + lineage	`OSIRelationship[]` (deterministic)
Tags (`domain: party`)	matches `IndustryPack` skeleton tag-hints
Classifications (PII/PHI/PCI)	`OSI.custom_extensions[]`
Glossary terms / synonyms	`OSI.ai_context.synonyms` + `examples`

Entry point: LogicalAgent.from_catalog(...).
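The Stage 1 mapping for names, descriptions, and primary keys can be pictured with the sketch below. `TableStub`, `ColumnStub`, and `to_logical` are hypothetical stand-ins using plain dataclasses and dicts; the real models are Pydantic classes and the real translation lives behind LogicalAgent.from_catalog.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnStub:
    """Simplified stand-in for a catalog column."""
    name: str
    description: str = ""


@dataclass
class TableStub:
    """Simplified stand-in for a catalog table."""
    name: str
    description: str = ""
    columns: list = field(default_factory=list)
    primary_key: list = field(default_factory=list)


def to_logical(table):
    """Map one catalog table to a conceptual entity and a dataset dict."""
    entity = {"name": table.name, "description": table.description}
    dataset = {
        "name": table.name,
        "fields": [
            {"name": c.name, "expression": {"description": c.description}}
            for c in table.columns
        ],
        "primary_key": list(table.primary_key),
    }
    return entity, dataset
```

The point of the sketch is that these fields are copied deterministically from the catalog rather than generated.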

Stage 2 — Contract (BuilderAgent emits Fluid 0.7.2)

Catalog field	Fluid contract field
Table owner + steward	`metadata.owner.team` (system roles excluded), `metadata.steward`
Business domain	`metadata.domain`
Sensitivity classifications	`agentPolicy.sensitiveData[]`
Data residency	`metadata.sovereignty.jurisdiction`
Quality SLAs	`exposes[].qos`
Lineage chain	`metadata.lineage.upstream[]`
Certified marker	`metadata.certification`

The BuilderAgent prompt directive: every metadata field the catalog supplied must appear verbatim in the contract; do not re-invent.

Stage 3 — Transformation (TransformationAgent emits dbt builds[])

Catalog field	Transformation output
Partition keys	dbt `partition_by` config
Clustering keys	dbt `cluster_by` config
Quality rules	dbt `tests:` (`not_null`, `unique`, `accepted_values`)
Freshness SLA	dbt `freshness:` config
Lineage upstream	dbt `ref()` / `source()` references
Sensitivity classifications	dbt `meta:` block + column-level tagging

System roles never become owners

A specific design decision worth calling out: catalog adapters that surface a "creating role" or "table owner" field (Snowflake OWNER, Glue Owner, etc.) do not promote Snowflake-style system roles into metadata.owner.team.

_SYSTEM_ROLE_NAMES: frozenset[str] = frozenset({
    "ACCOUNTADMIN", "SYSADMIN", "SECURITYADMIN", "USERADMIN",
    "ORGADMIN", "PUBLIC", "ADMIN", "ADMINS", "ROOT", "DBO",
    "POSTGRES",
})

These names land in labels.catalogCreatingRoles (audit only) instead. The contract's owner field reflects the business team, not the role that ran the DDL — preventing every Snowflake table from being attributed to ACCOUNTADMIN.
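The routing rule can be sketched as a small function. `split_owner` is a hypothetical name, and the case-insensitive comparison is an assumption; only the role set itself is taken from this page.

```python
_SYSTEM_ROLE_NAMES = frozenset({
    "ACCOUNTADMIN", "SYSADMIN", "SECURITYADMIN", "USERADMIN",
    "ORGADMIN", "PUBLIC", "ADMIN", "ADMINS", "ROOT", "DBO",
    "POSTGRES",
})


def split_owner(catalog_owner):
    """Return (owner_team, creating_roles) for one catalog owner value.

    System roles go to the audit-only list; anything else is treated
    as a business team. Comparison is case-insensitive (an assumption).
    """
    if not catalog_owner:
        return None, []
    if catalog_owner.strip().upper() in _SYSTEM_ROLE_NAMES:
        return None, [catalog_owner]
    return catalog_owner, []
```

For example, `split_owner("ACCOUNTADMIN")` yields no owner team and one creating role, while `split_owner("party-data-team")` yields the team and an empty role list.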

Industry auto-detection

When a catalog scope is forged, the adapter's catalog tags are matched against INDUSTRY_DOMAIN_HINTS to auto-pick an industry pack. Today's hints:

Domain tag	Industry pack
`telco`, `telecommunications`, `cdr`, `network`	`telecommunications`
`healthcare`, `health`, `clinical`, `patient`, `phi`	`healthcare`
`finance`, `banking`, `fraud`, `transaction`, `pci`	`finance`
`retail`, `commerce`, `pos`, `merchandise`, `customer`	`retail`

Hints are matched per table; the industry pack with the most table-level hits across the scope becomes the chosen industry. Operators can always override with --industry.

The mapping is part of the public API (pinned in tests/test_public_api_stability.py) so external scripts can predict the chosen industry before the actual forge runs.
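A per-table match followed by a per-scope vote could be sketched like this. The dictionary mirrors the table above, but its shape, the `detect_industry` function, and the alphabetical tie-break are assumptions for illustration rather than the pinned public API.

```python
from collections import Counter

# Tag -> pack, transcribed from the hints table; the real shape may differ.
INDUSTRY_DOMAIN_HINTS = {
    **dict.fromkeys(["telco", "telecommunications", "cdr", "network"],
                    "telecommunications"),
    **dict.fromkeys(["healthcare", "health", "clinical", "patient", "phi"],
                    "healthcare"),
    **dict.fromkeys(["finance", "banking", "fraud", "transaction", "pci"],
                    "finance"),
    **dict.fromkeys(["retail", "commerce", "pos", "merchandise", "customer"],
                    "retail"),
}


def detect_industry(tags_per_table):
    """Pick the pack hit by the most tables; None when nothing matches."""
    votes = Counter()
    for tags in tags_per_table:
        # Each table votes at most once per pack, however many tags match.
        packs = {INDUSTRY_DOMAIN_HINTS[t.lower()]
                 for t in tags if t.lower() in INDUSTRY_DOMAIN_HINTS}
        votes.update(packs)
    if not votes:
        return None
    # Highest count wins; ties break alphabetically (an assumption).
    return min(votes, key=lambda pack: (-votes[pack], pack))
```

With tags `[["telco"], ["cdr", "customer"], ["pci"]]`, telecommunications is hit by two tables and wins over retail and finance with one each.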

Three-layer security model

V1.5 inherits the existing MCP server's three-layer access control and adds catalog-specific defenses:

Layer 1: Tool allow/deny list

The --allowed-tools / --denied-tools flags on fluid mcp serve control which tools an MCP client can call. Catalog tools are opt-out: removing forge_from_source (the only mutating catalog tool) leaves the read-only catalog tools available without ever risking writes.

Layer 2: Read-only mode

fluid mcp serve --read-only blocks every tool that mutates files or writes store namespaces. forge_from_source is blocked; list_source_tables / inspect_source_table / list_source_lineage / list_source_glossary / list_source_adapters all run.
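How layers 1 and 2 combine can be sketched as a single predicate. The function name, its parameters, and the rule that a deny entry beats an allow entry are assumptions; the page only states which tools each flag affects.

```python
# The only mutating catalog tool named on this page.
MUTATING_TOOLS = frozenset({"forge_from_source"})


def is_tool_callable(tool, allowed=None, denied=frozenset(), read_only=False):
    """Decide whether an MCP client may call `tool`.

    `allowed=None` means no allow list was given, so every tool that is
    not denied passes layer 1. Read-only mode then blocks mutating tools.
    """
    if tool in denied:
        return False
    if allowed is not None and tool not in allowed:
        return False
    if read_only and tool in MUTATING_TOOLS:
        return False
    return True
```

Under these assumptions, `is_tool_callable("list_source_tables", read_only=True)` passes while `is_tool_callable("forge_from_source", read_only=True)` does not.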

Layer 3: Sandboxes

--writable-paths pins the filesystem roots forge_from_source may write to; --writable-namespaces pins the conceptual store namespaces. Defaults are safe: writable_paths=cwd, writable_namespaces={history, audit}.
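A filesystem sandbox check of this kind is commonly written by resolving both paths and comparing ancestors. The sketch below uses only pathlib; `is_write_allowed` is a hypothetical name and does not claim to match the server's actual check.

```python
from pathlib import Path


def is_write_allowed(target, writable_paths):
    """True when `target` resolves to a writable root or a path inside one.

    Resolving first collapses `..` segments and symlinks, and comparing
    whole path components avoids the prefix trap where /srv/workspace
    would wrongly match a root of /srv/work.
    """
    resolved = Path(target).resolve()
    for root in writable_paths:
        root_resolved = Path(root).resolve()
        if resolved == root_resolved or root_resolved in resolved.parents:
            return True
    return False
```

With the default of the current working directory as the only root, a write to a relative `out/contract.yaml` passes and a write to `../elsewhere` does not.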

Catalog-specific defense: per-call credential lookup

The MCP server itself never holds catalog credentials. Each tool call passes credentials.credential_id — a string pointer into ~/.fluid/sources.yaml. The server resolves the pointer at call time:

1. Inline credentials (only via direct CLI; never over MCP wire).
2. OS keyring (macOS Keychain / Windows Credential Manager / Linux secret-service).
3. ~/.fluid/sources.yaml (non-sensitive fields only; secrets live in keyring).
4. Environment variables.
5. Cloud metadata service (opt-in via allow_metadata_service=true).
6. Fail-closed.

This defends against an upstream agent escalating "list tables" into "dump every secret" by replaying old credentials in a long-running MCP session.
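A fail-closed lookup chain over the sources listed above can be sketched as follows. `resolve_credential`, `CredentialNotFound`, and the dict-backed stores are hypothetical; a real chain would query the OS keyring, sources.yaml, and the environment instead.

```python
class CredentialNotFound(Exception):
    """Raised when no source in the chain can supply the credential."""


def resolve_credential(credential_id, sources):
    """Try each (name, lookup) source in order; fail closed if none match."""
    for name, lookup in sources:
        value = lookup(credential_id)
        if value is not None:
            return name, value
    # No fallback and no anonymous access: the call is refused.
    raise CredentialNotFound(
        f"no credential source could resolve {credential_id!r}"
    )


# Stand-in stores, used here only to make the sketch runnable.
keyring_store = {"snowflake-prod": {"password": "<from keyring>"}}
env_store = {"glue-dev": {"access_key": "<from env>"}}

chain = [("keyring", keyring_store.get), ("env", env_store.get)]
```

Because the chain runs on every tool call, a credential removed from its store stops resolving immediately, with no session restart needed.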

Audit trail

Every catalog tool call writes an audit event via copilot/store/audit_trail.py:

{
  "timestamp": "2026-04-25T14:33:21.123Z",
  "event": "catalog.list_source_tables",
  "catalog_name": "snowflake",
  "credential_id": "snowflake-prod",
  "scope": {"database": "BIZ_LAB", "schema": "SEEDED"},
  "result_summary": {"table_count": 23},
  "duration_ms": 8421
}

Credentials are scrubbed before write — audit_context() returns non-sensitive fields only. Query the trail with:

fluid memory show audit
fluid memory show audit --filter catalog.list_source_tables --window 24h
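The scrubbing idea can be shown with a standard-library analogue. The page says the real audit_context() relies on Pydantic excluding SecretStr fields; the `Secret` wrapper and `SnowflakeConfigStub` dataclass below are hypothetical substitutes chosen so the sketch has no third-party dependency.

```python
from dataclasses import dataclass, fields


class Secret:
    """Minimal stand-in for a secret-string wrapper type."""

    def __init__(self, value):
        self._value = value

    def get_secret_value(self):
        return self._value

    def __repr__(self):
        return "Secret('**********')"  # never echo the value


@dataclass
class SnowflakeConfigStub:
    account: str
    user: str
    password: Secret

    def audit_context(self):
        """Return only the fields that are safe to write to the trail."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if not isinstance(getattr(self, f.name), Secret)
        }
```

Filtering by field type rather than by field name means a newly added secret field is excluded without anyone having to update a deny-list.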

Determinism guarantees

Catalog reads are deterministic — same catalog state, same scope, same JSON shape. The forge as a whole is near-deterministic:

Cache hit — byte-identical output (cache key includes capability_matrix, so flipping a flag like extended_thinking invalidates cleanly).
Cache miss, temp=0 — near-identical entity names, relationships, hash keys. Minor wording in description / ai_context.instructions may vary by 1-2 tokens.
--deterministic flag — forces temp=0, seed=42 (where provider supports), cache off, emits proof-of-determinism report.

The capability_matrix segment in the cache key (added in V2 polish, Gap 7.3) is what makes "flipping a capability flag invalidates the cache cleanly" actually true:

generate_cache_key(
    model="claude-sonnet-4-6",
    prompt=prompt_blob,
    params=stage_params,
    capability_matrix={"extended_thinking": True},  # 4th hash segment
)

Without this, two runs with identical model/prompt/params but different capability matrices would collide on the same cached response, leaking a "no-thinking" answer into a "with-thinking" run. The extra segment gives the two runs distinct cache keys.
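A four-segment key of this shape could be built as in the sketch below. The function name and keyword arguments follow the call shown above, but the body (canonical JSON, length-prefixed SHA-256) is an assumption about one reasonable construction, not the project's implementation.

```python
import hashlib
import json


def _canonical(value):
    """Serialize a mapping so that key order never changes the hash."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), default=str)


def generate_cache_key(model, prompt, params, capability_matrix=None):
    """Hash four segments: model, prompt, params, capability matrix."""
    segments = [
        model,
        prompt,
        _canonical(params),
        _canonical(capability_matrix or {}),  # 4th hash segment
    ]
    digest = hashlib.sha256()
    for segment in segments:
        data = segment.encode("utf-8")
        # Length prefix keeps segment boundaries unambiguous.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

Two calls that differ only in `capability_matrix` then produce different keys, while reordering the keys inside `params` does not.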