Fluid Forge

V1.5 Catalog Integration — Architecture Deep-Dive

This page is the technical companion to the catalogs index. The user-facing walkthrough lives there. This page documents why the architecture looks the way it does, what invariants it maintains across all seven catalogs, and where the extension points live for contributors.

Audience: contributors adding a new catalog adapter, operators debugging an unexpected forge result, security teams auditing the read-only contract.

What V1.5 ships

Seven catalog adapters, all behind one ABC, all callable through two surfaces:

                                ┌─ CLI: fluid forge data-model from-source
                                │
      ┌─────────────────────────┤
      │                         │
      │                         └─ MCP: forge_from_source tool
      │
      ▼
┌──────────────────────────────────────────────────────────────────┐
│   Catalog dispatch                                               │
│   _build_catalog_adapter / _SOURCE_ADAPTERS                      │
└────────────────────────────────┬─────────────────────────────────┘
                                 │
       ┌─────────┬───────────┬───┴───────┬──────────┬────────┬──────┐
       ▼         ▼           ▼           ▼          ▼        ▼      ▼
   Snowflake  Unity      BigQuery    Dataplex    Glue     DataHub  DMM
   adapter   adapter      adapter     adapter   adapter   adapter  adapter
       │         │           │           │          │        │      │
       └────┬────┴───────────┴───┬───────┴──────────┴────────┴──────┘
            ▼                    ▼
   CatalogAdapter ABC     CatalogTable, CatalogColumn,
   (4 abstract methods)   CatalogLineage, GlossaryTerm,
                          CatalogScope (Pydantic)
            │
            ▼
   LogicalAgent.from_catalog
   (translates CatalogTable[] → TableDefinition[] → LogicalDraft)
            │
            ▼
   Staged pipeline (Logical → Builder → Readme → Transformation → Validator)
            │
            ▼
   Fluid contract + .model.json sidecar + OSI v0.1.1 standalone

The four invariants

Every catalog adapter — current and future — must hold these:

1. Read-only, metadata only

No SELECT * against any data table. Adapters request metadata privileges (INFORMATION_SCHEMA-equivalent on each catalog) and nothing more. Tests pin this: every adapter's list_tables / get_table invocations on the stubbed SDK use only metadata APIs.
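A test that pins this invariant could look roughly like the sketch below. It assumes a SQL-backed adapter and uses a hypothetical `RecordingCursor` stub and `metadata_only` check; the real test suite is not shown on this page, and a substring check like this is deliberately crude.

```python
class RecordingCursor:
    """Stub cursor that records SQL instead of running it."""

    def __init__(self):
        self.statements = []

    def execute(self, sql, params=None):
        self.statements.append(sql)
        return self


def list_tables(cursor, schema):
    # Hypothetical adapter method: reads table names from metadata views only.
    cursor.execute(
        "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = %s",
        (schema,),
    )


def metadata_only(statements):
    """True when every recorded statement targets INFORMATION_SCHEMA."""
    return all("INFORMATION_SCHEMA." in s.upper() for s in statements)
```

Running the adapter method against the stub and then asserting `metadata_only(cursor.statements)` fails the build as soon as someone adds a query against a data table.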

2. Lazy SDK import

The adapter module imports cleanly even when its underlying SDK isn't installed. The actual SDK import happens on first method call and raises CatalogConfigError with the exact pip install suggestion if the dependency is missing. This keeps fluid --help fast (no 500ms boto3 import on cold start) and lets users pip install data-product-forge without installing every catalog SDK.

def _client(self):
    try:
        import snowflake.connector  # type: ignore[import-not-found]
    except ImportError as exc:
        raise CatalogConfigError(
            "snowflake-connector-python missing. Install with: "
            'pip install "data-product-forge[snowflake]"',
            suggestions=['pip install "data-product-forge[snowflake]"'],
        ) from exc
    ...

3. Per-call client lifecycle

The adapter constructs and closes its SDK client per logical operation. No long-lived connection pool spans tool calls inside fluid mcp serve. This defends against an upstream agent escalating "list one schema" into "dump everything" by replaying old credentials in a long-running session.
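One way to get this lifecycle is a context manager that owns the client for exactly one operation. The sketch below is illustrative only: `FakeClient` and `SketchAdapter` are hypothetical stand-ins, not Fluid Forge classes.

```python
from contextlib import contextmanager


class FakeClient:
    """Stand-in for an SDK client; records whether it was closed."""

    def __init__(self):
        self.closed = False

    def list_tables(self, schema):
        return [f"{schema}.ORDERS", f"{schema}.CUSTOMERS"]

    def close(self):
        self.closed = True


class SketchAdapter:
    """Hypothetical adapter shape: one client per logical operation."""

    def __init__(self, client_factory):
        self._client_factory = client_factory
        self.clients_created = []  # kept only so the behaviour is observable

    @contextmanager
    def _client(self):
        client = self._client_factory()
        self.clients_created.append(client)
        try:
            yield client
        finally:
            client.close()  # closed even if the operation raises

    def list_tables(self, schema):
        with self._client() as client:
            return client.list_tables(schema)
```

Calling `list_tables` twice creates and closes two separate clients, so nothing authenticated survives between tool calls.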

4. Soft-fail on optional reads

Lineage / glossary / sensitivity-tag reads return empty results when the user lacks the privilege OR the API isn't enabled, instead of erroring. The forge isn't blocked by missing optional metadata — you get a working contract, just without the optional signal.
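A soft-fail wrapper can be sketched as follows. The name `safe_metadata_call` matches the helper this page attributes to `_patterns.py`, but the signature, the caught exception types, and `CatalogPermissionError` are assumptions made for illustration.

```python
import logging

logger = logging.getLogger("catalog.sketch")


class CatalogPermissionError(Exception):
    """Stand-in for a typed 'missing privilege' error."""


def safe_metadata_call(fn, *args, default=None, **kwargs):
    """Run an optional metadata read; return `default` instead of raising.

    Only the two "optional signal unavailable" cases are swallowed here;
    anything else still propagates.
    """
    try:
        return fn(*args, **kwargs)
    except (CatalogPermissionError, NotImplementedError) as exc:
        logger.info("optional read skipped: %s", exc)
        return default


def read_lineage(table):
    # Simulates a catalog where the caller lacks the lineage privilege.
    raise CatalogPermissionError(f"no lineage privilege on {table}")
```

With this shape, `safe_metadata_call(read_lineage, "ORDERS", default=[])` yields an empty list and the forge continues without lineage.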

These four invariants are documented in fluid_build/copilot/catalog/_patterns.py and reused across all seven adapters.

The nine reusable patterns

_patterns.py exports three helpers and documents nine patterns:

| Pattern | Helper | Used by |
|---|---|---|
| 1. Soft-fail on optional reads | safe_metadata_call | Snowflake (PK/FK), Unity (lineage), Dataplex (glossary), DataHub (lineage), all |
| 2. Identifier validation + quoting | validate_and_quote_identifier | Snowflake, BigQuery, all SQL-issuing adapters |
| 3. Per-call client lifecycle | (built into adapter shape) | DMM (httpx), all |
| 4. Lazy SDK import + typed config error | (built into adapter shape) | All adapters |
| 5. from_resolver classmethod | (built into adapter shape) | All adapters |
| 6. Audit context excludes secrets | audit_context() Pydantic excludes SecretStr | All adapters |
| 7. Vendor error → typed error translation | translate_permission_or_connection_error | All adapters |
| 8. Two-pass fetching (list → inspect) | (architectural pattern) | All adapters |
| 9. No data values ever | (test-pinned across all adapters) | All adapters |

A new adapter follows the same nine patterns. CONTRIBUTING.md → Adding a Catalog Adapter covers each one in turn.
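Pattern 2 (identifier validation + quoting) is the one most easily shown in isolation. The sketch below is an assumption about how such a helper could behave, not the real `validate_and_quote_identifier`: the accepted character set is a guess, and it uses standard-SQL double quotes, whereas BigQuery quotes identifiers with backticks.

```python
import re

# Conservative allow-list: letters, digits, underscore, dollar; no leading digit.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def validate_and_quote_identifier(name):
    """Reject anything that is not a plain identifier, then double-quote it."""
    if not isinstance(name, str) or not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return f'"{name}"'
```

Validating before quoting means an input such as `x"; DROP TABLE t; --` is rejected outright rather than escaped.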

Three-stage catalog metadata flow

Catalog signal isn't just an input to the Logical stage; it shapes every stage of the pipeline:

Stage 1 — Logical (Conceptual + IR + OSI)

| Catalog field | Logical IR field |
|---|---|
| Table name + description | Conceptual.entities[].name + .description |
| Column name + description | OSIDataset.fields[].name + .expression.description |
| Primary key | OSIDataset.primary_key[] |
| Foreign key + lineage | OSIRelationship[] (deterministic) |
| Tags (domain: party) | matches IndustryPack skeleton tag-hints |
| Classifications (PII/PHI/PCI) | OSI.custom_extensions[] |
| Glossary terms / synonyms | OSI.ai_context.synonyms + examples |

Entry point: LogicalAgent.from_catalog(...).
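The first three rows of the mapping can be sketched with plain dataclasses. The real `CatalogTable` and `CatalogColumn` are Pydantic models whose fields are not listed on this page, so the attribute names below and the `to_logical` function are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogColumn:
    name: str
    description: str = ""


@dataclass
class CatalogTable:
    name: str
    description: str = ""
    columns: list = field(default_factory=list)
    primary_key: list = field(default_factory=list)


def to_logical(table):
    """Map one catalog table to a conceptual entity and an OSI-style dataset."""
    entity = {"name": table.name, "description": table.description}
    dataset = {
        "name": table.name,
        "fields": [
            {"name": c.name, "expression": {"description": c.description}}
            for c in table.columns
        ],
        "primary_key": list(table.primary_key),
    }
    return entity, dataset
```

The translation is pure data shuffling with no model call involved.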

Stage 2 — Contract (BuilderAgent emits Fluid 0.7.2)

| Catalog field | Fluid contract field |
|---|---|
| Table owner + steward | metadata.owner.team (system roles excluded), metadata.steward |
| Business domain | metadata.domain |
| Sensitivity classifications | agentPolicy.sensitiveData[] |
| Data residency | metadata.sovereignty.jurisdiction |
| Quality SLAs | exposes[].qos |
| Lineage chain | metadata.lineage.upstream[] |
| Certified marker | metadata.certification |

The BuilderAgent prompt directive: every metadata field the catalog supplied must appear verbatim in the contract; do not re-invent.

Stage 3 — Transformation (TransformationAgent emits dbt builds[])

| Catalog field | Transformation output |
|---|---|
| Partition keys | dbt partition_by config |
| Clustering keys | dbt cluster_by config |
| Quality rules | dbt tests: (not_null, unique, accepted_values) |
| Freshness SLA | dbt freshness: config |
| Lineage upstream | dbt ref() / source() references |
| Sensitivity classifications | dbt meta: block + column-level tagging |

System roles never become owners

A specific design decision worth calling out: catalog adapters that surface a "creating role" or "table owner" field (Snowflake OWNER, Glue Owner, etc.) do not promote Snowflake-style system roles into metadata.owner.team.

_SYSTEM_ROLE_NAMES: frozenset[str] = frozenset({
    "ACCOUNTADMIN", "SYSADMIN", "SECURITYADMIN", "USERADMIN",
    "ORGADMIN", "PUBLIC", "ADMIN", "ADMINS", "ROOT", "DBO",
    "POSTGRES",
})

These names land in labels.catalogCreatingRoles (audit only) instead. The contract's owner field reflects the business team, not the role that ran the DDL — preventing every Snowflake table from being attributed to ACCOUNTADMIN.
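The routing rule can be sketched as a small function. `assign_owner` is a hypothetical name, and the case-insensitive comparison is an assumption; the set itself is copied from the snippet above.

```python
_SYSTEM_ROLE_NAMES = frozenset({
    "ACCOUNTADMIN", "SYSADMIN", "SECURITYADMIN", "USERADMIN",
    "ORGADMIN", "PUBLIC", "ADMIN", "ADMINS", "ROOT", "DBO",
    "POSTGRES",
})


def assign_owner(contract, catalog_owner):
    """Route a catalog owner value to the owner field or the audit label."""
    if not catalog_owner:
        return contract
    if catalog_owner.upper() in _SYSTEM_ROLE_NAMES:
        # System role: keep it for audit, never as the business owner.
        labels = contract.setdefault("labels", {})
        labels.setdefault("catalogCreatingRoles", []).append(catalog_owner)
    else:
        contract.setdefault("metadata", {}).setdefault("owner", {})["team"] = catalog_owner
    return contract
```

A table owned by `ACCOUNTADMIN` therefore ends up with no `metadata.owner.team` at all, which leaves the field free for a real team name.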

Industry auto-detection

When a catalog scope is forged, the adapter's catalog tags are matched against INDUSTRY_DOMAIN_HINTS to auto-pick an industry pack. Today's hints:

| Domain tag | Industry pack |
|---|---|
| telco, telecommunications, cdr, network | telecommunications |
| healthcare, health, clinical, patient, phi | healthcare |
| finance, banking, fraud, transaction, pci | finance |
| retail, commerce, pos, merchandise, customer | retail |

Hints are matched per table; the industry that matches the most tables in the scope becomes the chosen industry. Operators can always override with --industry.

The mapping is part of the public API (pinned in tests/test_public_api_stability.py) so external scripts can predict the chosen industry before the actual forge runs.
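An external script could predict the outcome with a per-table vote like the sketch below. The shape of the real `INDUSTRY_DOMAIN_HINTS` is not shown on this page, so `HINTS_SKETCH` is a hypothetical re-encoding of the table above, and the tie-breaking and one-vote-per-table rules are assumptions.

```python
from collections import Counter

# Hypothetical re-encoding of the hints table: pack -> tags.
HINTS_SKETCH = {
    "telecommunications": {"telco", "telecommunications", "cdr", "network"},
    "healthcare": {"healthcare", "health", "clinical", "patient", "phi"},
    "finance": {"finance", "banking", "fraud", "transaction", "pci"},
    "retail": {"retail", "commerce", "pos", "merchandise", "customer"},
}


def detect_industry(tags_per_table, override=None):
    """Pick the pack matched by the most tables; an explicit override wins."""
    if override:
        return override
    votes = Counter()
    for tags in tags_per_table:
        normalized = {t.lower() for t in tags}
        for industry, hints in HINTS_SKETCH.items():
            if normalized & hints:
                votes[industry] += 1
                break  # assumption: each table casts at most one vote
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

For a scope whose tables carry the tags `["pci", "gold"]`, `["fraud"]`, and `["customer"]`, finance collects two votes against one for retail.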

Three-layer security model

V1.5 inherits the existing MCP server's three-layer access control and adds catalog-specific defenses:

Layer 1: Tool allow/deny list

The --allowed-tools / --denied-tools flags on fluid mcp serve control which tools an MCP client can call. Catalog tools are opt-out: removing forge_from_source (the only mutating catalog tool) leaves the read-only catalog tools available without ever risking writes.

Layer 2: Read-only mode

fluid mcp serve --read-only blocks every tool that mutates files or writes store namespaces. forge_from_source is blocked; list_source_tables / inspect_source_table / list_source_lineage / list_source_glossary / list_source_adapters all run.
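Taken together, Layers 1 and 2 behave like a gate evaluated before a tool runs. The sketch below is a hypothetical model of that gate, not the server's code; in particular, treating --allowed-tools as a strict allow-list and checking the deny list first are assumptions.

```python
# Per this page, forge_from_source is the only mutating catalog tool.
MUTATING_TOOLS = frozenset({"forge_from_source"})


def is_tool_permitted(tool, read_only=False, allowed=None, denied=None):
    """Return True when a tool call passes the allow/deny and read-only layers."""
    if denied and tool in denied:
        return False
    if allowed is not None and tool not in allowed:
        return False
    if read_only and tool in MUTATING_TOOLS:
        return False
    return True
```

Under `read_only=True`, `list_source_tables` passes and `forge_from_source` is refused, matching the behaviour described above.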

Layer 3: Sandboxes

--writable-paths pins the filesystem roots forge_from_source may write to; --writable-namespaces pins the conceptual store namespaces. Defaults are safe: writable_paths=cwd, writable_namespaces={history, audit}.
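A filesystem sandbox check of this kind usually resolves both paths before comparing them, so that `..` segments and symlinks cannot step outside a root. A minimal sketch, with `is_write_allowed` as a hypothetical name:

```python
from pathlib import Path


def is_write_allowed(target, writable_paths):
    """True when the resolved target is a writable root or lies beneath one."""
    resolved = Path(target).resolve()
    for root in writable_paths:
        root_resolved = Path(root).resolve()
        if resolved == root_resolved or root_resolved in resolved.parents:
            return True
    return False
```

Comparing resolved paths rather than string prefixes also avoids the classic mistake of treating `/data/out-evil` as being inside `/data/out`.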

Catalog-specific defense: per-call credential lookup

The MCP server itself never holds catalog credentials. Each tool call passes credentials.credential_id, a string pointer into ~/.fluid/sources.yaml. The server resolves the pointer at call time, trying these sources in order:

  1. Inline credentials (only via direct CLI; never over MCP wire).
  2. OS keyring (macOS Keychain / Windows Credential Manager / Linux secret-service).
  3. ~/.fluid/sources.yaml (non-sensitive fields only — secrets live in keyring).
  4. Environment variables.
  5. Cloud metadata service (opt-in via allow_metadata_service=true).
  6. Fail-closed.

This defends against an upstream agent escalating "list tables" into "dump every secret" by replaying old credentials in a long-running MCP session.
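The resolution chain can be modelled as an ordered list of lookups that ends in an exception. Everything in the sketch below is hypothetical, including `resolve_credential`, `CredentialNotFound`, and the `FLUID_CRED_<ID>` environment variable naming; the real resolver is described on the Credential resolver page.

```python
import os


class CredentialNotFound(Exception):
    """Raised when every source in the chain comes up empty (fail-closed)."""


def resolve_credential(credential_id, sources):
    """Try each (name, lookup) pair in order; raise if none yields a value."""
    for name, lookup in sources:
        value = lookup(credential_id)
        if value is not None:
            return name, value
    raise CredentialNotFound(f"no source could resolve {credential_id!r}")


def env_lookup(credential_id):
    # Hypothetical naming scheme: FLUID_CRED_<ID>, upper-cased, dashes to underscores.
    key = "FLUID_CRED_" + credential_id.upper().replace("-", "_")
    return os.environ.get(key)
```

A caller would pass something like `[("keyring", keyring_lookup), ("env", env_lookup)]` on every tool call, so the resolved secret never outlives that call.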

Audit trail

Every catalog tool call writes an audit event via copilot/store/audit_trail.py:

{
  "timestamp": "2026-04-25T14:33:21.123Z",
  "event": "catalog.list_source_tables",
  "catalog_name": "snowflake",
  "credential_id": "snowflake-prod",
  "scope": {"database": "BIZ_LAB", "schema": "SEEDED"},
  "result_summary": {"table_count": 23},
  "duration_ms": 8421
}

Credentials are scrubbed before write — audit_context() returns non-sensitive fields only. Query the trail with:

fluid memory show audit
fluid memory show audit --filter catalog.list_source_tables --window 24h
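The scrubbing step can be illustrated without Pydantic by marking secret fields and filtering on that marker. The real adapters rely on Pydantic and SecretStr according to the patterns table; `SnowflakeConfigSketch` and its field names are hypothetical stand-ins.

```python
import json
from dataclasses import dataclass, field, fields


@dataclass
class SnowflakeConfigSketch:
    account: str
    user: str
    # Marked secret: hidden from repr() and excluded from the audit context.
    password: str = field(repr=False, metadata={"secret": True})

    def audit_context(self):
        """Return only the fields that are safe to write to the audit trail."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if not f.metadata.get("secret")
        }
```

Serialising `audit_context()` with `json.dumps` then produces an event fragment that cannot contain the password, whatever else the event includes.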

Determinism guarantees

Catalog reads are deterministic — same catalog state, same scope, same JSON shape. The forge as a whole is near-deterministic:

  1. Cache hit — byte-identical output (cache key includes capability_matrix, so flipping a flag like extended_thinking invalidates cleanly).
  2. Cache miss, temp=0 — near-identical entity names, relationships, hash keys. Minor wording in description / ai_context.instructions may vary by 1-2 tokens.
  3. --deterministic flag — forces temp=0, seed=42 (where provider supports), cache off, emits proof-of-determinism report.

The capability_matrix segment in the cache key (added in V2 polish, Gap 7.3) is what makes "flipping a capability flag invalidates the cache cleanly" actually true:

generate_cache_key(
    model="claude-sonnet-4-6",
    prompt=prompt_blob,
    params=stage_params,
    capability_matrix={"extended_thinking": True},  # 4th hash segment
)

Without this, two runs with identical model/prompt/params but different capability matrices would collide on the same cached response, leaking a "no-thinking" answer into a "with-thinking" run. The extra segment gives the two runs distinct cache keys.
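A four-segment key of this kind can be sketched with hashlib. The internals of the real `generate_cache_key` are not shown on this page; the canonical-JSON encoding, the SHA-256 choice, and the segment separator below are assumptions.

```python
import hashlib
import json


def _segment(value):
    """Hash one input canonically so dict key order does not matter."""
    blob = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def generate_cache_key(model, prompt, params, capability_matrix=None):
    """Combine four independently hashed segments into one cache key."""
    segments = [
        _segment(model),
        _segment(prompt),
        _segment(params),
        _segment(capability_matrix or {}),  # 4th segment
    ]
    return hashlib.sha256(":".join(segments).encode("utf-8")).hexdigest()
```

Flipping `extended_thinking` changes only the fourth segment, which is enough to change the final digest and force a cache miss.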

See also

  • Catalogs index — user-facing walkthrough
  • Credential resolver — security model details
  • Cost tracking — V2 polish details
  • MCP server — full MCP walkthrough
Last Updated: 5/17/26, 6:10 PM
Contributors: fas89, Claude Opus 4.7, Claude Opus 4.7 (1M context)