Task: Debug a failed pipeline run

It's 3am. PagerDuty fired. gold.finance.customer_360_v1 missed its 1-hour freshness SLA. You have 90 seconds to figure out where it broke, why it broke, what changed, fix it, and ship.

This is what fluid runs is built for. Three commands, one fix, one ship.

The 90-second flow

fluid runs status gold.finance.customer_360_v1
fluid runs logs gold.finance.customer_360_v1 --component dlq --run-id <first-fail-run>
fluid runs diff gold.finance.customer_360_v1 --build customer_metrics --run-a <last-ok-run> --run-b <first-fail-run>
# ...edit one line in contract.fluid.yaml...
fluid ship contract.fluid.yaml --strict --env prod --yes

A frame-perfect cast of this exact flow is in the day2-ops demo — bookmark it.

Step 1 — `runs status` (where)

fluid runs status gold.finance.customer_360_v1

Shows recent runs of this product:

run-id      ts          duration  status   stage
─────────  ──────────  ────────  ───────  ──────
r-2a4f8c3  03:01 AM    38 s      FAIL     apply
r-2a4f8c2  02:01 AM    41 s      FAIL     apply
r-2a4f8c1  01:01 AM    37 s      FAIL     apply
r-2a4f8c0  12:01 AM    4.2 m     OK
─────────────────────────────────────────────────
3 consecutive failures. First failure: r-2a4f8c1 (01:01 AM)
⚠ Stage: apply — DDL succeeded, build failed

What you learned in 5 seconds:

3 consecutive failures (not a transient fluke)
First failure was at 01:01 AM (so the change came in around then)
Stage is apply — the build inside apply is what's failing, not the IAM or schema layer

runs status shows the 5 most recent runs by default — pass --last 50 if you need more history, or --build <id> to scope to one build.

Step 2 — `runs logs --component dlq` (why)

fluid runs logs gold.finance.customer_360_v1 --component dlq --run-id r-2a4f8c1

The dlq component holds quarantined batches — rows that failed quality gates. --run-id pins the fetch to a specific run; omit it and runs logs reads the most recent run for the build:

01:01:23  build  ERROR  CHECK constraint failed:
01:01:23           arpu_30d_eur expected NOT NULL,
01:01:23           got 47 nulls in 12,408 rows
01:01:23  build  ERROR  Quarantined to dlq:
01:01:23           s3://forge-runtime/dlq/r-2a4f8c1/...
01:01:23  apply  FAIL   Hard gate: dq.rules NOT_NULL violated

Now you know:

The failing rule is arpu_30d_eur NOT_NULL
47 of 12,408 rows are violating
The data is preserved in S3 DLQ for re-processing

Other components you can ask about: --component build, --component infra, --component server, --component worker. Default is --component build. Add --grep <pattern> to filter log lines, or --limit <n> to cap how many are returned (default 1000).

Step 3 — `runs diff` (what)

fluid runs diff gold.finance.customer_360_v1 --build customer_metrics --run-a r-2a4f8c0 --run-b r-2a4f8c1

Compares the last successful run to the first failed one — what changed between the two:

sources:
  + eu_signups_q4     (new region added 12h before first fail)
metrics:
    ~ arpu_30d_eur    now sees < 30 day customers
─────────────────────────────────────────────────
✓ Drift surface: 1 source added · 1 metric scope changed
ℹ dq.rules unchanged — they're correct, but assume 30 days of data

Now you have the full picture:

A new EU region was added 12h before the first failure
The metric's scope expanded to include customers younger than 30 days
The NOT_NULL rule is correct, but the data lifecycle changed

The fix is not removing the rule — it's making the rule respect the lifecycle.

Step 4 — fix (build SQL + rule threshold)

The schema's dq.rule shape doesn't carry a where: clause — instead, push the lifecycle logic into the SQL build (where it belongs) and relax the rule's threshold to acknowledge that some partial-window rows will be NULL by design.

4a. Update the build's SQL to emit NULL for customers younger than 30 days:

# contract.fluid.yaml
builds:
  - id: customer_metrics
    pattern: embedded-logic
    engine: sql
    properties:
      sql: |
        SELECT
          customer_id,
          customer_age_days,
          -- Only emit arpu_30d_eur once the customer has 30 days of history
          CASE
            WHEN customer_age_days >= 30 THEN COALESCE(arpu_30d_eur_raw, 0)
            ELSE NULL
          END AS arpu_30d_eur
        FROM raw.customers c
        LEFT JOIN raw.transactions t USING (customer_id)

4b. Relax the rule's threshold so partial-window rows don't fail the gate:

exposes:
  - exposeId: customer_360_table
    contract:
      dq:
        rules:
          - id: arpu_30d_eur_completeness
            type: completeness
            selector: arpu_30d_eur
            # threshold: 1.0  ← the bad version (failed on every <30-day customer)
            threshold: 0.85    # ← new: 85% of all rows have non-null arpu
            operator: ">="
            severity: error
            description: "arpu_30d_eur is intentionally NULL for customers younger than 30 days; 85% threshold accommodates the partial-window cohort"

fluid validate confirms the rule still parses:

fluid validate contract.fluid.yaml --strict
# ✓ Schema 0.7.3 — passed
# ✓ dq.rules — 8 rules, 1 changed, no breaking moves
# ✓ Contract validation passed (strict)

Step 5 — `ship` (apply + verify + drain DLQ + restore SLA)

fluid ship contract.fluid.yaml --strict --env prod --yes

ship is the canonical "I'm fixing an incident, do all the right things" command:

⏳ Validating contract... ✓
⏳ Rendering plan... ✓ (plan checksum: a4f8c4...)
⏳ Applying...
✓ BigQuery DDL applied (no destructive changes)
⏳ Re-running quarantined batch from r-2a4f8c1 dlq...
✓ Recovered 12,361 rows (47 still in dlq, fallback applied)
✓ Freshness SLA restored: last successful run = 12 s ago
✓ Ship complete in 87 seconds — incident closed

What ship did, in order (per the canonical 4-stage chain — see fluid ship):

validate (schema check)
bundle (skipped here via --skip-bundle if you don't need a snapshot)
plan (deterministic, plan-bound)
apply (idempotent re-application of the contract)

After apply, the runtime drains the DLQ from the failed run and re-runs verify against the deployed state. Audit records ship to the platform's native audit channel (BigQuery audit log / Snowflake ACCESS_HISTORY / CloudTrail) with the run ID, the contract checksum, and the deployed bindings.

For SOX-grade change tracking, commit the contract change behind a PR — the merged commit is the audit record. git log contract.fluid.yaml is the change history; fluid runs status is the runtime evidence.

What about the 47 still in DLQ?

Those rows had customer_age_days >= 30 AND null arpu_30d_eur_raw — a real data quality issue (mature customer, missing transactions). The new build emits 0 only via COALESCE(arpu_30d_eur_raw, 0), but only when there's no transactions row at all; if the row exists with NULL already, the COALESCE returns 0 and the rule passes. The 47 are likely transactions-missing-for-the-customer cases — fix upstream or accept as a known quality miss.

Run:

fluid runs logs gold.finance.customer_360_v1 --component dlq --run-id r-<ship-run-id>

…to see them, then either fix upstream or accept them as a known quality miss.

What you DIDN'T have to do

Open the BigQuery web console and try to figure out what changed
Diff Terraform state against actual deployed state
Translate the schema-validator error into operator language
Write a one-off SQL script to manually patch the 12,361 row
Update three different tools with the same fix
Wake up your platform engineer

Common patterns this enables

Pre-merge CI: fluid runs status <product-id> --last 1 --json in CI catches a flaky build before it merges
Weekly health audit: fluid runs diff <product-id> --build <id> --run-a <last-week-run> --run-b <today-run> for every product surfaces drift early
Post-incident review: fluid runs logs <product-id> --run-id <run-id> --component build > incident.log is the audit artifact