End-to-End Data Vault on Snowflake: dbt, Openflow, Data Quality & Governance

Most Data Vault writing stops at modeling — hubs, links, satellites, hash keys. That's the interesting part intellectually, but it's maybe a third of what it takes to run a Data Vault in production. The other two-thirds are the parts nobody demos: how data gets in, how you know it's correct, and how you govern it when 40 sources and a dozen teams depend on it. This article is about the whole pipeline — ingestion to governed marts — on Snowflake, with dbt orchestrating the transformations and Openflow handling ingestion.

If you want the modeling fundamentals — the hash-key debate, satellite variants, PIT/bridge tables — I covered those in Building a Data Vault 2.0 on Snowflake with dbt. This piece assumes you know the entities and focuses on making the pipeline operable: end-to-end flow, data quality you can prove, and governance that's enforced rather than documented.

What's new since the modeling article: Snowflake Openflow went GA (fully-managed Snowflake Deployments, November 2025), giving a native managed-ingestion layer that fits Data Vault's multi-source reality; and Snowflake's Data Metric Functions (DMFs) plus the Horizon Catalog turn data quality and governance into platform features rather than bolt-ons. This article puts those together with dbt into one operable pipeline.

The End-to-End Picture

graph LR
    subgraph Sources["Many sources"]
        SAAS["SaaS apps\n(Salesforce, etc.)"]
        DB["Operational DBs\n(CDC)"]
        FILES["Files / unstructured"]
    end
    subgraph Ingest["① Ingestion — Openflow"]
        OF["Openflow (NiFi)\nconnectors + CDC\nmanaged on SPCS"]
    end
    subgraph Snowflake["Snowflake"]
        LAND["Landing / source tables\n(VARIANT-friendly, raw)"]
        STG["② Staging (dbt)\nhash keys + hashdiff\nload_dts, rec_src"]
        RV["③ Raw Vault (dbt)\nHubs · Links · Sats\ninsert-only, source-faithful"]
        BV["④ Business Vault (dbt)\nderived sats, PITs, bridges"]
        IM["⑤ Information Marts (dbt)\nstar schemas / secure views"]
    end
    DQ["DMFs + dbt tests\n(quality at every layer)"]
    GOV["Horizon Catalog\ntags · masking · row access · lineage"]

    SAAS --> OF
    DB --> OF
    FILES --> OF
    OF --> LAND --> STG --> RV --> BV --> IM
    DQ -.monitors.-> RV
    DQ -.monitors.-> IM
    GOV -.governs.-> RV
    GOV -.governs.-> IM

The full pipeline. Openflow lands raw data from many sources; dbt transforms it through staging, raw vault, business vault, and marts; Data Metric Functions and dbt tests enforce quality at each layer; and the Horizon Catalog applies tags, masking, row-access policies, and lineage across the whole thing. The Data Vault is the spine; ingestion, quality, and governance are the parts that make it production-grade.

① Ingestion with Openflow

Data Vault is built for many sources, and ingestion is where multi-source pipelines usually rot — a tangle of bespoke connectors, cron jobs, and CDC scripts each failing in its own way. Openflow (Snowflake's managed, Apache NiFi-based integration service) consolidates that. Running as a fully-managed Snowflake Deployment on Snowpark Container Services, it offers hundreds of processors and prebuilt connectors for SaaS apps, databases (CDC), streaming, and unstructured content, all landing into Snowflake under one governed, observable service.

For a Data Vault specifically, Openflow does three valuable things:

Lands source data faithfully. The raw vault's whole value is being source-faithful, so the ingestion layer must not transform or clean — it should land data as the source delivered it, stamped with arrival metadata. Openflow flows do exactly this: capture and land, leave interpretation to dbt downstream.
Handles CDC for operational sources. Change data capture from operational databases maps perfectly to Data Vault's insert-only, history-preserving model — each change becomes a new satellite version. Openflow's CDC connectors remove the need to operate Debezium/Kafka Connect yourself.
Brings unstructured data in for the increasingly common case where the vault feeds downstream AI — documents and multimodal content landed alongside structured records.

Keep the boundary clean: Openflow lands, dbt transforms. The single most important discipline at the ingestion boundary is to resist doing "just a little" cleaning in the flow. Every transformation in ingestion is a transformation you can't audit in the vault. Land raw (a VARIANT column for semi-structured payloads works well), stamp load_dts and rec_src, and let staging — version-controlled dbt SQL — do all interpretation. This keeps the raw vault provably faithful to source.

② → ⑤ Transformation with dbt

dbt owns everything from landing to marts. The layering mirrors the standard Data Vault architecture, and dbt's DAG expresses the dependencies and parallelism for free (see the modeling article for the entity-level detail and automateDV macros). The key operational properties:

Staging computes hash keys and hashdiffs and stamps metadata — the one place real SQL lives and the one place hashing rules are enforced centrally.
Raw Vault models are insert-only incrementals generated by automateDV; no business logic, ever.
Business Vault applies derivations and adds query-acceleration structures (PITs, bridges) — built reactively, when query patterns justify them.
Information Marts are the consumable star-schemas/secure-views. This is the only layer consumers and BI tools touch.

Because the raw-vault loads have no cross-dependencies, they parallelize across dbt threads against Snowflake's insert-friendly, immutable micro-partition storage (the reason that fit is so good is covered in Snowflake internals). Separate warehouses per layer keep loading, transformation, and consumption workloads isolated and individually cost-attributable.

③ Data Quality: Prove It, Don't Hope It

A Data Vault accumulates history forever, which means quality problems also accumulate forever if you don't catch them. Quality in this stack comes from two complementary layers — dbt tests at build time and Snowflake Data Metric Functions at rest.

dbt tests — build-time, Data-Vault-specific

Generic not_null/unique tests miss the failures that actually matter in a vault. Test the grain and the keys:

Hub uniqueness — exactly one row per business key (and per hash key). A duplicate here means inconsistent hashing.
Link grain — uniqueness on the link hash; no accidental fan-out.
Satellite integrity — every satellite key resolves to a hub/link (relationship tests), and hashdiff behavior is correct (no duplicate consecutive versions with identical hashdiffs).
Referential consistency across the vault — no orphan satellites, no links pointing at absent hubs.

# models/raw_vault/_raw_vault.yml
models:
  - name: hub_customer
    columns:
      - name: hk_customer
        tests: [unique, not_null]        # one row per business key
      - name: customer_id
        tests: [not_null]
  - name: sat_customer_details
    tests:
      # satellite keys must exist in the hub
      - relationships:
          to: ref('hub_customer')
          field: hk_customer
          column_name: hk_customer
    columns:
      - name: hashdiff
        tests: [not_null]

Data Metric Functions — at-rest, continuous

dbt tests fire when models build. But data also degrades between builds, and consumers care about the marts, not your DAG. Snowflake's Data Metric Functions (DMFs) monitor quality continuously on the tables themselves and surface results through Horizon. Built-in system DMFs cover the common dimensions — freshness, volume, uniqueness, null counts, accepted values — and you can write custom DMFs for business-specific rules.

-- Continuously monitor the customer mart for freshness and uniqueness.
-- Schedule the metric to evaluate when the table changes.
ALTER TABLE marts.dim_customer
  SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';

-- Built-in: how many NULL emails are slipping through?
ALTER TABLE marts.dim_customer
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT ON (email);

-- Built-in: is the business key actually unique in the mart?
ALTER TABLE marts.dim_customer
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.DUPLICATE_COUNT ON (customer_id);

-- Built-in: how stale is the mart? (minutes since last update)
ALTER TABLE marts.dim_customer
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.FRESHNESS ON (updated_at);

-- Results land in SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
-- → alert on them with a task, or surface in Horizon / your catalog.

The division of labor: dbt tests gate the build (a failing test stops bad data propagating), while DMFs monitor the live tables (catching freshness gaps, upstream drift, and slow degradation that a point-in-time build test wouldn't). Use both — they answer different questions.

Monitor the marts, test the vault. Put your heaviest dbt testing on the raw/business vault (where structural correctness is defined) and your DMF monitoring on the information marts (where consumers feel problems). Monitoring every satellite with DMFs is expensive and rarely actionable; monitoring the marts consumers actually query gives you the signal that matters at a fraction of the cost.

④ Governance with Horizon

A vault integrating 40 sources is, by definition, a concentration of sensitive data — PII from CRMs, financials, health or payment data. Governance can't be a spreadsheet of policies; it has to be enforced by the platform. Snowflake's Horizon Catalog unifies the pieces, and crucially they're connected: classify a column and that classification flows into lineage, masking, and discovery automatically.

graph TD
    subgraph Horizon["Horizon Catalog (unified governance)"]
        CLASS["Auto-classification\n+ sensitive-data tags (PII, etc.)"]
        MASK["Masking policies\n(dynamic, tag-driven)"]
        ROW["Row-access policies\n(tenant / region scoping)"]
        LIN["Column-level lineage\nsource → vault → mart"]
        DQ2["Data quality (DMFs)\nresults in context"]
    end
    CLASS --> MASK
    CLASS --> LIN
    MASK -.enforced on.-> MARTS["Information marts\n(+ Iceberg via REST catalog)"]
    ROW -.enforced on.-> MARTS
    LIN -.traces.-> SRC["back to source system"]

Horizon ties classification, masking, row-access, lineage, and data-quality into one model. Tag a column as PII once and the tag-driven masking policy applies everywhere it flows; lineage lets you trace any mart column back through the vault to its source system — the exact capability regulated industries need to prove provenance.

How the governance layers map onto a Data Vault:

Classification & tagging — auto-classify sensitive columns at the staging/raw-vault boundary so tags propagate as data flows into satellites and marts. Tag once, near the source.
Tag-driven masking — attach dynamic masking policies to tags (not individual columns), so any PII-tagged column is masked consistently wherever it surfaces, including in marts shared downstream.
Row-access policies — enforce tenant, region, or business-unit scoping on the marts, so a shared mart shows each consumer only their slice.
Column-level lineage — Horizon traces a mart column back through the business and raw vault to the originating source. This is the "prove where this number came from" capability that makes Data Vault attractive to regulated industries in the first place (and complements the evidence-layer argument in the banking governance piece).

-- Tag-driven masking: define once, applies everywhere the tag appears.
CREATE MASKING POLICY mask_pii_email AS (val STRING) RETURNS STRING ->
  CASE WHEN IS_ROLE_IN_SESSION('PII_READER') THEN val
       ELSE REGEXP_REPLACE(val, '.+@', '****@') END;

-- Associate the policy with a governance tag, not a column.
ALTER TAG governance.pii SET MASKING POLICY mask_pii_email;

-- Now simply tagging any email column enforces masking downstream.
ALTER TABLE marts.dim_customer MODIFY COLUMN email
  SET TAG governance.pii = 'email';

Govern the marts, share the marts. As with quality, concentrate enforcement where data is consumed. The raw vault stays private (it's unusable without deep Data Vault knowledge and leaks source internals); the information marts are the governed, masked, row-scoped sharing surface. Snowflake masking and row-access policies on marts are even enforced for external engines reading via the Iceberg REST catalog — so the governance holds whether a consumer queries through Snowflake or through Spark/Trino.

Best Practices

Land raw, transform in dbt. No cleaning in Openflow. The raw vault's audit value depends on the ingestion layer being a faithful capture, not a transformer.
Centralize hashing rules. Normalization (case, trim, null token, delimiter) defined once and reused everywhere — inconsistent hashing is the bug that silently produces duplicate hubs. automateDV's globals enforce this.
Test the vault, monitor the marts. Heavy dbt testing on vault structure (keys, grain, relationships); DMF monitoring on the marts consumers query. Different layers, different failure modes.
Tag sensitive data near the source. Classify and tag at staging/raw-vault so masking and lineage propagate automatically — retrofitting governance onto a mature vault is painful.
Separate warehouses per layer. Ingestion-adjacent loads, transformation, and consumption on isolated warehouses for workload isolation and clean cost attribution.
Build PITs/bridges and add DMFs reactively. Both are ongoing cost. Add them where measured query patterns or quality incidents justify them, not preemptively.
Version the contracts, not just the code. Publish stable mart contracts so downstream teams aren't coupled to internal vault structure; the mart absorbs vault changes invisibly.

Lessons Learned

Ingestion is where multi-source projects actually fail. Teams over-invest in elegant modeling and under-invest in robust, observable ingestion, then drown in connector failures. A managed layer (Openflow or equivalent) with built-in observability pays back fast at 40 sources.
Quality has two clocks. Build-time (dbt tests) and at-rest (DMFs) catch different problems. Teams that only do build-time tests miss freshness/drift on live marts; teams that only do DMFs let structurally-broken data into the vault. You need both.
Governance retrofits are brutal. Classifying and tagging PII after a vault is mature — across hundreds of satellites — is far harder than tagging at the source boundary from day one. Bake it into staging.
Lineage is the feature that sells Data Vault internally. When an auditor or regulator asks "where did this number come from," column-level lineage from mart back to source is the answer that justifies the whole methodology. Wire Horizon lineage in early; it's also your impact-analysis tool when a source changes.
The marts are the product. The recurring failure mode is a beautiful vault with no usable consumption layer. Ingestion, vault, quality, and governance all exist to deliver trustworthy marts — budget real effort there, because it's the only layer anyone outside the data team ever sees.

The Bottom Line

A production Data Vault on Snowflake in 2026 is no longer just a modeling exercise — the platform now supplies the operational pieces that used to be bespoke. Openflow gives you managed, observable, multi-source ingestion that lands data faithfully. dbt orchestrates the vault and marts with version control and a dependency-aware DAG. Data Metric Functions plus dbt tests make quality provable at both build time and at rest. And Horizon makes governance — classification, masking, row access, and lineage — an enforced property of the platform rather than a document nobody reads.

Get the modeling right (start with the modeling guide), then treat ingestion, quality, and governance as first-class engineering — not afterthoughts. That's the difference between a Data Vault that's a compliance liability and one that's the trustworthy, auditable, change-absorbing foundation it's supposed to be.