dbt Internals and Best Practices: What Happens When You Run dbt run

Everyone uses dbt. Fewer people understand what it actually does. It's not a query engine, not an orchestrator, and not a data warehouse — it's a SQL compiler with a dependency resolver and a test runner built in. Understanding the compilation pipeline is the difference between a dbt project that runs reliably at scale and one that collapses into a spaghetti of {{ ref() }} calls nobody can debug.

This article covers dbt's internal compilation pipeline (what happens between dbt run and the first SQL statement hitting your warehouse), materializations and how they differ on Snowflake vs Databricks vs Redshift vs BigQuery, incremental strategies in depth, common project structure mistakes, and how dbt performs on modern lightweight engines like DuckDB and MotherDuck for development and testing.

The Compilation Pipeline: Jinja → SQL → Execution

When you run dbt run, dbt does five things before executing a single SQL statement against your warehouse:

graph LR
    A["1. Parse project\n(dbt_project.yml +\nall .sql .yml files)"]
    B["2. Resolve\ndependencies\n(build DAG from ref()/source())"]
    C["3. Jinja render\n(compile templates\nto raw SQL)"]
    D["4. Adapter\ntranslation\n(dialect-specific SQL)"]
    E["5. Execute\n(warehouse API calls\nin topological order)"]

    A --> B --> C --> D --> E

dbt's five-phase execution pipeline. Steps 1–4 happen on your local machine or dbt Cloud runner — no warehouse queries. Only step 5 hits the warehouse. This means compilation errors (syntax, missing refs, undefined variables) are caught before any compute is consumed.

Phase 1: Parse

dbt reads every .sql, .yml, and .py file in your project, parses model configs, source definitions, macros, and tests. This phase builds the internal graph — a Python dict mapping node IDs to their metadata. At this point dbt knows every model exists but hasn't compiled any SQL yet.

Phase 2: DAG Resolution

The {{ ref('model_name') }} function is a dependency declaration, not a SQL expression. dbt resolves all ref() and source() calls to their fully qualified table names (e.g., analytics.prod.orders_daily) and builds a directed acyclic graph. If model B refs model A, B depends on A — and dbt will execute A before B. Circular references fail here with a clear error.

Phase 3: Jinja Compilation

dbt uses Jinja2 to render model SQL. Every model file is a Jinja template; {{ ref('stg_orders') }} resolves to the fully-qualified table name, {{ config() }} sets model properties, {{ is_incremental() }} evaluates to True/False based on run context. The output is plain SQL with no Jinja syntax — this is what dbt compile shows you and what actually runs against the warehouse.

-- models/marts/fct_orders.sql (Jinja template)
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge'
) }}

SELECT
    order_id,
    customer_id,
    order_total,
    created_at
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
    WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}

-- Compiled output (what dbt sends to Snowflake)
MERGE INTO analytics.prod.fct_orders AS DBT_INTERNAL_DEST
USING (
    SELECT order_id, customer_id, order_total, created_at
    FROM analytics.prod.stg_orders
    WHERE created_at > (SELECT MAX(created_at) FROM analytics.prod.fct_orders)
) AS DBT_INTERNAL_SOURCE
ON DBT_INTERNAL_DEST.order_id = DBT_INTERNAL_SOURCE.order_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...

Phase 4: Adapter Translation

The adapter layer translates dbt's dialect-independent compiled SQL into the specific SQL dialect of your warehouse. This is where the differences between Snowflake, Redshift, BigQuery, and Databricks become relevant. The same merge incremental strategy compiles to different SQL on each platform — Snowflake uses MERGE INTO, BigQuery uses MERGE with slightly different syntax, Databricks uses MERGE INTO on Delta tables, Redshift uses a delete-then-insert pattern (no native MERGE until recently).

Materializations: What They Actually Do

Table

Drops and recreates the table on every run: DROP TABLE IF EXISTS; CREATE TABLE AS SELECT .... Simple, reliable, expensive for large tables. Use for small dimension tables, reference data, and marts with relatively fast rebuild times.

View

Creates or replaces a view: CREATE OR REPLACE VIEW AS SELECT .... No storage cost. Query performance depends entirely on the upstream tables. Use for staging models and lightweight transformations where freshness matters more than query speed.

Incremental

The most complex and most misunderstood materialization. On the first run (when the table doesn't exist), behaves like a full table build. On subsequent runs, only processes new/changed rows. The exact SQL depends on the incremental_strategy:

Strategy	Mechanics	Best for	Supported on
`append`	INSERT new rows only. No deduplication.	Immutable event logs	All adapters
`merge`	MERGE ON unique_key: update matches, insert new. Idempotent.	Fact tables with updates	Snowflake, BigQuery, Databricks (Delta), Postgres
`delete+insert`	DELETE matching rows, INSERT all new rows. Slower than merge but more compatible.	Replacing partitions	All adapters (Redshift prefers this)
`insert_overwrite`	Replace entire partition(s) atomically. Partition-level idempotency.	Large time-partitioned tables	BigQuery, Databricks, Spark
`microbatch`	dbt 1.9+: Processes each time batch separately with retries. Event-time-based.	Reliable large-scale incremental with late data	BigQuery, Snowflake, Databricks (1.9+)

Ephemeral

Not materialized at all — compiled as a CTE and inlined into the downstream model's SQL. No storage, no warehouse object, no way to query it directly. Use for intermediate transformations you'd otherwise write as subqueries. The downside: complex ephemeral chains make debugging hard because you can't SELECT * FROM that_intermediate_step to check what's in it.

Platform Differences That Actually Matter

Snowflake

The reference dbt platform. Every incremental strategy works reliably. Dynamic Tables (Snowflake-native streaming materialization) can replace some incremental models for near-real-time use cases. CLUSTER BY is the Snowflake equivalent of partitioning — specify it in dbt config for large fact tables. Snowflake's ZERO_COPY_CLONE capability (cloning a table without copying data) makes dbt test --store-failures very cheap — failed test results get cloned into a test schema without data movement.

Databricks (Delta Lake)

Databricks' Delta Lake supports all dbt incremental strategies plus liquid clustering (Delta's partitioning evolution, replacing static PARTITIONED BY). The insert_overwrite strategy replaces Spark partitions atomically and is often faster than MERGE on large tables. Python models (dbt-databricks >= 1.3) let you write Spark DataFrame code as a dbt model — useful for ML feature engineering that's easier in PySpark than SQL. One gotcha: Unity Catalog on Databricks changes table naming from database.schema.table to catalog.schema.table — updating your dbt_project.yml database config is required when migrating to Unity Catalog.

Redshift

Redshift lacks a native MERGE statement until Redshift Serverless (and even there it's limited). dbt's default incremental strategy on Redshift is delete+insert: it deletes rows matching the unique_key, then inserts all new rows. This creates vacuum bloat — rows deleted from column-store pages leave ghost rows that consume space and slow queries until VACUUM runs. Production Redshift dbt projects need a scheduled VACUUM ANALYZE on incremental tables, or they'll degrade significantly over time. Redshift's SORT KEY is the performance lever dbt exposes via the sort config — critical for range-based query patterns on large fact tables.

BigQuery

BigQuery partitioned tables + dbt's insert_overwrite strategy is the standard production pattern: each run replaces the affected date partition(s) atomically. Partition replacement is idempotent and handles late-arriving data correctly. BigQuery's require_partition_filter table option (prevents full-table scans — every query must include a partition filter) can be set via dbt config and is strongly recommended for any table over a few hundred GB.

Azure Synapse / SQL Server

The dbt-synapse adapter is maintained by Microsoft and covers most dbt features, but Synapse Dedicated SQL Pool lacks some Delta Lake features and has different MERGE semantics. External tables from ADLS Gen2 are handled via source configs but have query performance limitations. For Azure-native dbt work in 2024, Fabric Lakehouse (via the dbt-fabric adapter) is increasingly the preferred target over Synapse Dedicated SQL Pool, which Microsoft is steering users away from.

dbt on Modern Open/Lightweight Engines

DuckDB

dbt-duckdb made dbt development genuinely fast. DuckDB is an in-process OLAP engine (runs inside your Python process or CLI, reads Parquet/CSV/JSON directly from disk or S3) that executes SQL at warehouse speeds on a laptop. With dbt-duckdb, you can run your entire dbt project locally against a sample of production data in seconds — no warehouse needed, no costs, no cloud credentials. This changes the development loop: instead of writing a model, pushing to a branch, waiting for a CI job, and paying for warehouse compute to run a test, you write a model, run it locally in 3 seconds, and iterate.

# profiles.yml — DuckDB for local dev
my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: "./dev.duckdb"       # local file, or :memory: for ephemeral
      # Can read Parquet directly from S3 without loading into the DB
      settings:
        s3_region: us-east-1

    prod:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      ...

MotherDuck

MotherDuck is serverless DuckDB in the cloud. Same SQL dialect and dbt-duckdb adapter, but your data lives in a managed cloud database rather than a local file. The hybrid execution model is particularly clever: you can join a local DuckDB table with a MotherDuck cloud table in a single query, running the compute nearest the data. For dbt projects, this means a development environment that's cloud-hosted (no local file management) but still orders of magnitude cheaper than Snowflake or BigQuery for development and testing workloads.

Project Structure: The Patterns That Scale

The standard dbt project structure follows the Fivetran/dbt Labs "jaffle shop" pattern, which has become the community default for good reason:

models/
├── staging/          # 1-1 with source tables; rename, recast, light cleaning
│   ├── salesforce/
│   │   ├── _salesforce__sources.yml
│   │   ├── _salesforce__models.yml
│   │   ├── stg_salesforce__contacts.sql
│   │   └── stg_salesforce__opportunities.sql
│   └── stripe/
│       ├── stg_stripe__charges.sql
│       └── stg_stripe__customers.sql
├── intermediate/     # reusable building blocks; not exposed directly to BI
│   ├── int_customer_order_rollup.sql
│   └── int_revenue_by_month.sql
└── marts/            # business-logic models; exposed to BI and downstream
    ├── core/
    │   ├── dim_customer.sql
    │   └── fct_orders.sql
    └── finance/
        └── fct_mrr.sql

The staging layer is a 1:1 mapping from source system to a clean, typed, consistently named table. Rename acct_no to account_number here. Cast types here. Do nothing else here. The discipline of keeping staging models thin is what makes the intermediate and marts layers predictable.

The 10 Most Common dbt Project Mistakes

Putting business logic in staging models. Staging is for cleaning, not business rules. The moment you write a CASE WHEN with business logic in a staging model, it becomes invisible to anyone looking at the marts layer.
Not using contracts on mart models. dbt contracts (enforced column definitions + data types) prevent silent schema changes from breaking downstream BI reports. Every model exposed to a BI tool or external consumer should have a contract.
Incremental models without +full_refresh protection. A misbehaving incremental model that needs a full refresh will silently contain stale data until someone manually runs --full-refresh. Add a freshness assertion to your tests.
Over-relying on ephemeral models for debugging. Ephemeral models are invisible at query time. Large chains of ephemeral models produce enormous CTEs that are impossible to profile or debug in the warehouse.
Missing on_schema_change config. When an incremental model's source adds a new column, the incremental table won't include it until a full refresh — unless you set on_schema_change: 'sync_all_columns'.
Using {{ ref() }} in macros. Macros don't participate in the dependency graph. Using ref() inside a macro doesn't create a DAG dependency — the referenced model might not exist yet when the macro runs.
No separation of dev and prod targets. Running dbt in dev should write to a dev schema, not prod. Misconfigured profiles that point dev runs at prod tables have caused real data loss.
Testing only for not-null and unique. These are the minimum. Accepted-values, relationship, and custom expression tests are what catch real data quality issues.
Circular source definitions. Defining a staging model as a source of another staging model (instead of using ref()) breaks lineage tracking and creates hidden dependencies.
Giant models. A model with 500 lines of SQL doing 15 joins is not a model — it's a monolith. Break it into intermediate models. Future debuggers will thank you.

The one dbt feature most projects underuse: dbt docs generate and dbt docs serve. A documented, auto-generated data catalog with lineage graphs, column descriptions, and test coverage is a significant data governance artifact — and it's free if you write YAML descriptions. The organizations that document their dbt models in YAML have fundamentally different data literacy than those that don't.

dbt Semantic Layer (MetricFlow)

dbt 1.6+ introduced the Semantic Layer (powered by MetricFlow), which lets you define metrics — not just SQL transformations — in dbt. A metric defined in dbt can be queried by BI tools (Tableau, Looker, Hex, Evidence) via a consistent interface, ensuring the number means the same thing everywhere it appears.

# models/metrics/revenue_metrics.yml
semantic_models:
  - name: orders
    model: ref('fct_orders')
    entities:
      - name: order
        type: primary
        expr: order_id
      - name: customer
        type: foreign
        expr: customer_id
    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: order_total
        agg: sum
        expr: order_total

metrics:
  - name: total_revenue
    type: simple
    label: Total Revenue
    type_params:
      measure: order_total
  - name: revenue_30d
    type: cumulative
    label: Revenue (30-day rolling)
    type_params:
      measure: order_total
      window: 30 days

The Semantic Layer is dbt's answer to the "metrics are defined differently in every BI tool" problem. It's still maturing, but for teams running multiple BI tools consuming the same dbt project, it's the right architectural direction — define the metric once, query it consistently everywhere.