VertiPaq Internals: What's Really Happening When Power BI Loads Your Model

Every time you click "Refresh" in Power BI, something remarkable happens in the background. A 200 MB Excel file becomes a 40 MB in-memory structure that answers complex aggregations in milliseconds. The engine responsible for this transformation is VertiPaq — and most people who use it every day have no idea how it actually works. That ignorance costs them hours of slow report load times and models that could be 70% smaller if they knew what the engine needs.

Let's fix that.

A Brief History of xVelocity / VertiPaq

VertiPaq didn't start as a Power BI thing. It began life in 2009 as the engine behind PowerPivot — that Excel add-in that let analysts jam millions of rows into a spreadsheet and still have Excel not crash. The original internal codename was "Gemini," which someone probably thought was very clever.

In 2012, Microsoft renamed the engine "xVelocity" for the SQL Server 2012 launch. This was purely a marketing exercise — the engine itself hadn't changed significantly. Shortly after, because xVelocity was apparently too hard for people to say, the columnar in-memory part went back to being called VertiPaq, while xVelocity lingered as an umbrella term that covered both the in-memory path (VertiPaq) and the ColumnStore Index path in SQL Server. Confusing? Absolutely. Microsoft naming things is a sport with unusual rules.

By 2015, VertiPaq was the heart of SSAS Tabular mode, Power BI Desktop, and Excel Power Pivot — all running the same engine with different front-ends bolted on. Fast-forward to 2025, and the same engine runs inside Microsoft Fabric's semantic models, though V-Order optimizations on the Parquet side mean that cold-start loads are dramatically faster than they were five years ago.

Columnar Storage: Why Columns Beat Rows for Analytics

Traditional row-oriented databases store data the way you'd read a CSV: one row at a time, all columns together. That's great for OLTP workloads where you fetch individual records. It's catastrophic for analytics where you want to sum one column across 10 million rows — you're forced to read all the other columns you don't care about, burning memory bandwidth.

VertiPaq stores each column separately in memory. When DAX asks "what's the sum of Sales Amount?", the engine reads only the Sales Amount column vector — nothing else. For a table with 30 columns, that's potentially a 30x reduction in data touched. On modern CPUs where memory bandwidth is often the bottleneck for analytics workloads, this is not a minor optimization. It's the whole ballgame.

Each column in VertiPaq is stored as a contiguous array of integers (yes, always integers — more on that in a moment). The contiguous layout means sequential memory access, which is enormously cache-friendly. VertiPaq also leverages SIMD (Single Instruction, Multiple Data) CPU instructions to process multiple values per clock cycle, which is why a properly optimized model feels instantaneous even on commodity hardware.

The Three Encoding Types

The clever part of VertiPaq is that it doesn't just store your data — it transforms it. Every column goes through an encoding phase at refresh time, and the engine picks the best encoding based on the column's characteristics. The goal is always the same: represent every value as a small integer, so the column fits in CPU cache and compresses aggressively.

Value Encoding

For integer columns with reasonable ranges. Subtract the minimum value from every entry. A column with values [1000, 1001, 1002, 1003] becomes [0, 1, 2, 3] — needing far fewer bits. Only works on integer types.

Dictionary Encoding

For high-cardinality strings and anything non-integer. Build a sorted dictionary of unique values, then store integer indexes into that dictionary. "London", "Paris", "London" becomes dict=[London,Paris], data=[0,1,0].

Run-Length Encoding (RLE)

Applied after value or dictionary encoding. Replaces [0,0,0,0,1,1,1] with [(0,4),(1,3)] — a count of consecutive identical values. Catastrophically effective on low-cardinality columns sorted in a smart order.

In practice, most columns get dictionary + RLE. The dictionary encoding converts everything to integers, and then RLE compresses the resulting integer stream. What makes or breaks compression is sort order: if all rows for "London" are adjacent, RLE compresses them into a single run. If they're scattered randomly, RLE is useless. This is why sorting dimension tables by their most-filtered columns is one of the highest-leverage VertiPaq optimizations — it costs almost nothing and can halve model size.

Practical implication: When you see a Date column in VertiPaq with a cardinality of 1,826 (five years of dates), it's perfectly fine — dates compress well because they're inherently sorted. A GUID column with 10 million unique values? That's a dictionary with 10 million entries, no compression possible, and a hashmap that takes up RAM proportional to cardinality. If you don't need to filter or relate on that GUID, don't include it in your model.

Segments: The Unit of Work

VertiPaq doesn't store an entire column as one monolithic array. It splits each column into segments of up to 8 million rows (the exact size can vary). A segment is the atomic unit of work for the Storage Engine — it's the smallest chunk that can be scanned in parallel. Within a segment, RLE encoding is applied independently, which is why the ordering of rows within each segment matters for compression, not just global sort order.

Partitioning a table in VertiPaq (SSAS Tabular feature, also available in Fabric Premium semantic models) lets you split a fact table across multiple partitions. Each partition holds its own set of segments. This is primarily useful for incremental refresh — you can re-process only the partition covering recent data rather than re-encoding the entire table. It also enables parallel refresh at the partition level, which on large models with expensive refresh pipelines is a real operational win.

One subtle point: VertiPaq rebuilds all relationship hashmaps whenever any table in a join changes. If you frequently refresh small partitions, you're triggering hashmap rebuilds at each refresh. For models with hundreds of relationships across dozens of tables, this rebuild cost can be significant. Scope your partitions to the tables that actually change.

The Two-Engine Architecture: SE and FE

Here's something that surprises most Power BI developers the first time they look at DAX Studio query plans: there are two entirely separate engines processing your DAX queries, and they have almost nothing in common architecturally.

flowchart TD
    DAX["DAX Query"] --> FE

    subgraph FE["Formula Engine — Single Thread"]
        direction TB
        Parser["Query Parser / AST Builder"]
        Plan["Logical Plan Generator"]
        Cache["FE Result Cache"]
        Parser --> Plan --> Cache
    end

    subgraph SE["Storage Engine — Multi-Threaded"]
        direction TB
        Scan1["Segment Scanner T1"]
        Scan2["Segment Scanner T2"]
        Scan3["Segment Scanner T3"]
        RLE["RLE Decompressor + SIMD"]
        Hash["Relationship Hashmap Lookup"]
        Scan1 & Scan2 & Scan3 --> RLE --> Hash
    end

    FE -->|"datacache requests"| SE
    SE -->|"datacache results"| FE
    FE --> Result["Query Result"]

SE runs multi-threaded across CPU cores; FE is single-threaded and coordinates the scan requests. The ratio of time spent in FE vs SE determines query optimization strategy.

The Storage Engine (SE) is the fast path. It understands columns, segments, dictionaries, and RLE. It can scan millions of rows per second using multiple CPU cores simultaneously, with SIMD instructions processing 256 or 512 bits at a time on modern hardware. The SE speaks in "datacache" requests — it returns raw aggregated results (sums, counts, distinct counts) that the FE then assembles into the final answer.

The Formula Engine (FE) is the slow path. It's single-threaded, it interprets DAX expression trees, it handles context transitions, and it manages the complex evaluation logic that makes DAX powerful. The FE can't run in parallel. When a query spends most of its time in the FE, no amount of hardware upgrades or model optimization will help — the bottleneck is the sequential evaluation logic.

The golden rule is roughly 80% SE / 20% FE. If your query plan shows 60% FE time, you're probably hitting a context transition or an iterating function that's forcing the FE to make many small datacache calls instead of one large scan. In DAX Studio, enable "Server Timings" to see the SE vs FE split for any query — it's the single most useful diagnostic available.

The Context Transition Trap

Context transitions are the most common cause of excessive FE time. They happen when CALCULATE (or an implicit CALCULATE inside a measure) converts a row context into a filter context. Each row in an iteration potentially triggers a new SE call for the inner expression.

-- SLOW: SUMX iterates every row, context transition on each
Running Total =
SUMX(
    Orders,
    CALCULATE( SUM(Orders[Amount]) )  -- context transition per row
)

-- FAST: SE-native aggregation, single datacache call
Running Total =
CALCULATE(
    SUM(Orders[Amount]),
    FILTER(
        ALL(Orders[OrderDate]),
        Orders[OrderDate] <= MAX(Orders[OrderDate])
    )
)

The first version above calls CALCULATE inside SUMX, which triggers a context transition for every row in the Orders table. With 500,000 orders, that's 500,000 individual SE calls. The second version pushes the entire logic to a single filter context that the SE can evaluate in one pass. Same result; wildly different performance.

Relationship Hashmaps: How Joins Actually Work

Relational databases implement joins at query time using hash joins or nested loops. VertiPaq does something completely different: it pre-computes the join structure at refresh time and stores it as a hashmap in memory — typically in CPU cache for small-to-medium dimension tables.

When the SE needs to traverse a relationship (say, from a fact table row to a dimension table row), it does a hashmap lookup that's essentially O(1) per row. The hashmap maps the foreign key values in the fact table to the corresponding row positions in the dimension table. Because the entire hashmap for a well-designed star schema fits in L3 cache, the relationship traversal cost is dominated by memory latency, not compute.

This design has one important consequence: relationship hashmaps are rebuilt every time either participating table is processed. For a fact table that refreshes hourly, this is fine. For a model where you're doing incremental partition processing on a 500M-row fact table, the hashmap rebuild for every relationship touching that table is a non-trivial cost. Keep your dimension tables small, keep your cardinality under control, and your relationship performance stays predictable.

V-Order: VertiPaq Optimization in Parquet

In Microsoft Fabric, semantic models can load data directly from Delta Lake Parquet files via Direct Lake mode. To make this fast, Microsoft invented V-Order — a set of write-time optimizations applied to Parquet files that make them read more efficiently by VertiPaq.

V-Order applies several transformations when writing Parquet: it sorts column data to maximize RLE compression (putting repeated values adjacent), applies dictionary encoding consistent with VertiPaq's encoding scheme, and packs data into row groups that align with VertiPaq's segment size expectations. The result is Parquet files that VertiPaq can load roughly 50% faster than standard Parquet, because less decompression and re-encoding work is needed at load time.

The catch: V-Order is currently only applied by Microsoft tools (Fabric Lakehouse writes, Dataflow Gen2, certain Spark configurations). If you write Parquet from external tools (vanilla PySpark, pandas, dbt) without explicitly enabling V-Order, you don't get these benefits. There's no open-source V-Order writer as of early 2026. Worth planning your data pipeline around this if Fabric Direct Lake is on your roadmap.

Direct Lake gotcha: V-Order is not automatically applied by external Parquet writers. Files written by vanilla Spark or dbt will load slower in Direct Lake mode than files written by Fabric's native tools. Check your pipeline: if data flows through Lakehouse notebook cells, V-Order is applied. If it arrives via ADLS Gen2 external shortcuts from non-Fabric tools, it probably isn't.

VertiScan and the SE Cache

The Storage Engine maintains an internal result cache called the VertiScan cache (sometimes called the datacache). When the same SE request is made multiple times — which happens constantly in dashboards with many visuals using the same base measure — the second and subsequent calls return the cached result without scanning any columns. This cache is per-session in interactive Power BI, and longer-lived in premium capacity deployments.

What this means in practice: the first visual on a page that uses a measure will incur the full SE scan cost. Subsequent visuals using the same measure at the same filter context get the cached result instantly. Report page layout actually matters for perceived performance: put your most expensive visuals first in tab order, let the cache warm up, and the page will feel faster overall.

The VertiScan cache is also why "cold start" vs "warm" query benchmarks differ dramatically. In a fresh session or after a dataset refresh, all caches are cold. A report that takes 8 seconds on first load might take 0.3 seconds on second load. Don't benchmark your DAX on the first run; benchmark after a cache warm-up to understand steady-state performance.

How VertiPaq Compares to Similar Engines

flowchart LR
    subgraph OLAP["In-Memory Columnar (OLAP)"]
        VP["VertiPaq\nPower BI / SSAS\nAll data in RAM\nProprietary format\nDAX query language"]
        CS["SQL Server\nColumnStore Index\nDisk-backed + cache\nSQL query language\nSame xVelocity core"]
    end

    subgraph OSS["Open-Source / Embedded"]
        Duck["DuckDB\nIn-process\nParquet/Arrow native\nSQL only\nNo server"]
        Arrow["Apache Arrow\nIn-memory format\nNo query engine\nColumnar IPC standard"]
    end

    subgraph Dist["Distributed Columnar"]
        Spark["Spark Parquet\nDistributed\nSQL + DataFrame\nS3/ADLS backed"]
        BQ["BigQuery / Redshift\nManaged cloud\nSQL only\nSeparated compute"]
    end

    VP -.->|"same engine core"| CS
    VP -.->|"V-Order for Parquet"| Spark
    Duck -.->|"Arrow IPC"| Arrow

VertiPaq and SQL Server ColumnStore share the same xVelocity core but differ in deployment model: VertiPaq lives entirely in RAM, ColumnStore is disk-backed with in-memory caching.

VertiPaq vs SQL Server ColumnStore Index

These two are siblings, not rivals. Both use xVelocity under the hood. The difference is operational: VertiPaq models are fully in-memory (the entire dataset must fit in RAM), while ColumnStore Indexes are disk-backed and use a buffer pool. ColumnStore can handle datasets larger than available RAM; VertiPaq cannot. VertiPaq wins on raw query speed for datasets that fit in memory because there's zero I/O latency — every column scan hits RAM or CPU cache directly.

VertiPaq vs DuckDB

DuckDB is the interesting comparison. DuckDB is an embedded analytical engine that runs in-process, queries Parquet files natively, and achieves impressive performance through vectorized execution and aggressive columnar optimization. For data engineering and analytical Python workflows, DuckDB is often the right choice — it's open-source, embeddable, and speaks SQL.

VertiPaq wins in the BI-specific scenario: interactive multi-user dashboards where dozens of DAX measures with complex filter contexts need sub-second responses. Its pre-computed relationship hashmaps, dictionary encoding, and SE cache are tuned specifically for this pattern. DuckDB lacks a comparable caching layer for repeated aggregation queries and doesn't support DAX's row/filter context semantics natively. Different tools for different problems.

VertiPaq vs Apache Arrow

Arrow is a memory format and IPC standard, not a query engine. Saying "VertiPaq vs Arrow" is a bit like comparing a car engine to a fuel standard. That said, VertiPaq's columnar layout is conceptually similar to Arrow's columnar memory format — both store column vectors contiguously in memory. The key difference is that VertiPaq is opaque and proprietary; Arrow is an open standard that many engines (Spark, DuckDB, pandas, Polars) use as an interchange format. Microsoft's V-Order optimization on Parquet is essentially a VertiPaq-friendly hint embedded in an open format.

Best Practices From the Trenches

Knowing the internals makes the best practices obvious rather than magical:

Minimize column cardinality ruthlessly. High-cardinality columns (GUIDs, timestamps with milliseconds, free-text fields) explode dictionary size and make RLE useless. Ask: do I actually filter or group by this column? If not, cut it.
Sort dimension tables by frequently filtered columns. VertiPaq's RLE compression improves dramatically when equal values are adjacent. A Date dimension sorted by Date is already optimal. A Customer dimension sorted by Country, then Region, then City can cut model size by 20–30% compared to arbitrary row order.
Avoid wide fact tables. Every column in a fact table costs memory, even if it compresses well. Use integer surrogate keys instead of text foreign keys; let the dimension dictionary carry the strings.
Use integers for numeric columns whenever possible. Value encoding only works on integer types. A decimal column gets dictionary encoding, which is less efficient. If you're storing amounts to two decimal places, consider multiplying by 100 and storing as integer.
Profile with VertiPaq Analyzer before and after every significant model change. It shows column-level memory usage, segment counts, and cardinality. Most model bloat is invisible until you actually look at the numbers.
Aim for 80% SE / 20% FE in query plans. Anything with more than 30% FE time deserves investigation. Context transitions are usually the culprit. Rewrite iterating measures to use SE-native functions (SUMX → SUM + FILTER pattern) wherever feasible.
Partition large fact tables for incremental refresh. But only for tables that actually change. Processing unchanged partitions burns CPU on hashmap rebuilds for zero benefit.

flowchart TD
    Start([Performance Problem?]) --> Q1{Server Timings:\nFE % > 30?}

    Q1 -->|Yes| FEPath["Formula Engine Issue"]
    FEPath --> FE1["Check for SUMX/AVERAGEX\nwith CALCULATE inside"]
    FEPath --> FE2["Check context transitions\nin calculated columns"]
    FEPath --> FE3["Use ALL/ALLEXCEPT\ninstead of REMOVEFILTERS"]

    Q1 -->|No| Q2{VertiPaq Analyzer:\ncolumn > 50MB?}

    Q2 -->|Yes| SEPath["Storage Engine / Model Issue"]
    SEPath --> SE1["High cardinality column?\nRemove or hash bucket"]
    SEPath --> SE2["Check sort order:\nare equal values adjacent?"]
    SEPath --> SE3["Relationship count:\n> 20 relationships rebuilding hashmaps?"]

    Q2 -->|No| Q3{Cold only or\nalways slow?}
    Q3 -->|Cold only| Cache["Cache issue:\nwarm up with pre-aggregations\nor Premium caching policy"]
    Q3 -->|Always| Network["Network / gateway\nbottleneck — not VertiPaq"]

A decision tree for diagnosing Power BI performance problems. Start with Server Timings in DAX Studio to split FE vs SE, then drill into VertiPaq Analyzer for model-level metrics.

The Evolution Continues: What's Changed in Fabric

VertiPaq in Microsoft Fabric 2025 is meaningfully different from what shipped in Power BI Desktop 2016, even though the core encoding logic is unchanged. The notable additions:

Direct Lake mode: Semantic models can read from Delta Parquet files directly without an import step. The engine loads V-Order-optimized Parquet into VertiPaq's in-memory format on demand, with a warm/hot state cache in Fabric premium capacity. Cold-start loads are still slower than pre-refreshed Import mode, but the gap has closed significantly.
Automatic aggregations: Fabric semantic models can define pre-aggregated summary tables that VertiPaq uses to satisfy queries without scanning the full fact table. This is most valuable for very large fact tables (billions of rows) where even columnar scans become slow at dashboard scale.
Semantic model web authoring: Live editing of DAX measures via browser-based IDE with integrated Server Timings. Still early, but it removes the friction of installing DAX Studio for basic diagnostics.
User-defined aggregations vs. automatic: The old composite models with manual aggregation tables are still valid, but the ML-driven automatic aggregation detection in Fabric is gradually making manual aggregation table management obsolete for common patterns.

The throughput story is also improving. Fabric Premium F64 SKUs and above get genuinely beefy RAM allocations — models that would have been impractical in Power BI Premium P1 are now viable. The architecture is the same; the hardware constraints have loosened considerably.

Tooling: How to Actually See What's Happening

Reading about VertiPaq internals is useful. Actually instrumenting your model is essential. The two tools you need:

DAX Studio (free, daxstudio.org): Connect to any Power BI Desktop or Analysis Services model. Enable "Server Timings" before running a query to see SE vs FE split, number of SE calls, and total scan duration. The "Query Plan" view shows the logical and physical plan generated by the FE. Essential for diagnosing slow measures.
VertiPaq Analyzer (free, built into DAX Studio / SQLBI): Shows every column's memory footprint, dictionary size, segment count, cardinality, and compression ratio. The columns view sorted by "Total Size" will immediately reveal where your model's memory is going. Most model bloat is 2–3 columns that nobody realized were high-cardinality.

Run VertiPaq Analyzer on every new model you build. The first time you see a free-text "Notes" column sitting at 400 MB because it has 500,000 distinct values and zero compression, you will immediately understand why cardinality management exists as a practice.

The Bottom Line

VertiPaq has been the same engine for 15 years because the fundamentals are right: columnar storage, dictionary encoding with RLE, SIMD-accelerated scans, pre-computed relationship hashmaps, and a clean separation between multi-threaded scanning (SE) and sequential evaluation (FE). The optimizations added over the years — V-Order, Direct Lake, automatic aggregations — layer on top of this foundation rather than replacing it.

Understanding the internals isn't academic. Every best practice in Power BI model design — star schemas, integer keys, sorted dimensions, narrow fact tables, SE-native DAX patterns — is a direct consequence of how these encoding and scan mechanisms work. Once you see the system, the rules become obvious rather than arbitrary cargo cult wisdom from some blog post (including this one).

The next time your report is slow, open DAX Studio first. Forty-five seconds with Server Timings will tell you more than three hours of guessing.