Most people treat a Power BI semantic model as a black box. You connect some tables, write a few measures, publish, and occasionally wonder why your dashboard takes nine seconds to load when your colleague's does the same calculation in under one. This post is about opening that box โ not to make it seem complicated, but because understanding what's inside makes a lot of previously mysterious behavior suddenly obvious.
Fair warning: this goes pretty deep. There will be compression algorithms and engine internals. If that's your thing, read on.
VertiPaq: Not Just "In-Memory Storage"
When people say Power BI loads data "into memory," they're technically correct but slightly misleading. Data doesn't go into memory the way a CSV file gets loaded into a pandas DataFrame. It goes through VertiPaq โ a columnar, compressed, in-memory analytical database that Microsoft also uses in Analysis Services and, increasingly, in Microsoft Fabric.
The columnar part matters. VertiPaq doesn't store your table as rows. It stores each column as a separate, independently compressed array. A fact table with 10 million rows and 20 columns becomes 20 compressed column segments. Analytical queries that touch 3 of those 20 columns only need to decompress those 3 segments. The other 17 are completely ignored. For data warehouse-style workloads where you're almost always aggregating a small number of columns across a large number of rows, this is a massive win over row-based storage.
The Compression Tournament Every Column Goes Through
Before VertiPaq can apply run-length encoding (RLE) โ compressing "New York, New York, New York, New York" into "New York ร 4" โ it needs to get each column into a numeric form first. This is where the encoding stage happens, and it's worth knowing about.
Value encoding (integers only)
For integer columns, VertiPaq finds the minimum value in the column and subtracts it from everything. A column containing [1000, 1001, 1002, 1003] becomes [0, 1, 2, 3]. Now you don't need 32-bit integers to represent these values โ 2 bits is enough. The engine then applies a bit-packing scheme to pack values as densely as possible into the column segment. The actual compression ratios depend on the data, but for well-behaved integer keys with low cardinality, this is very effective.
Gotcha: Columns used in relationships never get value encoding, even if they're integers. VertiPaq needs those key columns in dictionary-encoded form for its join structures. This is often why a surrogate key column looks disproportionately large in VertiPaq Analyzer.
Dictionary (hash) encoding โ everything else
String columns and integers where value encoding isn't efficient get dictionary encoding. VertiPaq builds a lookup table of every unique value in the column, assigns each one an integer ID, then stores only the IDs. "United States" might be ID 42. Instead of storing those two words on 10 million rows, VertiPaq stores 42 ten million times โ and a 2-byte integer beats a 13-character string every time.
The implication you need to internalize: high-cardinality string columns are expensive. A column with 8 million unique values needs an 8-million-entry dictionary. There's no meaningful compression on the dictionary itself. This is why GUIDs, hashed tokens, and free-text description fields can balloon your model size โ and why you should almost never import a column you don't actually need in a report.
The Two Engines and Why You Should Care About Their Ratio
flowchart LR
Q([DAX Query]) --> FE
subgraph FE ["Formula Engine โ single-threaded"]
direction TB
F1["Parses DAX logic\napplies filter context"]
F2["Iterates rows\n(SUMX / FILTER / etc.)"]
F1 --> F2
end
FE -->|"datacache request"| SE
subgraph SE ["Storage Engine โ multi-threaded"]
direction TB
S1["Decompresses\ncolumn segments"]
S2["Scans & aggregates\nin parallel threads"]
S1 --> S2
end
SE <-->|"reads compressed\ncolumn arrays"| VP[("VertiPaq\nIn-Memory Store")]
SE -->|"datacache response"| FE
FE --> R([Result])
Ideal flow: 80% of work in the multi-threaded SE, 20% in the FE. When an iterator forces the FE to loop row-by-row, that ratio inverts and performance collapses.
Every DAX query flows through two distinct processing engines. Understanding their different capabilities is the difference between writing DAX that performs and writing DAX that looks right but silently degrades at scale.
The Storage Engine (SE)
The SE is fast, multi-threaded, and operates directly on compressed column segments. It handles data retrieval, scanning, and simple aggregations. When you call SUM(Sales[Amount]), the SE scans the compressed Amount column, decompresses it, and sums it โ all in parallel threads, often using SIMD CPU instructions. It generates "datacache" responses that the FE consumes.
The Formula Engine (FE)
The FE is where DAX logic lives. It's single-threaded, operates row by row in certain contexts, and handles the complex filtering, calculation, and iteration that DAX is known for. It sends requests to the SE and waits for responses โ synchronously, one at a time. That "one at a time" part is important.
A healthy DAX query looks something like: FE receives the query, generates a few efficient SE requests, SE executes them in parallel and returns datacaches, FE assembles the results. A unhealthy DAX query looks like: FE generates hundreds of individual SE requests (one per row, one per filter combination), SE executes them sequentially, and your visual takes 12 seconds to load on a table with 80K rows.
In DAX Studio, you can see the SE/FE split in the query trace. If your query is generating more than a handful of SE calls, something is wrong with how the DAX is structured. A good ratio is roughly 80% SE work and 20% FE overhead.
Context Transition: The Most Expensive Thing Nobody Warns You About
This is the one that gets people. Context transition is the mechanism by which CALCULATE converts an existing row context into a filter context. It sounds abstract until you understand what it costs at scale.
Every time you reference a measure inside an iterator (SUMX, FILTER, AVERAGEX, etc.), that measure call is automatically wrapped in CALCULATE, which triggers context transition for every row of the iteration. The model is re-filtered for each row individually.
-- Looks innocent, genuinely isn't on large tables
Total Margin =
SUMX(
Sales,
[Unit Price] - [Unit Cost] -- measure references inside iterator
)
On a fact table with 5 million rows, that SUMX creates 5 million filter contexts. The FE evaluates each one individually. Even if each evaluation is microseconds, 5 million of them adds up โ and the FE's single-threaded nature means they're serialized. This is the pattern that makes developers say "DAX is slow" when really the DAX is structurally wrong for the workload.
The fix: Rewrite to computed column expressions or use native aggregation functions where possible. SUMX(Sales, Sales[UnitPrice] - Sales[UnitCost]) references columns, not measures โ no context transition, the SE handles it. The difference in execution time on large tables can be 50x or more.
When context transition is intentional
Context transition isn't inherently evil โ it's the mechanism that makes measures dynamic. When you build a measure that calculates "Sales for the currently selected product," context transition is exactly what you want. The problem is doing it 5 million times inside an iterator when you didn't mean to. The diagnostic is: know where your iterators are, and think carefully before putting measure references inside them.
Storage Modes: The Decision That Shapes Everything Else
This gets covered a lot, so I'll focus on the parts that usually get glossed over.
Import mode โ still the benchmark
Data loads into VertiPaq at refresh time and stays there. Queries run entirely in memory, which means they're fast, consistent, and not at the mercy of your source database's performance. The tradeoffs are well-known: 64GB model size limit in Fabric capacities (though large models can go higher with Premium), refresh latency, and memory consumption that scales with your data volume.
What's less appreciated: Import mode lets you write DAX freely. Complex time intelligence, iterators, nested CALCULATE calls โ VertiPaq handles them well because it has full control of the data and can use column statistics, result caches, and its compression structures efficiently.
DirectQuery โ good in theory, humbling in practice
Every DAX query gets translated to SQL and sent to your source database. The translation engine is genuinely impressive for simple measures. For complex ones, it can produce SQL that would make a database administrator weep. I once saw a YTD measure generate a 400-line nested subquery against a Snowflake warehouse. The query returned the right answer. It took 38 seconds.
DirectQuery works well when your source database is fast, your measures are simple, and your users can tolerate some query latency. It falls apart when any of those conditions aren't met, and the failure mode is silent โ you get correct results, just very slowly.
Direct Lake โ the interesting one
Direct Lake reads Delta Parquet column segments directly from OneLake and loads them into VertiPaq on demand. No scheduled refresh. Near-Import performance on data that can be updated continuously.
The model goes through distinct performance states you should understand:
| State | What it means | Query performance |
|---|---|---|
| Cold | No columns loaded into VertiPaq memory | First query triggers transcoding from Parquet โ noticeable delay |
| Semiwarm | Some columns loaded, others still on disk | Mixed โ fast for cached columns, slower for uncached ones |
| Warm | All required columns in VertiPaq memory | Import-equivalent |
| Hot | Warm + VertiScan result caches from repeated queries | Fastest โ results may be served from cache |
stateDiagram-v2
direction LR
[*] --> Cold : model deployed / long idle
Cold --> SemiWarm : first query triggers\nParquet โ VertiPaq transcoding
SemiWarm --> Warm : all needed columns\nloaded into memory
Warm --> Hot : VertiScan cache builds\nfrom repeated queries
Hot --> Warm : new query pattern\ncache miss
Warm --> SemiWarm : memory pressure\nevicts some columns
SemiWarm --> Cold : capacity flush\nor extended idle
Direct Lake model states. Framing (picking up new Delta table versions) is near-instant and doesn't reset this state machine โ column eviction is the only thing that drops you back toward Cold.
Under memory pressure, VertiPaq evicts column segments and the model drops back toward cold. This is why a Direct Lake dashboard can feel blazing fast during business hours (when it's been warmed up by 20 report users) and inexplicably sluggish on Monday morning when nobody has touched it since Friday afternoon.
Practical Optimization: What Actually Moves the Needle
Generic advice like "avoid high-cardinality columns" and "use star schemas" is true but not very actionable. Here's what actually matters in practice.
Use DAX Studio before changing anything
DAX Studio connects to your semantic model and lets you run DAX queries while capturing a detailed trace. You'll see: every SE call generated, how many there were, how long each took, and the overall FE/SE split. This costs 10 minutes and usually identifies the actual problem immediately. A lot of "optimization work" is really just fixing one badly structured measure that's generating 300 SE callbacks.
Run VertiPaq Analyzer on your model
VertiPaq Analyzer (also inside DAX Studio) shows your model's memory footprint column by column. Look for: columns where the dictionary is larger than the data column (high-cardinality strings), tables with many more rows than you expected (intermediate tables that should be calculated, not materialized), and calculated columns on large fact tables (every calculated column is evaluated at refresh time and stored for every row).
Star schema matters for compression, not just query design
Textbooks say star schema improves query performance because simpler joins mean simpler SQL. In VertiPaq land the reason is slightly different. Dimension tables tend to have low-cardinality columns (Status, Category, Region) โ these compress extremely well. Fact tables have numeric foreign keys that also compress well. Keeping your dimension attributes separate from your fact table maximizes VertiPaq's ability to compress each column optimally.
A single wide denormalized table with 40 columns is not just hard to maintain โ it actively hurts compression because string attributes that belong to slow-moving dimensions are replicated alongside millions of transaction rows, defeating run-length encoding.
Think twice before calculated columns on fact tables
A calculated column on a 10-million-row fact table stores 10 million values. It's computed at every refresh and held in RAM indefinitely. A measure computes on demand and is never stored. If the calculation is aggregation-based, a measure is almost always the right choice. Calculated columns earn their keep for things that feed relationships (integer keys), enable slicing, or require row-level categorization logic that's too expensive to repeat at query time โ but that's a short list.
A Note on DAX and "Context"
If you've been writing DAX for a while and still find filter context and row context confusing, you're not alone โ the documentation explains the mechanics but rarely explains the intuition. Here's the one-liner that actually helped me: filter context is about which rows are visible; row context is about which row you're currently on. They're separate concerns that interact only through context transition (via CALCULATE). When something behaves unexpectedly, the diagnostic question is usually "what filter context is this measure being evaluated in, and is that what I intended?"
Wrapping Up
Semantic models aren't magic โ they're VertiPaq underneath, with two distinct engines handling your queries, a compression system that rewards good data modeling, and a storage mode choice that determines your entire performance envelope. Understanding those facts won't make you write better-looking DAX, but it will make you write faster DAX โ and help you debug it when it isn't.
The models that perform consistently are usually the ones built by someone who thought deliberately about cardinality, avoided unnecessary columns, structured their star schema to maximize VertiPaq compression, and tested with real query traces rather than guessing. That's a learnable skill, not an artform.