🧱 This is Part 3 of a 4-part series: Databricks Deep Dive
- The Data Intelligence Platform: A Practitioner's Overview
- Internals: Photon, the Delta Log, and How a Query Actually Runs
- Spark Performance Optimization: AQE, Shuffle, Skew & Data Layout (you are here)
- Building a HIPAA-Compliant Health Data Lakehouse
Most Spark tuning advice on the internet is a list of config flags to copy-paste, which is exactly why most Spark tuning fails. Flags without a model of why a job is slow are cargo-culting. So this article assumes you've read Part 2 — you know that the Delta log decides how much you read, that shuffles are the expensive stage boundaries, and that Photon accelerates the compute in between. Every technique here targets one of those three. That's the whole framework: read less, shuffle less (and evenly), and keep the compute vectorized.
I'll go in the order I actually work a slow job: measure first, then layout, then shuffle and joins, then skew, then the cluster. Reordering that list is how people waste a week tuning the cluster when the real problem was a missing OPTIMIZE.
Rule zero: read the query plan before changing anything
The Spark UI's SQL tab and the query plan are the only honest source of truth. Before touching a single knob, I look for four things: how much data was scanned (did file skipping work?), where the shuffles are, whether any stage spilled to disk, and whether tasks within a stage finished at wildly different times (the signature of skew — one task taking 10× the others). Ninety percent of the time, the plan tells you the answer before you've formed a hypothesis.
The most expensive mistake is optimizing the wrong stage. A job that takes 40 minutes because one skewed task runs for 35 of them will not get faster if you add nodes, raise shuffle partitions, or enable caching. You have to fix the skew. Always let the plan tell you which problem you actually have.
Layer 1: read less — data layout and file skipping
This is the highest-leverage layer and the one people skip because it's not a code change. From Part 2: the engine uses per-file min/max statistics to skip files before reading. The job of data layout is to make those ranges tight so skipping is maximally effective. The wrong layout means every file's min/max spans the whole range, nothing is skippable, and every query is a full scan no matter how clever your code.
OPTIMIZE and the small-files problem
Streaming and frequent small writes produce thousands of tiny files, and tiny files are death by a thousand metadata operations. OPTIMIZE compacts them into right-sized files (around 1 GB target), which both cuts file-listing overhead and tightens statistics:
OPTIMIZE orders;
-- compacts small files; on liquid-clustered tables also clusters incrementally
Z-order vs liquid clustering
To make skipping work for queries that filter on specific columns, you co-locate similar values so each file holds a narrow range. Two mechanisms, and in 2025 the choice is mostly made for you:
| Z-ordering | Liquid clustering | |
|---|---|---|
| How | OPTIMIZE t ZORDER BY (col) — multi-dimensional sort on each run | CLUSTER BY on the table — incremental, self-managing |
| Rewrite cost | Full rewrite of affected data each OPTIMIZE | Incremental — only new/changed data, no full rewrite |
| Changing keys | Painful — re-Z-order everything | Just change CLUSTER BY; layout adapts over time |
| 2025 guidance | Legacy / existing partitioned tables | Default for new tables |
-- New tables: liquid clustering, no partitioning needed
CREATE TABLE orders (order_id BIGINT, customer_id BIGINT, order_date DATE, net_amount DECIMAL(18,2))
CLUSTER BY (order_date, customer_id);
-- Existing table: enable it
ALTER TABLE orders CLUSTER BY (order_date, customer_id);
OPTIMIZE orders;
Stop hash-partitioning Delta tables by high-cardinality columns. The old Hive habit of PARTITIONED BY (customer_id) creates the small-files problem it was meant to solve. On Delta, liquid clustering replaces partitioning for almost everything. Reserve physical partitioning for genuinely low-cardinality, always-filtered columns (e.g., a date for retention/deletion), and let clustering handle the rest.
Layer 2: shuffle less — AQE and partition sizing
Shuffles are the expensive stage boundaries. You can't eliminate all of them, but you can keep them well-sized and let AQE do the heavy lifting.
Confirm AQE is on (it's your best friend)
Adaptive Query Execution re-optimizes at runtime using real statistics: it coalesces too-many-tiny shuffle partitions, flips sort-merge joins to broadcast when a side turns out small, and splits skewed partitions. It's on by default on modern Databricks runtimes — but I always confirm, because a surprising number of "slow Spark" tickets are jobs that disabled it years ago and never turned it back on.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
The shuffle-partition target
The classic mistake is the default spark.sql.shuffle.partitions = 200 applied to every job regardless of size. Too few and each partition is huge and spills to disk; too many and you drown in task-scheduling overhead on tiny partitions. The rule of thumb that's served me well: aim for ~128–200 MB of shuffled data per partition. With AQE's coalescing enabled you can set a generous partition count and let it merge down at runtime — which is exactly the point of AQE, so I lean on it rather than hand-tuning per job.
Watch for spill
Spill — Spark writing shuffle or sort data to disk because it didn't fit in executor memory — is one of the biggest silent performance killers, and it shows up plainly in the Spark UI as "spill (memory)" and "spill (disk)." Spill usually means partitions are too big (raise partition count / let AQE coalesce less aggressively) or a single partition is skewed (next section). Memory-bound spill is also a signal you may want a memory-optimized instance type rather than just more nodes.
Layer 3: the join decision — broadcast vs sort-merge
Joins are where shuffles are born, and the single best optimization is to avoid the shuffle entirely with a broadcast join: when one side is small enough, ship it in full to every executor so the large side never has to move. No shuffle, no stage boundary, dramatically faster.
from pyspark.sql.functions import broadcast
# Force-broadcast the small dimension table
result = orders.join(broadcast(dim_customers), "customer_id")
AQE will often choose this automatically once it sees the runtime size, but an explicit broadcast() hint is worth it when you know a table is a small dimension and don't want to gamble on the optimizer's estimate. The flip side: don't broadcast something that's actually large — you'll OOM the driver collecting it and every executor holding a copy. The honest heuristic is single-digit-hundreds of MB and below; above that, let it sort-merge.
Layer 4: the boss fight — data skew
Skew is the performance problem that humbles people, because it doesn't respond to any of the obvious levers. The symptom: one task in a stage runs for minutes while the other 199 finished in seconds. The cause: a shuffle key with a wildly uneven distribution — one customer_id with 40% of the rows, a NULL join key that swallows everything, a single hot product. All the matching rows hash to one partition, one task gets all of them, and that task is your runtime.
graph TD
K["Shuffle by customer_id"] --> P1["Partition A
~5k rows · 2s ✅"]
K --> P2["Partition B
~5k rows · 2s ✅"]
K --> P3["Partition C — the 'whale'
~4M rows · 6 min ⛔ straggler"]
P1 --> DONE["Stage can't finish
until the straggler does"]
P2 --> DONE
P3 --> DONE
One hot key turns a stage into a single-task job. Adding nodes does nothing — the work is stuck on one task. This is why "throw more cluster at it" so often fails.
The fixes, in the order I try them:
- Turn on AQE skew-join handling.
spark.sql.adaptive.skewJoin.enableddetects oversized partitions and automatically splits them into sub-partitions. This alone resolves a large share of real-world skew with zero code change — try it first. - Filter the junk key. Astonishingly often the "hot key" is
NULLor a sentinel like-1/'UNKNOWN'that shouldn't participate in the join at all. Filtering it before the join can make the problem vanish. - Broadcast the other side if it's small enough — no shuffle, no skew.
- Salt the key as a last resort: append a random suffix to the hot key on both sides to spread it across partitions, then aggregate back. It works, but it's invasive and ugly, so I only reach for it when AQE skew handling genuinely isn't enough.
Layer 5: keep it on Photon — avoid UDFs
From Part 2: Photon accelerates built-in SQL and DataFrame operations in vectorized C++, but a Python (or even Scala) UDF can knock the query off the Photon path and back onto row-at-a-time JVM execution — and a Python UDF additionally pays serialization cost shuttling rows to a Python process and back. The discipline: express logic in built-in functions wherever remotely possible. Spark's function library is enormous; the regex, date math, JSON parsing, and conditional logic you'd reach for a UDF to do almost always has a native equivalent that stays in Photon. When you genuinely need custom logic, prefer pandas UDFs (vectorized, Arrow-based) over plain Python UDFs, and check the plan to see how much fell back.
Layer 6: caching, and when it's a trap
Caching (df.cache() / Delta caching on the SSD) helps when you read the same data many times in one session — iterative ML, repeated exploration. But it's frequently misapplied: caching data you read once wastes memory you needed for execution and can cause spill. My rule: cache only when you can name the specific re-reads it saves, and let the Databricks SSD cache (automatic on appropriate instance types) handle the common case rather than caching manually. Caching is a targeted tool, not a default.
The cluster, last — not first
I put cluster sizing last on purpose, because reaching for it first is the classic anti-pattern: a 40-minute job dominated by skew or a full scan does not get better with more nodes. Once the query is actually efficient, then the cluster questions are worth asking — match the instance type to the bottleneck (memory-optimized for spill-prone shuffles, compute-optimized for Photon-heavy aggregation), turn on autoscaling for variable workloads, set aggressive auto-termination so idle clusters don't bleed money, and use job clusters for production rather than leaving all-purpose clusters running. The cost side of this is its own discipline — I cover it in FinOps for data platforms.
The playbook, in one table
| Symptom in the plan | Likely cause | First move |
|---|---|---|
| Huge bytes scanned vs. data needed | No file skipping — bad layout | OPTIMIZE + liquid clustering on the filter columns |
| One task runs 10×+ longer than peers | Data skew on a shuffle key | Enable AQE skew join; filter NULL/sentinel keys |
| Spill (disk) in a shuffle stage | Partitions too big / skew | Let AQE coalesce; check skew; memory-optimized nodes |
| Expensive shuffle for a small-table join | Sort-merge where broadcast would do | broadcast() the small side |
| Stage off the Photon path | Python/Scala UDF in the plan | Replace with built-in functions; pandas UDF if not |
| Thousands of tiny files | Frequent small writes | OPTIMIZE; reduce write frequency / use a merge |
The closing principle
Fast Spark is not a config file — it's a habit of reasoning. Read less by laying data out so the Delta log can skip it. Shuffle less, and evenly, by trusting AQE, sizing partitions, broadcasting small joins, and killing skew. Keep the compute vectorized by staying on Photon. And always — always — read the plan before you touch a knob, because the plan tells you which of those three is actually your problem. Get that ordering right and you'll fix in an afternoon what teams spend weeks guessing at.
In Part 4, the finale, all of this stops being abstract: I build a real, regulated health data lakehouse on Databricks — medallion architecture, HL7/FHIR ingestion, Unity Catalog governance for PHI, and the performance and compliance decisions made together, because in healthcare you don't get to optimize one without the other.
🧱 Continue the series
- The Data Intelligence Platform: A Practitioner's Overview
- Internals: Photon, the Delta Log, and How a Query Actually Runs
- Spark Performance Optimization: AQE, Shuffle, Skew & Data Layout (this article)
- Building a HIPAA-Compliant Health Data Lakehouse →