← Back to Blog

Databricks Internals: Photon, the Delta Log, and How a Query Actually Runs

🧱 This is Part 2 of a 4-part series: Databricks Deep Dive

  1. The Data Intelligence Platform: A Practitioner's Overview
  2. Internals: Photon, the Delta Log, and How a Query Actually Runs (you are here)
  3. Spark Performance Optimization: AQE, Shuffle, Skew & Data Layout
  4. Building a HIPAA-Compliant Health Data Lakehouse

You can run Databricks for years treating it as a black box that turns SQL into answers. Plenty of people do. But the moment a job is mysteriously slow, or a "simple" query scans ten times more data than it should, the black box stops being comfortable — and the engineers who can fix it fast are the ones who know what's happening underneath. This article is that layer. By the end you should be able to look at a query and reason about how many files it'll read, why it shuffled, and whether Photon is helping.

In Part 1 I described the platform shape — control plane, compute plane, open tables, one governance layer. Now we open three boxes: the Delta transaction log (how files become a reliable table), the Spark execution model (how a query becomes distributed work), and Photon (why some of that work runs in C++ instead of the JVM). Then we trace one query through all three. The payoff is Part 3, where this understanding becomes a tuning playbook.

The Delta transaction log: how files become a table

A Delta table is a directory in object storage. Inside it are Parquet data files and a _delta_log subdirectory. That log folder is the entire trick — it's what makes a scattered set of immutable files behave like a transactional table on storage (S3/ADLS/GCS) that itself offers no transactions.

Every change to the table writes a new JSON commit file to the log: 00000000000000000000.json, then ...001.json, and so on, monotonically. Each commit records actions — which Parquet files were added, which were removed (logically; the bytes stay until vacuumed), and metadata or schema changes. The table's current state is the result of replaying the log from the start.

my_table/
├── part-00000-....snappy.parquet      ← data files (immutable)
├── part-00001-....snappy.parquet
├── part-00002-....snappy.parquet
└── _delta_log/
    ├── 00000000000000000000.json      ← commit 0: added files A, B
    ├── 00000000000000000001.json      ← commit 1: removed A, added C
    ├── 00000000000000000002.json      ← commit 2: added D, E
    └── 00000000000000000010.checkpoint.parquet   ← Parquet snapshot of state

Two consequences fall straight out of this design:

  • ACID without a database. A reader determines the table version by listing the log and reading the latest commits; a writer commits by atomically creating the next-numbered JSON file. If two writers race for the same commit number, only one wins the create — the other retries against the new state. That's optimistic concurrency, and it's how you get atomic, isolated transactions on plain object storage.
  • Time travel is free. Because every version is just "the log replayed up to commit N," you can read the table as of any version or timestamp — SELECT * FROM t VERSION AS OF 5 — with no extra storage beyond the files that version referenced. It's the same insight behind Snowflake's Time Travel, which I covered in Snowflake Internals, reached by a different mechanism.

Checkpoints and why metadata stays fast

Replaying ten thousand JSON files to learn the current state would be miserable. So every 10 commits (by default) Delta writes a checkpoint — a Parquet file summarizing the entire table state at that point. A reader loads the latest checkpoint and then only the handful of JSON commits after it. This is why Delta metadata stays fast even on tables with millions of files: you never replay from zero.

The performance payload: statistics and data skipping

Here's the part that matters most for query speed. When Delta adds a data file, the commit also stores per-file statistics — for the leading columns, the min and max values, the null count, the row count. These live in the log, not the data files.

That means before reading a single byte of Parquet, the engine can consult the log and skip files that can't possibly match. Query WHERE order_date = '2025-03-01' and any file whose min/max date range excludes that day is never opened. This is data skipping, and on a well-laid-out table it's the difference between scanning 4 files and scanning 4,000. The entire point of OPTIMIZE, Z-ordering, and liquid clustering — all of Part 3 — is to arrange data so these min/max ranges are tight and skipping is maximally effective.

The mental model to keep: the Delta log is a metadata index that the engine reads first. Half of query performance is decided before any data is touched — by how many files the log lets the engine skip. Bad data layout = loose min/max ranges = nothing skipped = full scan.

The Spark execution model: query → jobs → stages → tasks

Now the compute side. When you submit a query or an action, the cluster's driver builds a plan and breaks it into a hierarchy of work that gets distributed to executors. The vocabulary is worth getting exactly right, because every performance discussion uses it:

UnitWhat it isBoundary
JobAll the work triggered by one action (e.g., a write, a collect)One per action
StageA set of tasks that can run without moving data between executorsA new stage starts at every shuffle
TaskThe unit of execution — one task processes one partition of data on one coreOne per partition per stage

The crucial concept is the stage boundary, and it's always a shuffle. Some operations — filter, project, map — are narrow: each output partition depends on exactly one input partition, so the work stays local on each executor. Others — join, groupBy, distinct, window functions — are wide: output partitions depend on many input partitions, so data must be redistributed across the network so that all rows with the same key land on the same executor. That redistribution is the shuffle, and it's the single most expensive thing Spark does.

graph LR
    subgraph S1["Stage 1 (narrow — no data movement)"]
        T1["Task: scan + filter partition 1"]
        T2["Task: scan + filter partition 2"]
        T3["Task: scan + filter partition 3"]
    end
    SH(["SHUFFLE
redistribute by join/group key
(network + disk write/read)"]) subgraph S2["Stage 2 (after shuffle)"] T4["Task: aggregate key-group A"] T5["Task: aggregate key-group B"] end T1 --> SH T2 --> SH T3 --> SH SH --> T4 SH --> T5

A shuffle ends one stage and begins the next: every executor writes its data out partitioned by the key, then every executor reads back the partitions it owns. It crosses the network and hits disk. Almost every Spark performance problem is either "too much shuffle" or "shuffle skewed onto one task" — which is why Part 3 spends most of its time here.

Catalyst and AQE: the planning brain

Before any of that runs, the Catalyst optimizer turns your SQL or DataFrame code into a logical plan, rewrites it (predicate pushdown, column pruning, constant folding, join reordering), and picks physical operators — crucially, which join strategy to use (broadcast vs sort-merge). On top of that, Adaptive Query Execution (AQE) re-optimizes at runtime using actual statistics gathered as stages complete: it coalesces too-many-tiny shuffle partitions, switches a sort-merge join to a broadcast join when it discovers a side is small, and splits skewed partitions. AQE is on by default and it's the reason a lot of mediocre code runs acceptably anyway — and the first thing I confirm is enabled when something's slow. It's a headline act in Part 3.

Photon: why some of this runs in C++

Classic Spark executes on the JVM, processing data largely a row at a time through generated Java code. That's portable and flexible, but it leaves performance on the table: garbage-collection pauses, JIT warm-up, per-row interpretation overhead, and no real use of modern CPU vector instructions. Photon is Databricks' answer — a query engine rewritten from scratch in C++ that the runtime drops in beneath Spark for eligible operations.

Photon doesn't replace Spark; it accelerates the parts it can and hands the rest back. What it does differently:

  • Vectorized, columnar execution. Instead of one row at a time, Photon processes batches of column values in tight loops that the CPU can pipeline and run with SIMD vector instructions. Columnar + batched is dramatically more cache- and CPU-friendly than row-at-a-time.
  • Native C++, no JVM. No garbage collection pauses, no JIT warm-up, tighter memory control. For CPU-bound work — aggregations, joins, filters over lots of rows — this is where the speedups come from.
  • Built for Delta and Parquet. It's tuned for exactly the columnar formats the lakehouse uses, and pairs naturally with data skipping: skip files with the log, then crunch what's left with vectorized C++.

Photon isn't magic, and it isn't universal. It accelerates SQL and DataFrame operations, but it doesn't run arbitrary Python/Scala UDFs or every exotic operator — those fall back to Spark, and a single non-Photon operator in the middle of a plan can force a chunk of the query back onto the JVM. The practical lesson, which Part 3 hammers: prefer built-in functions over UDFs, partly because the built-ins stay in Photon. And Photon-enabled compute carries a higher DBU rate — it's worth it when it actually accelerates your workload, and pure waste when your job is I/O-bound or dominated by UDFs.

Tracing one query end to end

Let's put all three boxes together. You run:

SELECT c.region, SUM(o.net_amount) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2025-01-01'
GROUP BY c.region;

Here's the journey:

  1. Plan (driver, Catalyst). The query is parsed and optimized. The order_date filter is pushed down. Catalyst decides the join strategy — if customers is small, a broadcast join (ship it to every executor, no shuffle for the join); otherwise a sort-merge join (shuffle both sides by customer_id).
  2. Prune files (Delta log). Before reading data, the engine consults the orders transaction log and uses the per-file min/max stats on order_date to skip every file entirely before 2025. Only surviving files are scheduled to be read. This decision already determined most of the runtime.
  3. Stage 1 — scan + filter (narrow, Photon). Tasks read the surviving Parquet files, applying the filter. Photon does this vectorized in C++. One task per file-partition, spread across executors. No data movement yet.
  4. Shuffle (if sort-merge). If the join wasn't broadcast, both sides shuffle by customer_id so matching rows co-locate. Stage boundary. The GROUP BY region also needs a shuffle by region for the final aggregation.
  5. Stage 2 — join + aggregate (Photon). Co-located rows are joined and summed per region, again vectorized where Photon is engaged.
  6. AQE adjusts along the way. As stage 1 finishes, AQE sees the actual data sizes — it might coalesce the shuffle partitions if they came out tiny, flip to a broadcast join if the filtered customers turned out small, or split a skewed customer_id partition.
  7. Result to driver, then to you. Final partitions return to the driver and back to your notebook or BI tool.

Every lever in Part 3 targets one of these steps: data layout makes step 2 skip more files; broadcast joins eliminate the step-4 shuffle; partition tuning right-sizes step-3 and step-5 tasks; avoiding UDFs keeps steps 3 and 5 in Photon; fixing skew rescues step 5 from one straggler task.

What to take into Part 3

Three sentences hold the whole article. The Delta log decides how much data you even read — via statistics and file skipping, before compute starts. Shuffles are the expensive stage boundaries, created by wide operations, and most performance problems are too much shuffle or skewed shuffle. Photon makes the compute that does happen faster, in vectorized C++, as long as you don't knock the query off the Photon path with UDFs.

Hold those three and Databricks performance stops being folklore. In Part 3 we turn each into concrete, measurable tuning — AQE, partition sizing, join strategy, skew handling, caching, and the data-layout decisions (OPTIMIZE, Z-order, liquid clustering) that make data skipping actually work.