The Databricks Data Intelligence Platform: A Practitioner's Overview

🧱 This is Part 1 of a 4-part series: Databricks Deep Dive

The Data Intelligence Platform: A Practitioner's Overview (you are here)
Internals: Photon, the Delta Log, and How a Query Actually Runs
Spark Performance Optimization: AQE, Shuffle, Skew & Data Layout
Building a HIPAA-Compliant Health Data Lakehouse

Databricks is one of those platforms that everyone has an opinion about and surprisingly few people can describe accurately. Ask ten engineers what it is and you'll get "managed Spark," "a lakehouse," "Delta Lake," "an ML platform," and "expensive" — all true, all partial. After running it in anger across a couple of organizations, the description I've settled on is more useful: Databricks is a control plane that orchestrates compute in your cloud account, sitting on top of open table formats in your own object storage, with a governance layer stretched across all of it. Once that sentence makes sense, everything else clicks into place.

This is the first of four pieces. Here I'm mapping the platform — the architecture, the components that matter, and an honest read on what it's genuinely good at. Part 2 goes under the hood (Photon, the Delta transaction log, how a query actually executes). Part 3 is the performance-tuning playbook. Part 4 builds something real on it — a regulated health data lakehouse. If you've read my Fabric vs Databricks comparison, this is the deeper Databricks half of that story.

The split that explains everything: control plane vs compute plane

The single most clarifying fact about Databricks is that it is split in two, and the two halves live in different places.

The control plane is the Databricks-managed SaaS layer — the web UI, notebooks, job scheduler, cluster manager, Unity Catalog metadata, query history. It runs in Databricks' own cloud account. Your code and notebooks live here; your data does not.
The compute plane is where the actual work happens: clusters of VMs that spin up in your own cloud account (or, for serverless, in a Databricks-managed account with strict isolation), read from your object storage, crunch the data, and write results back. When you "run a notebook," the control plane provisions compute in the compute plane, ships your code to it, and streams results back.

graph TD
    subgraph CP["Control plane — Databricks-managed SaaS"]
        UI["Workspace UI · notebooks"]
        JOBS["Jobs / Lakeflow scheduler"]
        UC["Unity Catalog (governance + metadata)"]
        QH["Query history · cluster manager"]
    end
    subgraph CMP["Compute plane — runs in / near your cloud account"]
        CL["Clusters (Spark + Photon)"]
        SQLW["SQL Warehouses"]
        SRV["Serverless compute"]
    end
    subgraph STORE["Your cloud object storage"]
        S3["S3 / ADLS / GCS — Delta & Parquet files"]
    end
    UI --> JOBS --> CL
    UC -. governs .-> CL
    UC -. governs .-> SQLW
    CL --> S3
    SQLW --> S3
    SRV --> S3

Code and metadata live in the control plane; data lives in your storage; compute runs in the compute plane and reads your storage directly. This separation is why "your data stays in your own storage" holds, and why networking and IAM setup is most of the day-one effort.

This architecture is the reason for two things people trip over. First, your data stays in your storage. With classic compute, Databricks reads it using VMs that run in your own account; serverless moves the VMs into a Databricks-managed account but leaves the data where it is. That is why the platform passes the security reviews that matter for regulated data (more on that in Part 4). Second, most of the day-one pain is networking and IAM: VPCs, private endpoints, instance profiles, storage credentials. The data work is easy; convincing the compute plane and your storage to trust each other is the part that eats the first week.

The foundation: Delta Lake and open table formats

Everything in Databricks sits on Delta Lake: Parquet files plus a transaction log that turns a pile of files in object storage into something with ACID transactions, time travel, schema enforcement, and efficient metadata. I'll dig into how the log actually works in Part 2 — for now the thing to internalize is that a Delta table is just files in your bucket, plus a _delta_log folder that records the truth about which files belong to which version of the table.
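
To make the "_delta_log records the truth" idea concrete, here is a minimal sketch in plain Python of replaying a log to find the files that make up a given table version. It assumes the open Delta protocol's layout of one JSON commit per version, each holding newline-delimited actions such as add and remove. The commit contents and file names below are invented for illustration, and real readers also handle checkpoints, metadata, and protocol actions that this skips.

```python
import json

# Hypothetical commits: version number -> newline-delimited JSON actions,
# mimicking files like _delta_log/00000000000000000000.json.
COMMITS = {
    0: '{"add": {"path": "part-0000.parquet"}}\n{"add": {"path": "part-0001.parquet"}}',
    1: '{"add": {"path": "part-0002.parquet"}}',
    # A compaction-style rewrite: two small files replaced by one larger file.
    2: '{"remove": {"path": "part-0000.parquet"}}\n'
       '{"remove": {"path": "part-0001.parquet"}}\n'
       '{"add": {"path": "part-0003.parquet"}}',
}

def live_files(commits, as_of_version):
    """Replay commits 0..as_of_version and return the set of live data files."""
    files = set()
    for version in sorted(commits):
        if version > as_of_version:
            break
        for line in commits[version].splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

print(sorted(live_files(COMMITS, as_of_version=2)))
print(sorted(live_files(COMMITS, as_of_version=1)))
```

Time travel falls out of the same replay: asking for version 1 stops before the rewrite, so the two original files are still part of the table at that version.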

That openness is strategic, not incidental. Because the format is open (and increasingly interoperable with Iceberg via UniForm), other engines can read your tables, and you're not locked into reading your own data only through Databricks. I wrote about why this matters across the industry in Open Table Formats; on Databricks it's the load-bearing assumption beneath the whole "lakehouse" pitch — you get warehouse-grade reliability on data-lake-grade storage you own.

The governance layer: Unity Catalog

If Delta is the foundation, Unity Catalog is the thing that makes Databricks an enterprise platform rather than a powerful toolbox. It's the single governance layer across every workspace, every cluster, and every asset type — tables, views, volumes (for files), ML models, and functions — with a three-level namespace (catalog.schema.object) and one place to define access policies that apply everywhere.

I gave Unity Catalog its own article, so I won't repeat it all, but the points that matter for understanding the platform shape:

One permission model for everything. The same grant system covers structured tables, unstructured files (volumes), and models. You stop having three different access stories.
Lineage and audit are built in. Column-level lineage and an audit log come from the catalog itself, which is why Databricks is a defensible answer for regulated workloads — you can prove who touched what.
It spans workspaces. A table governed once is governed in every workspace attached to the same metastore: every team, every notebook, every SQL query, without per-workspace re-grants.
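
As a rough illustration of the three-level namespace and the single grant model, the sketch below builds the SQL statements a reader typically needs on a Unity Catalog table: USE CATALOG on the catalog, USE SCHEMA on the schema, and SELECT on the table itself. The helper names, the table main.clinical.encounters, and the analysts group are all made up for the example; the functions only produce strings and do not talk to a workspace.

```python
def split_name(full_name):
    """Split 'catalog.schema.object' into its three parts."""
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.object, got {full_name!r}")
    return tuple(parts)

def read_grants(full_name, principal):
    """Return the GRANT statements that let `principal` query one table."""
    catalog, schema, table = split_name(full_name)
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`",
    ]

# Hypothetical table and group names, for illustration only.
statements = read_grants("main.clinical.encounters", "analysts")
for statement in statements:
    print(statement)
```

The same pattern of privileges on a named securable applies to volumes, models, and functions, which is the practical meaning of "one permission model for everything".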

If you're starting fresh, start with Unity Catalog from day one. The painful migrations I've seen were teams who built on the legacy Hive metastore for a year and then had to retrofit governance. Beginning with UC costs nothing extra and saves you that migration.

The engines: Spark, Photon, and SQL Warehouses

Databricks runs two execution engines, and they cooperate rather than compete. Apache Spark is the distributed processing framework — the thing that splits a job across a cluster. Photon is a vectorized C++ engine that transparently accelerates a large fraction of Spark SQL and DataFrame operations by executing them natively instead of on the JVM. You don't rewrite anything to use Photon; you turn it on and eligible operations run faster. The mechanics — why vectorized C++ beats JVM row-at-a-time, and what's not eligible — are Part 2's subject.
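
For a feel of what "vectorized" means before Part 2, here is a toy contrast in plain Python between evaluating a query one row at a time and evaluating it operator by operator over column batches. This is not how Photon is implemented, and Python gains no speed from it; the data and function names are invented, and the only point is the difference in shape between the two loops.

```python
# Hypothetical order lines; the query is "total revenue where qty > 1".
ROWS = [
    {"price": 10.0, "qty": 2},
    {"price": 5.0, "qty": 1},
    {"price": 2.5, "qty": 4},
    {"price": 8.0, "qty": 3},
]

def row_at_a_time(rows):
    """Evaluate the filter and the arithmetic once per record."""
    total = 0.0
    for row in rows:
        if row["qty"] > 1:
            total += row["price"] * row["qty"]
    return total

def to_columns(rows):
    """Pivot row records into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}

def batch_at_a_time(columns, batch_size=2):
    """Run each operator over a whole column batch before the next operator."""
    total = 0.0
    n = len(columns["qty"])
    for start in range(0, n, batch_size):
        price = columns["price"][start:start + batch_size]
        qty = columns["qty"][start:start + batch_size]
        keep = [q > 1 for q in qty]                    # filter over the batch
        revenue = [p * q for p, q in zip(price, qty)]  # arithmetic over the batch
        total += sum(r for r, k in zip(revenue, keep) if k)
    return total

print(row_at_a_time(ROWS), batch_at_a_time(to_columns(ROWS)))
```

Both functions return the same answer. In a native engine the batch form is the one that maps onto tight loops over contiguous column data, which is the property Part 2 looks at in detail.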

How you access that compute comes in a few shapes, and picking the right one is most of cost control:

| Compute type | What it's for | Notes |
| --- | --- | --- |
| All-purpose clusters | Interactive notebook work, exploration, development | Stay alive while you work; the easiest way to burn money if you forget auto-termination. |
| Job clusters | Scheduled production pipelines | Spun up for the job, torn down after; cheaper per run, no idle cost. |
| SQL Warehouses | BI / SQL analytics, dashboards, ad-hoc SQL | Photon-powered, fast to start (serverless variant starts in seconds), what your analysts and BI tools connect to. |
| Serverless | Notebooks, jobs, and SQL without managing VMs | Databricks manages the pool; near-instant start, you pay only for use. Where the platform is clearly heading. |
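
To see why the compute choice dominates the bill, here is some back-of-the-envelope arithmetic as a small Python sketch. It assumes the usual shape of classic-compute billing (DBUs consumed per hour times a price per DBU, plus the cloud provider's VM charge). Every rate and duration below is a made-up placeholder, not a published price.

```python
def run_cost(hours, dbu_per_hour, usd_per_dbu, vm_usd_per_hour):
    """Cost of keeping one cluster up for `hours` (DBU charge plus cloud VMs)."""
    return hours * (dbu_per_hour * usd_per_dbu + vm_usd_per_hour)

# Placeholder numbers for illustration only.
DBU_PER_HOUR = 8.0        # cluster size expressed in DBUs per hour
VM_USD_PER_HOUR = 4.0     # cloud VM charge for the same cluster
ALL_PURPOSE_RATE = 0.50   # hypothetical $/DBU for all-purpose compute
JOBS_RATE = 0.20          # hypothetical $/DBU for jobs compute

# The same one-hour nightly workload over 30 nights, run two ways.
# An all-purpose cluster left up around the clock with no auto-termination:
forgotten = run_cost(24 * 30, DBU_PER_HOUR, ALL_PURPOSE_RATE, VM_USD_PER_HOUR)
# A job cluster that exists only for the hour it works:
job = run_cost(1 * 30, DBU_PER_HOUR, JOBS_RATE, VM_USD_PER_HOUR)

print(f"forgotten all-purpose cluster: ${forgotten:,.2f}")
print(f"job cluster:                   ${job:,.2f}")
```

With these placeholder inputs most of the gap comes from the idle hours rather than from the rate difference, which is why auto-termination settings matter more than haggling over cluster size.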

The pipeline layer: Lakeflow (and Delta Live Tables)

Raw Spark jobs are flexible but you own all the orchestration, error handling, and data-quality plumbing yourself. Lakeflow Declarative Pipelines — the evolution of what was Delta Live Tables — is the managed alternative: you declare the transformations and the quality expectations, and the platform handles dependency ordering, incremental processing, retries, and monitoring. You write what the data should look like; it figures out how to keep it that way.

This is also where the medallion architecture shows up — the bronze → silver → gold pattern of progressively refining raw data into clean, business-ready tables. It's not a Databricks invention, but the platform leans into it hard, and it's the default mental model for organizing a lakehouse. I put it to real work in Part 4.
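
The declarative idea and the bronze, silver, gold flow can be sketched together in a few lines of plain Python. This is a toy, not the Lakeflow API: the table decorator, run_pipeline, and the sample visit records are all invented here, and the real product has its own Python and SQL syntax that this does not reproduce. What it does show is the division of labor: each table declares its inputs and quality expectations, and a small runner works out the order and drops rows that fail.

```python
from graphlib import TopologicalSorter

TABLES = {}  # name -> (function, upstream table names, expectations)

def table(name, depends_on=(), expect=None):
    """Register a table definition; nothing runs until run_pipeline()."""
    def register(fn):
        TABLES[name] = (fn, tuple(depends_on), dict(expect or {}))
        return fn
    return register

@table("bronze_visits")
def bronze_visits():
    # Raw records as they arrived, bad rows included (made-up data).
    return [
        {"patient_id": "p1", "minutes": 30},
        {"patient_id": None, "minutes": 15},
        {"patient_id": "p2", "minutes": -5},
        {"patient_id": "p1", "minutes": 45},
    ]

@table(
    "silver_visits",
    depends_on=["bronze_visits"],
    expect={
        "has_patient": lambda r: r["patient_id"] is not None,
        "positive_minutes": lambda r: r["minutes"] > 0,
    },
)
def silver_visits(bronze_rows):
    return bronze_rows

@table("gold_minutes_by_patient", depends_on=["silver_visits"])
def gold_minutes_by_patient(silver_rows):
    totals = {}
    for row in silver_rows:
        totals[row["patient_id"]] = totals.get(row["patient_id"], 0) + row["minutes"]
    return [{"patient_id": p, "minutes": m} for p, m in sorted(totals.items())]

def run_pipeline(tables):
    """Run tables in dependency order, dropping rows that fail an expectation."""
    graph = {name: deps for name, (_fn, deps, _exp) in tables.items()}
    results, dropped = {}, {}
    for name in TopologicalSorter(graph).static_order():
        fn, deps, expectations = tables[name]
        rows = fn(*(results[d] for d in deps))
        kept = [r for r in rows if all(check(r) for check in expectations.values())]
        results[name], dropped[name] = kept, len(rows) - len(kept)
    return results, dropped

results, dropped = run_pipeline(TABLES)
print(results["gold_minutes_by_patient"], dropped)
```

Nothing in the table definitions says when or in what order to run; that lives entirely in the runner, which is the part a managed pipeline service takes off your hands along with incremental processing, retries, and monitoring.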

The AI/ML stack

The "Data Intelligence" in the name is the bet that data and AI belong on one platform. In practice that's managed MLflow for experiment tracking, model registry, and deployment; a feature store governed by Unity Catalog; Mosaic AI for model serving, fine-tuning, and the agent/RAG tooling; and AI/BI Genie for natural-language analytics over your governed tables. The honest take: the data-engineering and ML-platform pieces are genuinely strong and tightly integrated; the generative-AI tooling is capable and improving fast, though for a pure RAG or agent build I still evaluate it against the cloud-native and specialist options I cover in the RAG series rather than assuming Databricks wins by default.

So what is it actually good at?

Stripping away the marketing, here's my honest read after running it:

Large-scale data engineering on open formats. This is the core competency and it's excellent. If you have serious volume and you want it on storage you own in a format other tools can read, Databricks is hard to beat.
Unified governance across data and ML. Unity Catalog spanning tables, files, and models with built-in lineage is a real differentiator, especially for regulated industries.
The data-science-to-production path. Notebook to MLflow to served model, all governed, is genuinely smooth.
Multi-cloud consistency. The platform behaves largely the same on AWS, Azure, and GCP, though individual features can reach one cloud before another. That is valuable if you're not all-in on one cloud.

And the honest caveats: the consumption pricing rewards discipline and punishes its absence — idle all-purpose clusters and unsuspended warehouses are where budgets quietly die (FinOps on data platforms covers the governance for that). It's a platform engineer's tool more than an analyst's; the power comes with surface area. And for pure SQL-warehouse BI without the ML and engineering ambitions, a simpler tool may serve you better — which is exactly the trade-off I worked through in the Fabric comparison.

Where the series goes next

That's the map. The thing I want you to keep is the opening sentence: a control plane orchestrating compute in your cloud, over open tables in your storage, with one governance layer across all of it. Everything else — Photon, Lakeflow, SQL Warehouses, MLflow — hangs off that frame.

In Part 2 we go under the hood: how the Delta transaction log gives you ACID on object storage, what Photon is actually doing differently, and the precise path a query takes from your notebook to bytes on disk and back. Understanding that is what turns "the job is slow" from a mystery into a diagnosis — which is exactly what Part 3 is about.
