Building a HIPAA-Compliant Health Data Lakehouse on Databricks

🧱 This is Part 4 of a 4-part series: Databricks Deep Dive

The Data Intelligence Platform: A Practitioner's Overview
Internals: Photon, the Delta Log, and How a Query Actually Runs
Spark Performance Optimization: AQE, Shuffle, Skew & Data Layout
Building a HIPAA-Compliant Health Data Lakehouse (you are here)

The first three parts of this series were about the platform in the abstract — what Databricks is, how it works under the hood, how to make it fast. This one is the finale because it's where the abstraction meets a domain that punishes hand-waving: healthcare. When the data is protected health information and genomic results, you don't get to design the pipeline first and bolt compliance on afterward. The governance is the architecture. So this is an end-to-end build of a health data lakehouse where I made the engineering and the compliance decisions together — because in this domain that's the only way they hold.

If you've followed my clinicogenomics work on Snowflake, you'll recognize the shape of the problem — linked clinical and genomic data, the same HIPAA constraints, the same de-identification headaches. This is the Databricks answer to the same class of problem, and the contrast is instructive in its own right.

The shape of the data, and why it's hard

Health data is heterogeneous in a way that breaks naive pipelines. In one platform you're ingesting:

HL7v2 messages — the decades-old pipe-delimited standard still carrying most real-time hospital traffic (admissions, lab orders, results).
FHIR resources — the modern JSON/REST standard, deeply nested, the direction everything is moving.
Batch clinical extracts — claims, EHR dumps, registries, in everything from CSV to fixed-width.
Genomic outputs — annotated variants and feature tables derived from VCFs, which are large and have their own scaling profile.
Unstructured clinical notes and documents — free text, PDFs, sometimes images.

The lakehouse is genuinely well-suited to this precisely because it doesn't force everything into one shape up front — you can land raw, refine progressively, and keep structured and unstructured side by side under one governance layer. That progressive refinement is the medallion architecture, and here it's not a style choice; it's how you keep raw PHI quarantined from the analytics surface.

The medallion architecture, with compliance baked into each layer

graph LR
    subgraph SRC["Sources"]
        HL7["HL7v2 feeds"]
        FHIR["FHIR API / bulk export"]
        BATCH["Claims / EHR extracts"]
        GEN["Genomic feature tables"]
    end
    subgraph BRONZE["🥉 Bronze — raw, immutable"]
        B["As-received, append-only
full PHI · locked-down access
audit of every ingest"]
    end
    subgraph SILVER["🥈 Silver — conformed"]
        S["Parsed to FHIR-aligned model
validated · deduplicated
PHI tagged · masking applied"]
    end
    subgraph GOLD["🥇 Gold — analytics-ready"]
        G["Cohorts · features · marts
de-identified / limited datasets
what most users actually touch"]
    end
    HL7 --> B
    FHIR --> B
    BATCH --> B
    GEN --> B
    B --> S --> G

Bronze is the raw landing zone — full PHI, append-only, access restricted to the ingestion service and a tiny break-glass group. Silver is where parsing, validation, deduplication, and PHI tagging happen. Gold is the de-identified or limited-dataset surface most analysts and models ever see. The layering is itself a privacy control: the blast radius of raw PHI shrinks at each step.

Bronze: land everything, change nothing

The discipline at bronze is to capture data exactly as received and never transform it — because in a regulated setting your raw layer is also your evidence of what arrived and when. HL7v2 lands as raw messages, FHIR bundles land as raw JSON, all append-only into Delta. Auto Loader handles incremental file ingestion; for the streaming HL7 feeds, structured streaming writes into bronze continuously. Access to bronze is the most restricted in the whole platform.

Silver: parse, conform, validate, tag

Silver is where the real engineering lives. HL7v2 segments and nested FHIR resources get flattened into a conformed, FHIR-aligned relational model. This is also where I lean on Lakeflow Declarative Pipelines (from Part 1) for the data-quality expectations — because in healthcare a bad parse isn't a cosmetic bug, it's a patient record that's silently wrong:

import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Conformed observations, FHIR-aligned")
@dlt.expect_or_drop("valid_patient", "patient_id IS NOT NULL")
@dlt.expect_or_drop("valid_code", "observation_code IS NOT NULL")
@dlt.expect("plausible_value", "value_numeric BETWEEN -1000 AND 100000")
def observation_silver():
    return (
        dlt.read_stream("observation_bronze")
           .transform(parse_fhir_observation)   # nested JSON -> columns
           .dropDuplicates(["observation_id", "version_id"])
    )

The expect_or_drop expectations quarantine records that fail hard rules; the softer expect tracks quality without dropping, so you can monitor parsing drift. Silver is also where every PHI column gets tagged in Unity Catalog — which is what makes the next section possible.

Governance: where Unity Catalog earns its place

This is the part that justifies Databricks for regulated data, and it's the through-line from Part 1. Three controls did the heavy lifting.

Tag PHI once, mask it everywhere

Rather than hand-writing masking on each table, I tag columns as PHI in Unity Catalog and attach a column-mask function that keys off the caller's group. Clinicians see real values; analysts see masked ones; the policy lives in one place and applies across every workspace and query path:

-- A reusable masking function: real value only for authorized groups
CREATE FUNCTION hc.gov.mask_phi(val STRING)
  RETURN CASE
    WHEN is_account_group_member('phi_authorized') THEN val
    ELSE 'REDACTED'
  END;

-- Apply it to a column
ALTER TABLE hc.silver.patient
  ALTER COLUMN mrn SET MASK hc.gov.mask_phi;

Row-access policies for cohort and consent boundaries

Masking hides columns; row-access policies hide rows — essential when a research team is only permitted to see patients enrolled in their study, or consent restricts secondary use. The same pattern: a function evaluated per query, returning only the rows the caller's role is entitled to.

Lineage and audit as compliance evidence

The control that auditors actually care about is provenance: prove that a number in a report traces back to source, and prove who accessed PHI and when. Unity Catalog's column-level lineage and access audit logs give both, automatically, as a property of running queries through the catalog. I've written about why this evidence layer is the real regulatory requirement — in banking and across the life-sciences regulatory stack — and on Databricks it's a built-in rather than something you assemble.

De-identification is harder than masking, especially with a genome. Masking an MRN does not de-identify a dataset. HIPAA offers two paths — Safe Harbor (strip 18 specified identifiers) and Expert Determination (a qualified expert certifies the re-identification risk is very small). The catch I keep flagging: a genome isn't on Safe Harbor's list of 18 identifiers, yet it's inherently identifying — so a clinicogenomic gold dataset generally needs Expert Determination, not Safe Harbor. The lakehouse gives you the tooling to implement a de-identification scheme; it can't tell you which one is defensible. That's a determination, made with compliance, before gold is shared. I worked through this nuance in depth in the Snowflake clinicogenomics governance piece, and it applies identically here.

The genomics scale problem

Genomic feature tables are where the performance lessons from Part 3 stop being academic. Variant data is tall and skewed — some genomic regions and some samples carry far more rows than others, which is textbook data skew at join time (joining variants to a gene annotation reference, or to per-sample clinical features). The fixes from Part 3 are exactly the ones that work: enable AQE skew-join handling, broadcast the (relatively small) annotation reference, and lay out the variant tables with liquid clustering on the columns you actually filter and join on — typically chromosome/position and sample/patient id — so the Delta log can skip the genomic regions a given cohort query doesn't need.

This is the payoff of having read the internals: "the variant join is slow" isn't a mystery, it's a recognizable skew-plus-layout problem with a known playbook.

The platform-level guardrails

Around the data architecture sit the account-level controls that make the whole thing HIPAA-defensible, and they're preconditions, not afterthoughts:

A Business Associate Agreement with Databricks (and your cloud provider) — you cannot process PHI on any platform without one. This is procurement's day-one job, not a technical detail.
Customer-managed encryption keys for data at rest, and encryption in transit everywhere.
Network isolation — private connectivity between the compute plane and storage, no public endpoints for the workspace (the control-plane/compute-plane split from Part 1 is what makes this clean: your data never traverses the public internet to Databricks).
The compliance security profile for the workspace, which enforces the hardened configuration HIPAA workloads expect.
Audit log delivery to your own storage, so the access record is yours and immutable.

Lessons learned

Governance first, or governance never. Start on Unity Catalog with PHI tagging from the first bronze table. Retrofitting masking and lineage onto a year-old Hive-metastore lakehouse is the migration nobody enjoys, and in a regulated shop it's also a period of unquantified risk.
The medallion layering is a privacy control, not just a quality one. Treat bronze as a locked vault, do the sensitive work in silver, and make gold the de-identified surface. The shrinking blast radius of raw PHI is the point.
Masking ≠ de-identification. Know the difference, get Expert Determination where Safe Harbor doesn't reach (it doesn't reach genomes), and make that a compliance decision before gold leaves the building.
Genomics is a skew problem. Everything from Part 3 applies — AQE skew handling, broadcast the reference, cluster on your join/filter columns.
The BAA and network setup are the real day one. The data engineering is the easy part; the legal and networking preconditions gate everything and take longer than anyone budgets.

Closing the series

We started four articles ago with a one-sentence definition: a control plane orchestrating compute in your cloud, over open tables in your storage, with one governance layer across all of it. Healthcare is the domain that proves why each clause of that sentence matters — your data staying in your storage clears the security review, the open format keeps the bioinformatics tooling working, and the single governance layer is what turns "we think this is compliant" into "we can prove it." The internals (Part 2) explained why it performs; the tuning (Part 3) made it perform; and here the governance made it allowed. In a regulated domain, all three have to be true at once — which is, in the end, the whole argument for a lakehouse over a pile of disconnected services.

🧱 The Databricks Deep Dive series

The Data Intelligence Platform: A Practitioner's Overview
Internals: Photon, the Delta Log, and How a Query Actually Runs
Spark Performance Optimization: AQE, Shuffle, Skew & Data Layout
Building a HIPAA-Compliant Health Data Lakehouse (this article)