RWE Clinicogenomics on Snowflake, Part 2: PHI Governance, Tokenization, Clean Rooms & Data Contracts

🧬 This is Part 2 of a 3-part series: RWE Clinicogenomics on Snowflake

The Migration, and the TCO Case
Governing PHI, Tokenization, Clean Rooms & the Data Contract App (you are here)
Cortex AI, Genomics in Snowpark, and the Analytics Surface

In Part 1 I argued that the database swap was the least important thing that happened in this migration. This is the part that actually justified it. The real-world evidence team didn't move to Snowflake for faster queries — Redshift was fine. They moved because governing PHI and collaborating with partners were both manual, fragile, and slow, and they wanted those to become properties of the platform rather than acts of heroism.

So this part is about three things that turned out to be one thing: how we governed sensitive data so the controls held without anyone remembering to apply them, how we let partners work with the data without ever handing over a row, and the data contract management app we built to stop the whole arrangement from quietly drifting out of compliance. I'll be concrete, because the value is in the mechanics.

The compliance frame we designed against

Before any of the controls make sense, it's worth being explicit about what we were building for, because in real-world evidence the regulations aren't a footnote — they're the design spec. A clinicogenomic dataset sits under more overlapping regimes than almost anything else in data engineering, and "we encrypted it" satisfies none of them on its own. The load-bearing ones for this platform:

HIPAA, obviously — both the Privacy Rule and the Security Rule. The first architectural consequence is unglamorous: you cannot put PHI on a cloud platform without a Business Associate Agreement, so BAAs with both Snowflake and AWS were table stakes before a single row moved. The Privacy Rule's minimum necessary principle is what our row-access and masking policies actually implement — an analyst sees the least data their task needs, by default.
De-identification has two doors, and genomics only fits through one. HIPAA gives you Safe Harbor (strip the 18 listed identifiers, §164.514(b)(2)) or Expert Determination (a qualified statistician documents that re-identification risk is "very small," §164.514(b)(1)). Here's the catch people miss: a genome is inherently identifying and isn't one of Safe Harbor's 18 identifiers, so you cannot Safe-Harbor a clinicogenomic dataset into being "de-identified." The honest path is Expert Determination plus contractual and technical controls — which is exactly why tokenization, gated re-identification, and clean rooms carry so much weight here. They aren't belt-and-braces; they're how you reach a defensible risk position at all.
Secondary-use rules. Most RWD was collected for care, not research, and consent at the point of care rarely covers future research. So secondary use runs through either de-identification or an IRB — under the Common Rule, an IRB can approve research on identifiable data, or waive consent under specific conditions. That governance lives outside the platform, but the platform has to enforce it, which is why the data contract's allowed-use and the consent categories in our row-access policies aren't decoration.
Source-specific overlays. Some data carries extra law. Substance-use-disorder records under 42 CFR Part 2 are more tightly restricted than ordinary PHI and often need to be segregated or separately consented; genetic information sits under GINA, which bars its use in employment and health-insurance decisions. Practically, that meant tagging those categories distinctly so policy could treat them differently, rather than lumping everything "sensitive" together.

None of this changes the controls I'm about to describe — it explains why they exist, and why "just mask it" was never going to be enough. Each mechanism below maps to a specific obligation:

Control	What it satisfies
BAAs with Snowflake + AWS	HIPAA precondition for processing PHI in the cloud
Classification + tagging	Knowing where PHI / genetic / Part 2 data lives — the precondition for every other control
Tag-based masking + row access	HIPAA minimum necessary; consent- and IRB-scoped access
Deterministic tokenization + gated re-identification	Expert Determination risk posture; linkage without exposing identifiers
Tri-Secret Secure (CMK) + access history	HIPAA Security Rule — encryption, audit controls, revocability
Clean rooms + minimum cohort threshold	Keeping shared outputs below a "very small" re-identification risk

Governing PHI as a property of the platform

In a regulated dataset, governance can't be a convention that people remember to follow. It has to be something the platform enforces whether or not anyone remembers. We leaned on three layers that, taken together, turned "we mask in views, mostly" into an actual control plane.

Discovery first

Snowflake's data classification will scan columns and tag what looks sensitive — names, dates of birth, identifiers — assigning system tags for the privacy and semantic category. We ran it across the whole estate as step one, then reviewed and corrected the results. The output is the thing the old stack never had: a queryable inventory of where PHI actually lives. Once a column is tagged, everything downstream can key off the tag instead of a hardcoded column list. I'll say plainly that the classifier gives you a strong draft, not a finished answer — it flagged things that weren't sensitive and missed a couple of oddly-named columns that were, so budget a human pass.

Tag-driven masking and row access

Rather than attach a masking policy to each column by hand, we attached policies to tags. Tag a column as PHI and the masking follows it everywhere — including into any new table built from it. Row access policies handled the other axis: a given analyst only sees rows for the sites and consent categories they're cleared for. Defining it once and letting the tag carry it is the difference between governance that holds and governance that erodes with every new model.

Tokenization at the boundary, with re-identification gated

For the genuinely identifying fields we used external tokenization — a masking policy that, for most roles, replaces the value with a token from an external vault, and only returns the real value to a tightly held re-identification role. The token is deterministic, which matters enormously later for clean-room joins, because two parties can join on the same tokenized patient key without either of them ever seeing the underlying MRN. This is also the mechanism that makes an Expert Determination defensible in practice: identifiers are removed from the working data, re-identification is gated behind a role almost nobody holds, and the linkage that research needs still works — privacy-preserving record linkage rather than a shared spreadsheet of MRNs.

-- Tokenize identifiers by default; only a privileged role can re-identify.
CREATE MASKING POLICY governance.tokenize_mrn AS (val STRING) RETURNS STRING ->
  CASE
    WHEN IS_ROLE_IN_SESSION('PHI_REIDENTIFY') THEN val
    ELSE tokenization.ext.tokenize(val)        -- external function → token vault
  END;

-- Attach to the tag, not the column. Every PHI-tagged identifier inherits it.
ALTER TAG governance.phi SET MASKING POLICY governance.tokenize_mrn;

Underneath all of that, the encryption story was easy to defend to the security team: Snowflake encrypts everything by default, and we wired customer-managed keys through AWS KMS (Tri-Secret Secure) so the organisation held a key in the hierarchy and could cut access if it ever needed to. Combined with access history for the audit trail, the compliance review went from "show me the masking views" to "show me the policy and the tag," which is a much shorter conversation. The one operational wrinkle worth flagging: customer-managed keys mean you now own key rotation and the blast radius of revoking a key, so that runbook needs to exist before go-live, not after.

Collaboration without handing over the data

Remember the original sore point — sharing. This is where the move actually paid for itself, and it came in two flavours.

For distributing curated, de-identified RWD products to partners who just need to query them, Secure Data Sharing replaced the old extract-and-ship ritual entirely. You grant access to a governed object; the consumer queries live data with their own compute; nothing is copied and nothing is emailed. Internally this also killed a surprising amount of duplicate data, because teams stopped making private copies "just to be safe."

The more interesting flavour was Data Clean Rooms, which is what finally made the scary collaborations boring.

graph TD
    subgraph Sponsor["Pharma sponsor"]
        SP["Trial population\n(tokenized patient key)"]
    end
    subgraph Provider["RWE provider (us)"]
        PR["Cohort + outcomes\n(tokenized patient key)"]
    end
    DCR["Data Clean Room\njoin on deterministic token\nrow-level data never exposed\nmin cohort-size threshold"]
    OUT["Aggregate output only\noverlap counts · cohort sizes\nfeasibility · external control"]
    SP --> DCR
    PR --> DCR
    DCR --> OUT

The clean-room pattern that made partner collaboration routine. Both sides contribute data keyed on the same deterministic patient token; the join happens inside the clean room; neither party sees the other's patient-level rows; and only aggregate results leave, gated by a minimum cohort-size threshold so a small group can't be reverse-engineered back to individuals.

The canonical case: a sponsor wants to understand how many patients in our cohort overlap with their trial population, or wants an external control arm, and neither side is allowed to see the other's patient-level rows. With a clean room, both parties bring their data, the join happens on the deterministically tokenized patient key, and the only thing that comes out is aggregate — with a minimum cohort-size threshold so you can't reverse a small group back to individuals.

That overlap-and-aggregate pattern is exactly the shape of a lot of RWE collaboration: cohort sizing, feasibility, external controls, safety signal checks across populations. Being able to say "we never move patient data and the partner never sees a row" turned month-long negotiations into something a data team could stand up in days. If I had to name the single feature that justified the whole programme to the business, it's this one. The honest friction: analysts used to poking at row-level data find the minimum-threshold, aggregate-only discipline frustrating, so set that expectation before they hit the wall, not after.

The minimum cohort-size threshold isn't only a UX nicety — it's the output-side analogue of the de-identification problem. An aggregate small enough to single out an individual is, for practical purposes, identifiable, so the threshold (plus cell suppression on sparse breakdowns) is how you keep what leaves the clean room below the same "very small risk" bar that Expert Determination demands of the inputs. Same standard, applied to the answer instead of the data.

The data contract management app

Here's the problem none of the above solves on its own. You've got 40-odd sources feeding a vault of clinical and genomic data, that data feeds studies and partner shares, and every one of those consumers has an implicit assumption about what a column means, what type it is, how fresh it is, and whether it contains PHI. The day a source quietly renames a field or changes a unit, a cohort definition breaks somewhere downstream and nobody notices until a study analyst does — which in this domain is a very bad time to find out.

The old stack handled this with tribal knowledge and a wiki that was always slightly wrong. We replaced it with explicit, enforced data contracts, and a small app to manage them. The principle: a contract is the single source of truth that both describes a dataset and provisions the governance for it, so the description and the reality can't drift apart.

What a contract is

Each governed dataset (and each partner share) has a contract — a version-controlled YAML file in git, owned by a named human. It declares schema and types, the semantic and PHI classification of each field, a freshness SLA, the quality expectations, who is allowed to consume it and for what, and the retention rule.

# contracts/clinical/observation.yml
dataset: rwe.clinical.observation
owner: data-platform@org   # a human team, not a service account
description: Patient clinical observations (FHIR-aligned), one row per observation.
sla:
  freshness: 24h           # must be no more than a day stale
  availability: business-critical
schema:
  - name: patient_token
    type: string
    classification: phi            # drives tokenization masking policy
    constraints: [not_null]
  - name: observation_code
    type: string
    semantic: loinc
    constraints: [not_null]
  - name: value_num
    type: number(18,4)
  - name: effective_ts
    type: timestamp_ntz
    constraints: [not_null]
quality:                            # mapped to Data Metric Functions
  - column: patient_token
    metric: null_count
    threshold: 0
  - column: observation_code
    metric: duplicate_count
    threshold: 0
  - dataset: freshness
    metric: freshness
    threshold_minutes: 1440
consumers:
  - name: oncology-rwe-study
    use: cohort-definition
  - name: partner-share:sponsor-x
    use: clean-room-only          # may only be reached through a clean room
retention: 7y

How the contract is enforced

A YAML file nobody enforces is just a nicer wiki, so the whole point is that breaking a contract blocks something. We wired four enforcement paths off the same file:

graph TD
    C["Data contract (YAML in git)\nschema · PHI tags · SLA · quality · consumers"]
    subgraph Enforce["Enforcement, generated from the contract"]
        SCHEMA["Schema — dbt model contracts\nfail the build on drift"]
        GOV["Governance — apply Snowflake tags\n+ tag-based masking from PHI class"]
        DQ["Quality — schedule Data Metric Functions\nalert on threshold breach"]
        SHARE["Sharing — provision share / clean-room\ndataset from 'consumers'"]
    end
    APP["Streamlit-in-Snowflake registry app\nbrowse · live compliance · request access"]
    C --> SCHEMA
    C --> GOV
    C --> DQ
    C --> SHARE
    SCHEMA --> APP
    GOV --> APP
    DQ --> APP
    SHARE --> APP

The contract is the single source that both documents the dataset and provisions its governance. Schema is enforced at build time by dbt model contracts; PHI classifications generate the Snowflake tags and tag-based masking; quality expectations become scheduled Data Metric Functions; and the consumer list provisions the shares and clean-room datasets. A Streamlit-in-Snowflake app renders the registry and live compliance state.

Schema, at build time. The contract's schema block maps to dbt model contracts, so a producer that changes a type or drops a column fails the build instead of silently shipping a breaking change to a study. Fail fast, upstream, where the producer can see it.
Governance, from the PHI class. A reconciliation job reads each contract's classifications and makes the actual Snowflake object tags match — which in turn pulls in the tag-based masking and tokenization from earlier. If someone adds a PHI column to a table without declaring it in the contract, the drift shows up as a violation rather than as an unmasked identifier in production.
Quality, continuously. The quality block maps to Snowflake Data Metric Functions scheduled on the table. Null counts, duplicate keys, and freshness are measured against the contract's thresholds, and breaches alert the owner. dbt tests guard the build; DMFs guard the live table between builds.
Sharing, from the consumer list. The consumers block is what actually provisions a Secure Data Share or a clean-room dataset. Because the grant is generated from the contract, the agreement and the access can't diverge — and a consumer marked clean-room-only physically can't be handed a direct share. Each consumer's entry also records the legal basis for that use — the IRB protocol number or the consent/de-identification determination it relies on — so the contract doubles as the audit answer to "under what authority does this party have this data?"

-- The contract's quality block compiles to DMF associations like this.
ALTER TABLE rwe.clinical.observation
  SET DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';
ALTER TABLE rwe.clinical.observation
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT      ON (patient_token);
ALTER TABLE rwe.clinical.observation
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.DUPLICATE_COUNT ON (observation_code);
ALTER TABLE rwe.clinical.observation
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.FRESHNESS       ON (effective_ts);
-- Results flow to SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS,
-- which the registry app reads to show live compliance per contract.

The app itself

The registry is a Streamlit-in-Snowflake app — deliberately, because it then runs next to the data and inherits the same RBAC, so it can show governed compliance state without becoming a new way to leak PHI. It lists every dataset and its contract, shows live status (last freshness, current DMF results, whether the live tags match the declared classifications, whether schema is in drift), names the owner, and tells you who consumes the dataset and how. Consumers request access through it, which routes to the owner rather than to a help desk that doesn't understand the data. Auditors get a rendered, point-in-time view of any contract and its compliance — which, in a regulated shop, is worth a great deal on its own.

Building it in Snowflake rather than as an external web app was a governance decision, not a convenience one: an external app would have needed its own credentials, its own copy of metadata, and its own audit story. In-Snowflake, it queries the account's own metadata and quality results under the caller's role, and there's nothing new to secure.

A contract only works if breaking it blocks something. The trap with data contracts is treating them as documentation. The value appears only when a violated contract fails a build, raises a quality alert, or refuses to provision a share. Make the owner a named human, make the contract the single thing that provisions governance (so it can't drift from reality), and make at least one consequence of breaking it impossible to ignore. Everything else is a prettier wiki.

What carries into Part 3

Governance and sharing are the half of the platform that justified the move. The other half is what the analysts and scientists actually do on top of it: running AI over clinical notes without exporting PHI, processing genomics at scale in Snowpark, and building cohorts and dashboards in Notebooks and Power BI — all on the governed, contracted data we've just described. That's Part 3.

🧬 Continue the series

The Migration, and the TCO Case
Governing PHI, Tokenization, Clean Rooms & the Data Contract App (this article)
Cortex AI, Genomics in Snowpark, and the Analytics Surface →