← Back to Blog

RWE Clinicogenomics on Snowflake — Part 3: Cortex AI, Genomics in Snowpark, and the Analytics Surface

🧬 This is Part 3 of a 3-part series: RWE Clinicogenomics on Snowflake

  1. The Migration, and the TCO Case
  2. Governing PHI, Tokenization, Clean Rooms & the Data Contract App
  3. Cortex AI, Genomics in Snowpark, and the Analytics Surface (you are here)

Part 1 moved the platform; Part 2 governed it and made sharing safe. This last part is about what the scientists and analysts actually do on top of all that plumbing — the work that produces evidence rather than infrastructure. Three things: running AI over the unstructured clinical text without ever exporting PHI, doing genomics-scale processing in Snowpark, and giving people a sane analytics surface in Notebooks and Power BI. I'll spend most of the words on the genomics and the analytics surface, because those are the parts I get asked to expand on most.

Cortex AI on the messy 80%

A lot of the value in real-world data is locked in text — clinical notes, pathology reports, the free-text fields nobody structured. The old way of getting at that meant exporting the notes to wherever the model lived, which for PHI is a non-starter. Running Cortex's AI functions changes the geometry: the model call happens inside Snowflake, on data that never leaves the perimeter, billed against the same account.

The functions had gone GA the previous May, and for this kind of work the useful ones are unglamorous. We used CLASSIFY_TEXT for first-pass phenotyping — flagging notes that mention a condition of interest — EXTRACT_ANSWER to pull specific documented facts like performance status or a biomarker result, and SUMMARIZE to give a reviewer a quick read on a long note. A simplified version of the phenotyping pass:

SELECT
    note_id,
    patient_token,
    SNOWFLAKE.CORTEX.CLASSIFY_TEXT(
        note_text,
        ['type 2 diabetes', 'chronic kidney disease', 'neither']
    )                                                          AS phenotype_flag,
    SNOWFLAKE.CORTEX.EXTRACT_ANSWER(
        note_text,
        'What is the documented ECOG performance status?'
    )                                                          AS ecog_extract
FROM clinical.notes
WHERE note_date >= '2020-01-01';

Two caveats I'd put in bold if I could. First, this is a screening tool, not a diagnosis — anything that feeds an actual study definition got human review, and we tracked the model's agreement with chart abstractors before trusting it anywhere near an endpoint. Second, it's easy to run a function over ten million notes and get a bill that makes someone's eyes water. We always sampled a thousand rows, looked at the cost and the quality, then decided. For semantic retrieval over the literature and protocol library we stood up a Cortex Search service, which had gone GA in October and saved us from running a separate vector database for what was, in the end, document search.

Genomics-scale work in Snowpark

The bioinformatics team's first reaction to "we're moving to Snowflake" was, reasonably, "are you taking away my Python?" The answer was no, and this section is the long version of why that worked, because it's the part of the migration I was least sure about going in.

The shape of the data

The genomics side isn't big the way clinical data is big — it's big in a different, nastier way. The annotated-variant tables run to billions of rows: every patient contributes a few million variant calls, most of which are uninteresting, and the analysis almost always wants a thin slice (a gene panel, a region, a set of pathogenic annotations) joined back to the clinical cohort. So the work is dominated by two operations: filtering enormous variant tables down hard, and joining the survivors to clinical features. That profile is exactly what Snowflake's pruning is good at — if you lay the data out for it.

Keeping the Python, moving the compute

Snowpark let the team keep writing Python while the work executed next to the data instead of on a cluster they had to babysit. The variant normalisation and the feature-engineering steps that used to run on EMR became Snowpark DataFrame transforms, which push down to the warehouse as SQL rather than dragging data into a separate engine. Where the logic genuinely needed row-by-row Python — a parsing routine, a custom annotation lookup — we used vectorised Python UDFs and UDTFs so it still ran in-database. The mental shift for the team was small: same language, same libraries for the most part, but no cluster lifecycle to manage and, crucially, no step where patient-linked genomic data left the governed environment.

The clinicogenomic join — variants to clinical outcomes — is the whole point of a platform like this, and doing it inside Snowflake meant the genomic data and the PHI met in one governed place under the masking and row-access policies from Part 2, rather than on some analyst's export.

Polygenic risk scores as a batch job

A representative workload: computing polygenic risk scores across the cohort. A PRS is, mechanically, a big join-and-weighted-sum — take a patient's genotypes at the relevant variants, multiply by published effect weights, and aggregate. In Snowpark that's a few lines, and it runs as a batch over the whole cohort without moving anything:

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

# variants: (patient_token, variant_id, dosage)   — billions of rows, pruned hard
# weights:  (variant_id, effect_weight)            — the PRS model, thousands of rows
variants = session.table("genomics.variant_dosage")
weights  = session.table("genomics.prs_weights_cad")   # e.g. coronary artery disease

prs = (
    variants
      .join(weights, "variant_id")                       # inner join drops the 99% we don't score
      .with_column("contrib", col("dosage") * col("effect_weight"))
      .group_by("patient_token")
      .agg(sum_("contrib").alias("prs_raw"))
)
prs.write.mode("overwrite").save_as_table("genomics.prs_cad")

When there was an actual trained model rather than a published weight set — say a risk model combining genomic and clinical features — we registered it in the Snowpark Model Registry and ran batch inference from there, so the model had a version, lineage, and a governed home instead of living in someone's notebook. That mattered as much for reproducibility (a regulator asking "which model version produced this number?") as for engineering tidiness.

When you need a real binary: Container Services

Not everything in bioinformatics is expressible as SQL or a Python UDF. Plenty of the field's tooling is a compiled binary that expects a particular runtime, reference files on a local filesystem, and a specific OS. For those, Snowpark Container Services let us bring the container to the data rather than ship the data to the tool — the binary runs inside Snowflake, reads the governed tables, and writes results back, without an export. It went GA on AWS the previous August, which made it a real option for this project rather than a science experiment. Some legacy scripts assumed a local filesystem in ways that needed rework, and I won't pretend that was fun, but the structural win — no patient-linked genomic data leaving the platform — was worth the porting.

Making the big joins actually fast

This is where "it's just Snowflake now" stops being free. The billion-row variant tables needed deliberate physical design to prune well:

  • Cluster the variant tables on the access pattern — for us, roughly (chromosome, position), because analyses filter by genomic region and gene panel. With that clustering, a gene-panel query touches a handful of micro-partitions instead of scanning the genome. Without it, every query is a full scan, and the credit bill tells you so.
  • Filter before you join. The PRS join above is cheap because the inner join against a few-thousand-row weight table discards 99% of variants first. Structuring the work so the giant table is reduced early is the single biggest lever.
  • Size the warehouse to the join, not the average. The cohort-wide genomic jobs are spiky — big and occasional. We ran them on a separate, larger warehouse that auto-suspended aggressively, rather than oversizing the everyday warehouse and paying for headroom nobody used between runs.

"It's on S3 as Iceberg" doesn't exempt you from physical design. The open-format, single-copy story from Part 1 is real, but the laws of pruning still apply. The genomics tables only performed once we clustered them for the way analysts actually query — by region and panel — and structured the joins to shrink the big side first. Treat layout as part of the migration, not an afterthought, or the consumption bill will find you.

The analytics surface: Notebooks and Power BI

All of the above needs a place where people actually work. We ended up with two surfaces for two audiences: Snowflake Notebooks for the data scientists and epidemiologists who write code, and Power BI for everyone who consumes dashboards.

Snowflake Notebooks for the people who write code

The epidemiologists and data scientists live in Snowflake Notebooks now, which reached GA on the warehouse runtime in November. Python and SQL in the same document, running on a Snowpark session, with the Cortex calls available inline — so building a cohort, profiling it, scoring it, and charting the result happens in one place against governed data, without the export-to-laptop habit that compliance dreads. A few things made it stick rather than just being a nicer editor:

  • It runs as the user. A notebook executes under the analyst's own role, so the masking and row-access policies from Part 2 apply exactly as they do everywhere else. An analyst without re-identification rights sees tokens in their notebook, full stop. That's the property that let security sign off on people doing exploratory work directly against patient-linked data.
  • SQL and Python share state. A cell builds a cohort in SQL, the next cell picks it up as a Snowpark or pandas dataframe, the next runs a Cortex function or a model. No glue, no connection strings, no copying result sets around.
  • It's schedulable and version-controlled. The recurring RWE refreshes — rebuild this cohort weekly, re-score these patients — run as scheduled notebooks rather than as someone's local cron, and the notebooks live in git like everything else.

It replaced a scattering of local notebooks and the "can you send me the extract" emails that came with them, which from a PHI standpoint was a quiet but real win.

Power BI for everyone else

For the dashboards we moved to Power BI, partly because the broader organisation had already standardised on it and partly because the semantic-model layer fit how the reporting team wanted to work. Replacing QuickSight wasn't about QuickSight being bad; it was about one BI standard, one set of measures, and reports the rest of the business already knew how to read.

The thing worth flagging for anyone doing the same: the governance you set up in Snowflake follows the data into Power BI, but only if you wire the identity through. We used single sign-on with Microsoft Entra ID so that queries hit Snowflake as the actual viewer, which means the masking and row-access policies are enforced per person at query time. A clinician opening a report sees masked identifiers and only their permitted rows because Snowflake is enforcing it on the way out — not because someone built a second, parallel security model in Power BI. Getting that passthrough right is the single most important configuration step for regulated data on Power BI; skip it and you've quietly recreated the governance problem you just solved.

On the storage-mode question, which always comes up, the trade for this workload broke down like this:

ModeWhat it doesWhere we used it
ImportCaches data in Power BI's engine; fast, but a copyAvoided for anything patient-level — a copy of PHI outside Snowflake is exactly what we were trying not to create
DirectQueryQueries Snowflake live per interaction; governance enforced, no copyThe default for clinical/operational reports where freshness and policy enforcement mattered
Composite + aggregationsAggregates imported, detail via DirectQueryHigh-traffic dashboards — pre-aggregated summaries import safely (no PHI), drill-through hits Snowflake live

DirectQuery has a reputation for being slow, and it can be, so the everyday reporting warehouse was sized and tuned for BI concurrency and we leaned on composite models with aggregations to keep the interactive paths snappy while detail and anything sensitive went live to Snowflake. It's the usual Power BI trade; the only healthcare-specific twist is that "just import it" is off the table for patient-level data.

graph TD
    GOV["Governed data on Snowflake\n(Parts 1 & 2: Iceberg on S3,\nmasking, row access, tokenization)"]
    NB["Snowflake Notebooks\nruns as the user → policies apply\nSQL + Python + Cortex inline"]
    SP["Snowpark + Container Services\ngenomics, PRS, model registry"]
    PBI["Power BI\nSSO via Entra ID → per-viewer policy\nDirectQuery / composite + aggregations"]
    GOV --> NB
    GOV --> SP
    SP --> NB
    GOV --> PBI
          

The analytics surface sits entirely on the governed side of the line. Notebooks and Power BI both query as the end user, so the Part 2 policies are enforced per person; Snowpark feeds scored features (like PRS) back into the same governed tables that Notebooks and Power BI read.

When the evidence has to satisfy a regulator

Everything so far assumes internal analytics and partner collaboration. The bar rises when the RWE is meant to support a regulatory decision — a label expansion, a post-marketing commitment, an external control arm that a reviewer will actually scrutinise. The FDA's RWE program — built on the 21st Century Cures Act and extended by the FDORA mandate — is explicit that the data has to be fit for use and the analysis reproducible, and that turns several things from "nice to have" into requirements. Three shaped the platform directly:

  • Provenance and traceability. You have to show where every data element came from and how it was transformed, source to result. This is precisely what the data contracts and lineage from Part 2 buy you — they stop being good hygiene and become the thing that lets you answer an inspector. FDA also expects that the underlying source data can be made available for inspection, which is a contractual matter with the data originators, not just a technical one, so those agreements have to exist before you rely on a source.
  • Reproducibility. FDA wants the submitted data, code, and algorithms annotated and complete enough to re-run the analysis and land on the same answer. That means versioning the data (Time Travel and the Iceberg snapshots help), the transformation code (it's all in git), and the models — which is the real reason the polygenic-risk and phenotyping models live in the Model Registry with versions rather than in someone's notebook. "Which model produced this number, on which data, with which code?" has to have one unambiguous answer.
  • Part 11-grade systems. If electronic records feed a submission, the systems managing them are expected to meet 21 CFR Part 11 — audit trails, access controls, and the data-integrity properties usually summarised as ALCOA+ (attributable, legible, contemporaneous, original, accurate, and so on). Snowflake's access history plus the policy-enforced access from Part 2 cover much of that, but it's a posture you claim and document deliberately, not one you get for free by being on a good platform.

The AI-derived phenotypes deserve a second mention in this light, because the regulatory framing sharpens the earlier caveat. A Cortex-classified cohort flag is an algorithm-derived variable; if it feeds regulatory evidence, it has to be validated and fit-for-purpose — its agreement against chart review measured and documented, its derivation transparent enough to put in front of a reviewer. "The model seemed good" is an internal note, not a submission. We treated any Cortex output that might touch regulatory evidence as a variable requiring its own validation record, the same as any other derived endpoint.

Lessons learned across the whole programme

Pulling the three parts together, here's what I'd tell the next team doing this, in rough order of how much grief each one saved or cost us:

  • Make tokenization deterministic on day one. We got this right almost by luck, and it turned out to be the hinge for every clean-room collaboration. If your tokens aren't stable across parties and time, your privacy-preserving joins simply don't work. Decide it before you tokenize anything.
  • Treat physical layout as part of the migration. Iceberg-on-S3 doesn't repeal pruning. The genomic tables only performed once clustered for the real access pattern, and the streaming path needed small-file minding. Budget design time for it.
  • Wire identity through to the BI tool. Snowflake's governance only protects Power BI users if queries run as those users via SSO. The passthrough is the whole ballgame for regulated dashboards.
  • Classification is a draft, not an answer. The auto-classifier is a great starting inventory and a poor final authority. A human reviews it, and the data contract (Part 2) is what keeps it honest afterwards.
  • Watch the AI and the consumption bill like the engineering costs they are. A Cortex function over a giant note corpus is a priced batch job; sample and measure first. Auto-suspend, per-workload warehouses, and resource monitors are not optional under a consumption model.
  • Keep something interoperable. Choosing Iceberg over a closed format kept the bioinformatics Spark jobs alive and meant the migration was never an all-or-nothing bet. I'd make that choice again every time.
  • Version your models, not just your data. Putting the risk models in the Model Registry answered the "which version produced this number?" question before a regulator had to ask it.
  • Treat the regulations as the design spec, not a review gate. The single biggest mindset shift: for clinicogenomic RWE, "is this defensible under HIPAA Expert Determination and, if it's going to FDA, fit-for-use and reproducible?" is a question you answer while you design, not one you bolt on before submission. A genome can't be Safe-Harbored, so the whole tokenization-and-clean-room posture exists because the compliance frame demanded it — build it in from the first table.

Was it worth it?

For this team, clearly yes — but I want to be precise about why, because "we moved to Snowflake" is not a reason. They didn't move for raw query speed; Redshift was fine, and as Part 1 showed, the compute line even went up. They moved because the platform made the safe path the easy path: PHI governed by policy instead of by habit, partner collaboration that doesn't move patient data, AI and genomics that run where the data already sits, and all of it on a lake they never had to relocate. The database swap was the least important thing that happened.

If you're staring at a similar AWS-native RWE estate and the friction is governance and collaboration rather than performance, the thing I'd push you to internalise is that you can change the engine, the governance model, and the sharing model without touching the data in S3. Once that clicked for this team, the rest of the decisions more or less made themselves.

🧬 RWE Clinicogenomics on Snowflake — series complete

  1. The Migration, and the TCO Case
  2. Governing PHI, Tokenization, Clean Rooms & the Data Contract App
  3. Cortex AI, Genomics in Snowpark, and the Analytics Surface (this article)