RWE Clinicogenomics: Migrating from AWS to Snowflake (and the TCO Case)

🧬 This is Part 1 of a 3-part series: RWE Clinicogenomics on Snowflake

The Migration, and the TCO Case (you are here)
Governing PHI, Tokenization, Clean Rooms & the Data Contract App
Cortex AI, Genomics in Snowpark, and the Analytics Surface

Most of the "should we move off AWS?" conversations I sit in don't really start with AWS. They start with a problem the team has been working around for a year, and the cloud is just where the problem happens to live. For the real-world evidence group I spent a good chunk of last year with, the problem was collaboration. They had a genuinely good clinicogenomics dataset — linked clinical records and genomic results across a few hundred thousand patients — and almost no safe way to let a pharma partner touch it without a six-week legal and security exercise per request.

This is a re-platforming story, but the interesting part isn't the database swap. It's that we moved the analytics and the collaboration layer onto Snowflake while leaving every byte of patient data exactly where it already was: in S3. No lift-and-shift of the lake, no "migration weekend" for petabytes of genomic output. This first part covers why they moved, how the migration actually worked, and — the bit everyone asks me about afterwards — whether it saved money. I'll deal with the governance and collaboration build in Part 2, and the analytics and genomics in Part 3.

What they had, and why it hurt

The starting estate was a fairly textbook AWS analytics stack, and I want to be fair to it — none of this was broken, exactly. S3 held the lake: clinical data as Parquet (observations, conditions, medications, encounters in a FHIR-ish shape) plus the genomics outputs — annotated variants and a pile of derived feature tables that the bioinformatics pipeline produced from raw VCFs. Redshift was the warehouse where the epidemiologists built cohorts. Streaming lab and device feeds came in through MSK (managed Kafka). Amazon DataZone had been adopted the year before to catalog and govern things and to broker data sharing. QuickSight ran the dashboards. EMR and a bit of SageMaker did the heavy genomic processing and the occasional model.

Four things kept coming up in retros:

Sharing patient-level data was painful and scary. Every external collaboration meant carving out a de-identified extract, shipping it somewhere, and trusting a contract. The team hated it, the lawyers hated it, and it didn't scale past a couple of partners.
Governing PHI was a manual habit, not a control. Masking lived in views and in tribal knowledge about which columns were sensitive. New tables showed up without anyone classifying them. Nobody could answer "where is the MRN, across everything?" in under a day.
The AI work meant moving data out. Running anything language-model-shaped over clinical notes meant exporting text to a separate environment, which for PHI is precisely the thing you don't want to be doing.
Cohort builds and dashboards fought each other. A big retrospective cohort job during business hours and the QuickSight refreshes would contend, and someone would complain.

None of these is a reason to panic. Together they're a reason to look at whether a different platform makes the default path the safe, easy one instead of the heroic one.

The one rule: the data stays on S3

The constraint that shaped everything was non-negotiable from the start, and honestly it's the right instinct: we were not going to copy the patient data into a new proprietary store. Compliance didn't want a second copy of PHI to reason about, finance didn't want to pay egress to re-ingest a lake, and nobody wanted a cutover that hinged on moving genomics-scale data without dropping a row.

The regulatory scope sat behind all of this, and it's worth naming up front because it constrained the design more than any performance target. This is PHI plus genomic data, so HIPAA applies in full and — if any of the evidence ever supports a regulatory submission — so do the FDA's real-world-evidence expectations and 21 CFR Part 11. The unglamorous first consequence: you cannot run PHI on a cloud platform without a Business Associate Agreement, so BAAs with both Snowflake and AWS were a precondition, not a procurement afterthought. And for a clinicogenomic dataset the de-identification story is genuinely harder than stripping identifiers — I deal with the whole compliance frame, and why a genome can't simply be "de-identified," in Part 2.

That rule used to be the thing that kept teams on Redshift. As of last June it stopped being a blocker, because Snowflake's Apache Iceberg tables went GA — which means Snowflake can run as the query engine and catalog over Parquet that lives in your S3 bucket, as a single copy, in an open format other engines can still read. That last part mattered: the bioinformatics team's Spark jobs weren't going anywhere, and they needed to keep reading the same tables.

So the mental model became: keep the lake, change the engine. Snowflake reads and writes the clinical and genomic tables as Snowflake-managed Iceberg tables on the existing S3 layout. Creating one is unremarkable, which is the point:

-- An external volume points at the existing S3 location; no data is moved.
CREATE OR REPLACE ICEBERG TABLE rwe.clinical.observation
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 's3_rwe_clinical'
  BASE_LOCATION = 'clinical/observation/';

-- Genomic feature tables come across the same way — still Parquet, still in S3,
-- now queryable from Snowflake with pruning and the rest of the engine on top.

For a handful of legacy raw drops that we only ever needed to read occasionally, plain external tables were fine. Everything that was queried for real became managed Iceberg, because the performance is close enough to native that the epidemiologists didn't notice they'd left Redshift — which is the highest compliment a migration like this gets.

The before-and-after, honestly

graph LR
    subgraph AWS["Before — AWS-native RWE stack"]
        S3a["S3 lake\nclinical Parquet + genomic outputs"]
        RS["Amazon Redshift\nwarehouse / cohort builds"]
        DZ["DataZone + Glue\ncatalog · governance · sharing"]
        QS["QuickSight\ndashboards"]
        MSK["MSK (Kafka)\nlab + device feeds"]
        EMR["EMR / SageMaker\ngenomics + models"]
    end
    subgraph SF["After — Snowflake (data still on S3)"]
        ICE["Iceberg tables on S3\nsingle copy, no data move"]
        SFW["Snowflake\nwarehouses · Snowpark · Notebooks"]
        GOV["Classification + tags\nmasking · row access · tokenization"]
        SHARE["Secure Data Sharing\n+ Data Clean Rooms"]
        PBI["Power BI"]
    end
    S3a --> ICE
    RS --> SFW
    EMR --> SFW
    DZ --> GOV
    DZ --> SHARE
    QS --> PBI
    MSK --> SFW

What got replaced by what. The lake on S3 stays put (now read as Iceberg). Redshift and downstream analytics/ML consolidate onto Snowflake and Snowpark; CWL/WDL/Nextflow bioinformatics pipeline execution stayed on AWS but moved from ad-hoc EMR clusters to AWS HealthOmics (not shown above — it is still an AWS service, just a more efficient one). DataZone's catalog/governance/sharing role splits into Snowflake's classification and policy features plus Secure Data Sharing and Data Clean Rooms (Part 2); QuickSight gives way to Power BI (Part 3), which the wider organisation had already standardised on.

The streaming side was the least dramatic part. The Kafka topics carrying lab results and device telemetry repointed at Snowflake through the Kafka connector running Snowpipe Streaming, landing into tables that the same downstream models read. We ran MSK and Snowflake in parallel for a few weeks, compared row counts and late-arrival behaviour, and turned off the old path once the numbers matched.

I won't pretend the SQL was free. Redshift dialect quirks, a few SUPER columns full of semi-structured JSON that needed re-thinking as VARIANT, and the usual storage-integration and IAM dance to let Snowflake's external volume touch the bucket — all of it took longer than the optimistic estimate. The cohort logic itself ported cleanly. The annoying 20% was, as always, the edges.

How we actually sequenced it

Nobody flips a regulated platform over a weekend, so we didn't. The order that worked: stand up the external volume and register the existing S3 tables as Iceberg first, so Snowflake could read production data read-only without anything changing for existing consumers. Then port the cohort SQL and validate it against Redshift output query-by-query — same inputs, same row counts, same aggregates — until the epidemiologists trusted it. Then repoint the Kafka streams in parallel. Governance and sharing came next (Part 2), and only then did we move the dashboards and turn Redshift off. Dual-running cost us a couple of months of paying for both, which I'd budget for explicitly rather than pretend away.

So did it save money?

This is the question that comes up in every steering meeting, and the honest answer is more interesting than a yes. Let me show you the shape of it, then the caveats, because a TCO table with no caveats is a sales deck, not an architecture decision.

These numbers are illustrative and anonymised — directional figures from a platform in the low-hundreds-of-thousands-of-patients range, annualised. Your mileage will vary by an order of magnitude depending on data volume and how heavily you'd reserved capacity. Treat the relationships as the takeaway, not the absolute dollars.

Cost area	Before (AWS-native)	After (Snowflake, data on S3)	Direction
Object storage (S3)	~$40k	~$40k (unchanged — same bucket)	flat
Warehouse / query compute	~$260k (Redshift RA3, reserved)	~$300k (Snowflake credits, consumption)	up
Genomics compute †	~$180k (EMR + SageMaker clusters)	~$120k (HealthOmics pipelines ~$80k + Snowpark analytics/ML ~$40k)	down
BI	~$30k (QuickSight)	~$30k (Power BI, already org-standard)	flat
Governance / catalog	DataZone + Glue (low licence, high effort)	included in Snowflake	down
Data movement / egress	~$40k (extracts, cross-service)	~$5k (sharing in place)	down
Platform engineering (FTE)	~2.5 FTE (~$450k loaded)	~1 FTE (~$180k loaded)	way down

† Genomics compute is a partial migration. CWL/WDL/Nextflow bioinformatics pipelines (alignment with BWA, variant calling with GATK, annotation) cannot run inside Snowflake — Snowpark has no concept of these workflow runtimes. That pipeline execution moved from ad-hoc EMR clusters to AWS HealthOmics, which runs WDL, Nextflow, and CWL natively and bills per workflow rather than per idle cluster-hour. Only the downstream work — post-pipeline feature engineering, polygenic risk scoring, ML over derived variant tables — moved to Snowpark. The cost dropped mainly because HealthOmics eliminated the idle-cluster waste, not because the full genomics stack consolidated onto Snowflake.

Look at the compute row first, because it's the one that surprises people: the headline warehouse spend went up. Reserved Redshift, fully utilised, is genuinely cheap per query-hour, and Snowflake's consumption model means an unattended warehouse or a sloppy clustering choice shows up on the invoice. If you only compare the database line items, Snowflake can look like a worse deal, and I've watched finance teams stop reading right there.

The story is in the other rows. The genomics compute line went down, but not because the whole stack moved to Snowflake — it's worth being precise about that. The primary bioinformatics pipelines (alignment, variant calling, annotation) run as CWL/WDL/Nextflow workflows; Snowpark has no runtime for those and never will. That work migrated from ad-hoc EMR clusters to AWS HealthOmics, which runs the same workflow definitions natively and bills per workflow run rather than per cluster-hour, eliminating most of the idle-time waste. Only the downstream work — feature engineering over derived variant tables, polygenic risk scoring, ML — moved to Snowpark. So the cost came down from the HealthOmics switch, not from replacing everything with Snowflake compute. The SageMaker models that lived on top of clinical features did consolidate fully into Snowpark + Snowflake ML, which contributed the smaller slice. Sharing in place collapsed the egress and the cottage industry of extract pipelines. And the big one — the line that doesn't appear on any cloud invoice — is people. Running Redshift plus EMR plus DataZone plus Glue plus the extract machinery was roughly two-and-a-half engineers' worth of keep-the-lights-on. On the consolidated platform that dropped to about one. That single row dwarfs the compute delta.

The line that mattered most wasn't on the cloud bill at all. The credit spend went up; total cost of ownership went down, and most of the saving was reduced operational toil and the elimination of duplicate data copies and extract pipelines — none of which shows up if you only diff the Redshift and Snowflake invoices. If your TCO case lives or dies on the compute line, you're measuring the wrong thing.

There's also a value side I won't try to put in the table because it's too situational: partner collaborations that used to take six weeks of legal and security work now take days (Part 2), which for a team whose business is data collaboration is worth more than the infrastructure savings combined. I flag it in steering decks as a benefit, not a number, and let the sponsors weigh it.

One last caveat on the consumption model, because it bites the unwary: Snowflake will cheerfully let you spend. Auto-suspend on every warehouse, separate warehouses per workload so a runaway cohort job can't drag the dashboards down, and resource monitors with hard caps were not optional — they were how we kept that ~$300k from quietly becoming $450k in the first quarter. Set them up on day one.

Where this leaves Part 1

The platform swap itself is almost anticlimactic once you accept the central trick: with Iceberg, you change the engine without moving the data, so the migration is a metadata-and-SQL exercise rather than a petabyte relocation. The TCO came out ahead, but for reasons that have little to do with the database and a lot to do with consolidation and operational toil.

What actually justified the move — governing PHI as an enforced control, and sharing data with partners without ever handing over a row — is the subject of Part 2, including the data contract management app we built to keep all of it honest. Part 3 then gets into the genomics work in Snowpark and the analytics surface in Notebooks and Power BI.

🧬 Continue the series

The Migration, and the TCO Case (this article)
Governing PHI, Tokenization, Clean Rooms & the Data Contract App →
Cortex AI, Genomics in Snowpark, and the Analytics Surface →