โ† Back to Blog

The Evidence Layer in Healthcare & Biotech AI: HIPAA, 21 CFR Part 11, GxP, GMLP

📚 This is Part 3 of a 3-part series: Auditable AI in Regulated Industries

  1. The Evidence Layer in Banking: BCBS 239, CCAR, SOX
  2. Designing Multi-Agent AI Over Sensitive Data: Traceable by Construction
  3. The Evidence Layer in Healthcare & Biotech AI (you are here)

In banking, an unprovable number costs money and reputation. In healthcare and biotech, an unprovable AI output can cost a life, or invalidate a drug trial that took a decade and a billion dollars. The regulators are different, the acronyms are denser, and the stakes are higher, but the demand from Part 1 is word-for-word identical: prove how you got this result, and prove nothing was silently altered along the way.

This final article maps the life-sciences regulatory stack onto the same evidence-layer thinking, and shows how the traceable-by-construction architecture from Part 2 lands almost component-for-component in a clinical or genomics setting. If you read Part 1, the structure will feel familiar; that's the point. The evidence layer is one idea wearing different regulators' badges.

The framing that carries over: just as lineage is the evidence layer for BCBS 239, CCAR, and SOX, the audit trail is the evidence layer for HIPAA, 21 CFR Part 11, and GxP. When an auditor asks "why was this AI prediction trusted in a regulated decision?", the answer is a retrievable record: the data lineage, the model/code version, the validation results, and the human approvals. Build that and you've answered the whole alphabet of life-sciences regulators.

A Layered Regulatory Stack

Healthcare AI rarely faces one regulation. It faces a stack, and which layers apply depends on what the system does and where it operates. An engineer should know the shape of all of them.

| Layer | Governs | What the evidence layer must show |
|---|---|---|
| HIPAA / HITECH (US) | Privacy & security of protected health information (PHI) | Who accessed which patient data, when, and that access was authorized and minimal |
| FDA SaMD + GMLP | AI/ML as a medical device: safety & effectiveness across the product lifecycle | Representative training data, validation, human oversight, post-market monitoring |
| 21 CFR Part 11 / EU Annex 11 | Electronic records & signatures in FDA/EMA-regulated systems | An audit trail capturing who/what/when/why for every create, modify, or delete |
| GxP + GAMP 5 | Good practice (clinical/manufacturing/lab) + computer-system validation | The system is validated for its risk and intended use; data integrity is preserved (ALCOA+) |
| GDPR / EHDS (EU) | Personal & health data protection; European Health Data Space | Lawful basis, data-subject rights, and accountability over health data |
| EU AI Act | Most clinical AI is "high-risk" → logging & traceability (Part 2) | Automatic event logs, lifecycle traceability, human oversight |

The layers overlap, and that's good news: a single well-designed evidence layer satisfies the common core across all of them, exactly as it did for the three finance regimes in Part 1.

ALCOA+: The Data-Integrity Bedrock

Where finance has "accurate and traceable," life sciences has a more explicit creed: ALCOA+. Any data supporting a regulated decision must be:

| Principle | Meaning |
|---|---|
| Attributable | You know who (or which system/agent) created or changed it |
| Legible | Readable and permanent |
| Contemporaneous | Recorded at the time the activity happened |
| Original | The first capture (or a verified true copy) |
| Accurate | Correct and error-free |
| + Complete, Consistent, Enduring, Available | Nothing dropped, no contradictions, survives over time, retrievable on demand |

Read those again with an AI pipeline in mind. "Attributable" means an agent needs an identity (Part 2, principle 1). "Contemporaneous" means logging happens as the action occurs, not reconstructed later. "Enduring" and "Available" mean immutable, long-lived, queryable storage. ALCOA+ is, almost line for line, a specification for the audit store we designed in Part 2, written by pharma regulators decades ago.

21 CFR Part 11: Audit Trails for AI

21 CFR Part 11 (and its EU counterpart, Annex 11) governs electronic records and signatures in FDA-regulated work: clinical trials, manufacturing, lab systems. Its central demand is the audit trail: a secure, computer-generated, time-stamped record that captures who, what, when, and why for every creation, modification, or deletion of regulated data. Crucially, it must be independent of the operator: you must not be able to edit your own audit trail.

For AI, this has a sharp edge. If an AI system cleans, imputes, or transforms data in a regulated dataset, every one of those changes is a Part 11 event. The compliant pattern: any data cleaning an AI performs must be either reversible or, at minimum, transparently logged: what was changed, from what to what, by which model version, and why. A model that silently "fixes" a value with no audit entry has just corrupted a regulated record, however good its intentions.
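
One way to make "transparently logged" concrete is a hash-chained, append-only log, where every entry records who/what/when/why plus the old and new values and commits to the entry before it. The sketch below is illustrative only: the `AuditTrail` class, its field names, and the sample identities, timestamps, and values are assumptions, and hash chaining is a tamper-evidence technique, not a Part 11 compliance claim in itself.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    # Canonical JSON (sorted keys) so the same body always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditTrail:
    """In-memory, append-only log; each entry commits to the one before it."""

    def __init__(self):
        self._entries = []

    def append(self, who, what, when, why, old=None, new=None, model_version=None):
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS
        body = {"who": who, "what": what, "when": when, "why": why,
                "old": old, "new": new, "model_version": model_version,
                "prev_hash": prev_hash}
        self._entries.append({**body, "hash": _digest(body)})

    def verify(self):
        """True only if every entry matches its hash and links to its predecessor."""
        prev_hash = GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True


# Illustrative events: an AI imputation followed by a human approval.
trail = AuditTrail()
trail.append(who="agent:imputer", what="impute lab_value on record 17",
             when="2025-01-15T10:32:00Z", why="value missing at source",
             old=None, new=4.2, model_version="imputer-1.3.0")
trail.append(who="user:reviewer", what="approve imputation on record 17",
             when="2025-01-15T11:05:00Z", why="clinically plausible")
print(trail.verify())
```

A chain like this detects edits to existing entries but not truncation of the newest ones, and it says nothing about who can write to the store. A real system would also need write-once storage and access controls that keep operators away from their own trail.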

The question that defines the whole design: an inspector points at an AI-influenced decision and asks, "Why did you trust this output?" A compliant organization retrieves, in minutes, a single linked record: the data lineage (which sources, traced to origin), the code/model version that produced it, the validation records showing the model was fit for this use, and the human approvals in the loop. If any of those four is missing or can't be tied to this specific output, you don't have an answer; you have an observation in the inspection report.
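
The four-part record can be enforced as a release gate: an output ships only if all four references are present and tied to its ID. This is a minimal sketch under assumed names; `EvidenceBundle`, its fields, and `release` are hypothetical, and the references stand in for pointers into whatever lineage, registry, and approval systems are actually in use.

```python
from dataclasses import dataclass
from typing import Optional

# The four pieces of evidence named in the text above.
REQUIRED = ("lineage_ref", "model_version", "validation_ref", "approval_ref")


@dataclass(frozen=True)
class EvidenceBundle:
    output_id: str                         # the specific AI output being released
    lineage_ref: Optional[str] = None      # pointer to the data lineage record
    model_version: Optional[str] = None    # code/model version that produced it
    validation_ref: Optional[str] = None   # validation record for this intended use
    approval_ref: Optional[str] = None     # human approval event


def missing_evidence(bundle):
    """Return the names of required evidence fields that are absent or empty."""
    return [name for name in REQUIRED if not getattr(bundle, name)]


def release(bundle):
    """Refuse to release an output whose evidence record is incomplete."""
    gaps = missing_evidence(bundle)
    if gaps:
        raise ValueError(f"output {bundle.output_id} blocked; missing: {', '.join(gaps)}")
    return bundle.output_id
```

Because the check runs at release time, an incomplete record becomes a blocked output rather than a gap discovered during an inspection.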

GMLP: Governing the Model Across Its Lifecycle

For AI/ML that functions as a medical device, the FDA (together with Health Canada and the UK's MHRA, and now harmonized through an IMDRF guiding document finalized in January 2025) defines Good Machine Learning Practice (GMLP): ten guiding principles covering the total product lifecycle. The ones with the most direct architectural consequences:

  • Representative data. Training and test data must reflect the intended patient population, and you must be able to show that, which means data lineage on your training sets, not just your inference inputs.
  • Robust engineering & data integrity. The same software-quality and data-quality discipline as any safety-critical system.
  • Human oversight. The clinician stays in the loop; the human-in-the-loop gates from Part 2 are mandatory here, not optional.
  • Lifecycle monitoring (Principle 10). Deployed models are monitored for performance drift, and re-training risk is actively managed. Observability isn't a launch metric; it's a perpetual obligation.
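
Lifecycle monitoring needs a concrete drift signal. One common choice, swapped in here as an illustration and not prescribed by GMLP, is the Population Stability Index (PSI), which compares the distribution of a model score in production against a baseline captured at validation. The function name, bin count, and any alert threshold are assumptions to be set in your own monitoring plan.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a live sample of one numeric score."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic scores for illustration only.
baseline = [i / 100 for i in range(100)]
live = [x + 0.5 for x in baseline]  # a deliberately shifted population
print(population_stability_index(baseline, baseline))
print(population_stability_index(baseline, live))
```

Whatever metric is used, the monitoring result itself belongs in the audit trail, so that "the model was still performing as validated" is a retrievable record, not just an assertion.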

The FDA's January 2025 draft guidance on AI used to support regulatory decision-making for drugs and biological products pushes this further with a seven-step credibility assessment anchored to the model's context of use and risk, the same risk-based logic as GAMP 5 (below). And for models that learn after deployment, a Predetermined Change Control Plan (PCCP) lets you specify, in advance, what changes are permitted without a new submission, which only works if you can prove the model stayed within that envelope. That proof is, again, the evidence layer.
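
"Stayed within that envelope" can be checked mechanically once the envelope is written down as data. The sketch below is purely illustrative: `ChangeEnvelope`, the change-type labels, and the sensitivity/specificity floors are hypothetical stand-ins, since the real contents of a PCCP are whatever the authorized plan specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEnvelope:
    allowed_change_types: frozenset  # hypothetical labels, e.g. "retrain_same_architecture"
    min_sensitivity: float           # performance floors the updated model must meet
    min_specificity: float


def within_envelope(envelope, change_type, sensitivity, specificity):
    """Return (ok, reasons) so the decision itself can be written to the audit trail."""
    reasons = []
    if change_type not in envelope.allowed_change_types:
        reasons.append(f"change type '{change_type}' is not pre-authorized")
    if sensitivity < envelope.min_sensitivity:
        reasons.append(f"sensitivity {sensitivity} below floor {envelope.min_sensitivity}")
    if specificity < envelope.min_specificity:
        reasons.append(f"specificity {specificity} below floor {envelope.min_specificity}")
    return (not reasons, reasons)


envelope = ChangeEnvelope(frozenset({"retrain_same_architecture"}), 0.90, 0.85)
print(within_envelope(envelope, "retrain_same_architecture", 0.93, 0.88))
print(within_envelope(envelope, "new_input_modality", 0.93, 0.80))
```

Returning the reasons alongside the verdict means a rejected update leaves the same quality of evidence as an accepted one.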

GAMP 5: Risk-Based Validation

GxP systems must be validated: demonstrated to do what they're supposed to, reliably. GAMP 5 is the industry's risk-based framework for that computer-system validation: scale the rigor to the risk and the intended use. Its modern guidance (including the ISPE GAMP RDI appendix on AI/ML data integrity, 2024) extends the same thinking to AI: assess the model's risk, validate proportionally, and (the recurring theme) use data-lineage tools and version control for datasets and code so the validated state is reconstructable.
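
A reconstructable validated state starts with being able to name the exact dataset a model was validated against. A minimal sketch, assuming the dataset can be represented as a mapping of file paths to bytes, is a content fingerprint stored alongside the code and model versions. The function names and manifest layout are assumptions, not part of GAMP 5.

```python
import hashlib


def dataset_fingerprint(files):
    """Hash a dataset given as {relative_path: bytes}; independent of dict order."""
    digest = hashlib.sha256()
    for path in sorted(files):
        content_hash = hashlib.sha256(files[path]).hexdigest()
        # Bind each path to its content so renames and edits both change the result.
        digest.update(f"{path}\0{content_hash}\n".encode("utf-8"))
    return digest.hexdigest()


def validated_state(files, code_version, model_version):
    """Record what was validated, so the same state can be checked later."""
    return {
        "dataset_sha256": dataset_fingerprint(files),
        "code_version": code_version,
        "model_version": model_version,
    }


# Illustrative in-memory dataset; real use would read the files from storage.
training_files = {"variants.csv": b"id,label\n1,benign\n", "README.txt": b"v1"}
state = validated_state(training_files, code_version="abc123", model_version="2.0.0")
print(state["dataset_sha256"])
```

Re-computing the fingerprint before retraining or re-validation shows whether the dataset is still the one the validation records describe.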

The convergence across FDA credibility framing, GMLP, GAMP 5, and ALCOA+ is striking: they all land on the same checklist: risk-based rigor, representative and traceable data, human oversight, and continuous monitoring with an audit trail underneath it all.

The Architecture, in a Clinical Setting

Now watch the Part 2 architecture map onto a clinico-genomics AI: the kind of system that interprets a patient's variants against knowledge bases and drafts a clinician-facing summary. (This builds directly on the clinico-genomics RAG architecture from the earlier AWS series.) The components barely change; only the regulators' names do.

flowchart TB
    subgraph Train["Model lifecycle (GMLP · GAMP 5)"]
        TD["Training data\n(lineage to source: ClinVar, gnomAD…)"]
        MV["Model version + validation records"]
        TD --> MV
    end

    subgraph Serve["Inference (HIPAA · 21 CFR Part 11)"]
        Q["Clinician query"] --> ID["Agent identity + purpose"]
        ID --> GW["Governed gateway\nde-identify PHI · policy · log"]
        GW --> KB["Knowledge graph + records\n(row/col security)"]
        GW --> GEN["Model: grounded answer\n+ citations + provenance"]
        GEN --> HITL["👩‍⚕️ Clinician review & sign-off"]
    end

    AUDIT["🧾 ALCOA+ audit trail\nattributable · contemporaneous · enduring · immutable"]

    MV -.pinned to each output.-> AUDIT
    GW -.every access.-> AUDIT
    HITL -.approval event.-> AUDIT
    HITL --> OUT["Released result\n(traceable end-to-end)"]

The Part 2 pattern in a clinical setting. PHI is de-identified at the gateway; the model version is pinned to every output; the clinician sign-off is an audited event; training data carries lineage to source. The released result answers "why did you trust this?" by construction.

The mapping is almost mechanical:

| Part 2 principle | Healthcare instantiation |
|---|---|
| Agent identity | ALCOA+ "Attributable": every action tied to an identity |
| Govern at the data layer | HIPAA minimum-necessary access; row/column security on PHI |
| Governed gateway + redaction | De-identification/tokenization of PHI before it enters the model context |
| Decision trace | The "why did you trust this output" record: lineage + version + validation |
| Immutable audit log | 21 CFR Part 11 audit trail; ALCOA+ "Enduring/Available" |
| Human-in-the-loop gates | GMLP human oversight; clinician sign-off as an audited event |
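
The gateway rows of this mapping can be sketched as a single filter: drop every field the stated purpose does not need, then replace any remaining identifier with a keyed token before the record reaches the model context. The field names, the `PHI_FIELDS` set, and the token format below are hypothetical. Keyed-hash tokenization is pseudonymization, which is not on its own a HIPAA de-identification determination (that runs through the Safe Harbor or Expert Determination methods).

```python
import hashlib
import hmac

# Hypothetical set of fields treated as direct identifiers in this sketch.
PHI_FIELDS = {"name", "mrn", "date_of_birth"}


def tokenize(value, secret_key):
    """Deterministic keyed token: same input and key always give the same token."""
    mac = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return "tok_" + mac.hexdigest()[:16]


def gateway_filter(record, allowed_fields, secret_key):
    """Apply minimum-necessary filtering, then tokenize any PHI that remains."""
    out = {}
    for field_name, value in record.items():
        if field_name not in allowed_fields:
            continue  # not needed for this purpose, so it never leaves the gateway
        if field_name in PHI_FIELDS:
            out[field_name] = tokenize(str(value), secret_key)
        else:
            out[field_name] = value
    return out


# Illustrative record; the values are made up.
record = {"mrn": "12345", "name": "Jane Doe", "variant": "GENE1:c.100A>G",
          "address": "1 Main St"}
safe = gateway_filter(record, allowed_fields={"mrn", "variant"}, secret_key=b"demo-key")
print(safe)
```

Deterministic tokens let downstream steps join records for the same patient without seeing the identifier. The key then becomes the sensitive asset, so it has to live outside the model's reach.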

What Changes vs. Finance, and What Doesn't

Two things are genuinely harder in healthcare. First, training-data provenance is in scope, not just runtime data: GMLP and GAMP 5 want lineage on the datasets your model learned from, so you can show the population was representative and the data wasn't contaminated. Finance cares about model inputs; life sciences also cares about the model's upbringing. Second, human oversight is mandatory by regulation, not just prudent: a clinician must stay in the loop, and that loop must be evidenced.

What doesn't change is the spine. Identity, data-layer governance, a mediated and logged access path, decision provenance, an immutable audit trail, and lineage that treats the model as a node rather than a gap: that architecture is invariant across banking and biotech. The regulators wrote their rules independently, decades apart, and converged on the same answer because there is only one good answer to "prove it."

The closing principle for all three industries: compliance is not a document you write after the system works; it's a property you build into how the system works. Encode the rules as architecture (the data-layer policy, the gateway, the immutable log, the pinned versions), and the evidence is produced automatically as a byproduct of normal operation. The organizations that treat the evidence layer as core infrastructure ship AI into regulated environments. The ones that treat it as paperwork get stuck in pilot purgatory or, worse, in an inspection finding.

Series Wrap-Up

Across three articles, one idea: regulated industries don't run on trust, they run on evidence, and the evidence layer (lineage in finance, the audit trail in life sciences, observability in agentic AI) is the infrastructure that makes obligations provable. BCBS 239, CCAR, SOX, HIPAA, 21 CFR Part 11, GxP, GMLP, the EU AI Act: different badges, one demand. Design your AI and data systems so that "show me how you got this, and prove nothing was silently altered" always has an answer, and you can build in the most regulated environments on earth. Skip it, and no amount of model quality will save you.