Building a Clinico-Genomics RAG on AWS: Architecture, Best Practices, and Hard-Won Lessons

📚 Part 3 of a 3-part series on RAG

  1. RAG From the Ground Up
  2. RAG on AWS: Bedrock Knowledge Bases, GraphRAG & Neptune
  3. Building a Clinico-Genomics RAG on AWS (you are here)

This is where the theory meets a domain that punishes shortcuts. Clinico-genomics (connecting a patient's genetic variants to genes, diseases, drugs, and clinical trials to support interpretation) is one of the hardest RAG problems in production. The knowledge base is enormous and changes weekly, the relationships matter more than the text, the queries are inherently multi-hop, and a confidently wrong answer can influence a clinical decision. It's also a near-perfect showcase for everything in Parts 1 and 2: hybrid retrieval, GraphRAG, semantic layers, and rigorous evaluation, all running on AWS.

This article walks through the architecture, the design decisions that actually matter, and the lessons that only show up once you've fed real annotation data into a real graph.

Why Genomics Breaks Naive RAG

Start with the scale of the knowledge. A serious variant-interpretation system pulls from a handful of canonical sources, and the volume is staggering:

| Source | Content | Approx. scale |
|---|---|---|
| ClinVar | Clinically reviewed variant–condition assertions | ~2.9M variants |
| gnomAD | Population allele frequencies | ~183.7M variants |
| GWAS Catalog | Genome-wide association study hits | ~625K associations |
| PharmGKB | Pharmacogenomics (drug response) | ~41K variants |
| SnpEff | Functional-effect predictions | Annotations across ClinVar |

That's roughly 190 million variant annotations, and it's exactly why RAG, not fine-tuning, is the right tool here. A published study that injected this full corpus via RAG into a GPT-4-class model hit 100% field-level accuracy on its annotation test sets, while fine-tuning a comparable model on a tiny fraction of the data plateaued at 52–95% depending on the field and would have been prohibitively expensive to scale to the full corpus. The conclusion from Part 1 holds with force here: large, changing, citation-requiring knowledge belongs in retrieval.

But raw RAG over these sources still fails, for reasons that are instructive:

  • Exact-token queries: a variant like NM_007294.4:c.5266dupC is a precise string. Dense embeddings smear it into "BRCA1 cancer prose." You need sparse/lexical retrieval (Part 1's hybrid search) or it's a coin flip.
  • The questions are multi-hop: "What therapies and open trials are relevant for a patient with a pathogenic BRCA1 frameshift variant and triple-negative breast cancer?" requires traversing variant → gene → disease → drug → trial. No document states that chain end to end.
  • Reasoning gaps: the study found the model "did not understand that high allele frequency variants tend to be benign." Retrieval injects facts, not clinical reasoning, so the system design has to encode that logic rather than hope the LLM infers it.

The Knowledge Graph Is the Product

The central design decision: model the domain as a knowledge graph, because the relationships are the clinical value. Before writing any retrieval code, you define an ontology: the formal vocabulary of entity types and the typed relationships between them. As the semantic-layer literature puts it, an ontology is what lets you infer information that "wasn't explicitly declared," and a knowledge graph is that ontology made queryable.

graph LR
    Patient["Patient\n(de-identified)"] -->|HAS_VARIANT| Variant["Variant\nc.5266dupC"]
    Patient -->|HAS_PHENOTYPE| Phenotype["Phenotype\n(HPO term)"]
    Variant -->|LOCATED_IN| Gene["Gene\nBRCA1"]
    Variant -->|CLASSIFIED_AS| Sig["Significance\nPathogenic (ACMG)"]
    Variant -->|HAS_FREQUENCY| AF["Allele Freq\n(gnomAD)"]
    Gene -->|ASSOCIATED_WITH| Disease["Disease\nHBOC syndrome"]
    Disease -->|TREATED_BY| Drug["Drug\nPARP inhibitor"]
    Drug -->|TARGETS| Gene
    Disease -->|STUDIED_IN| Trial["Clinical Trial\n(NCT id)"]
    Variant -->|AFFECTS_RESPONSE_TO| Drug
          

A clinico-genomics ontology fragment. The power is in the typed edges: a single multi-hop traversal from a Patient's Variant reaches the relevant Gene, Disease, Drug, and Trial, a path that vector search alone could never assemble.

The lesson that precedes all others: garbage in, garbage out. The dominant cause of GenAI failure in these systems is poor metadata and a sloppy ontology, not a weak model. Invest in the schema: typed relationships, controlled vocabularies (HPO for phenotypes, MONDO for diseases, HGVS for variant nomenclature, ACMG for significance), and provenance on every edge. Perfection isn't required (start with the high-value entities and grow), but the ontology is the foundation everything else stands on.
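
One way to make the schema enforceable is to write the ontology down as code before any data is loaded. The following is a minimal sketch under stated assumptions: the node and relationship names come from the diagram above, while the `Edge` class, its `source` and `record_id` provenance fields, and the `ALLOWED_EDGES` table are illustrative names, not part of Neptune or the GraphRAG Toolkit.

```python
from dataclasses import dataclass
from enum import Enum


class NodeType(str, Enum):
    PATIENT = "Patient"
    VARIANT = "Variant"
    GENE = "Gene"
    DISEASE = "Disease"
    DRUG = "Drug"
    TRIAL = "Trial"
    PHENOTYPE = "Phenotype"
    SIGNIFICANCE = "Significance"
    ALLELE_FREQ = "AlleleFreq"


# Typed relationships from the ontology fragment: label -> (source type, target type).
ALLOWED_EDGES = {
    "HAS_VARIANT": (NodeType.PATIENT, NodeType.VARIANT),
    "HAS_PHENOTYPE": (NodeType.PATIENT, NodeType.PHENOTYPE),
    "LOCATED_IN": (NodeType.VARIANT, NodeType.GENE),
    "CLASSIFIED_AS": (NodeType.VARIANT, NodeType.SIGNIFICANCE),
    "HAS_FREQUENCY": (NodeType.VARIANT, NodeType.ALLELE_FREQ),
    "ASSOCIATED_WITH": (NodeType.GENE, NodeType.DISEASE),
    "TREATED_BY": (NodeType.DISEASE, NodeType.DRUG),
    "TARGETS": (NodeType.DRUG, NodeType.GENE),
    "STUDIED_IN": (NodeType.DISEASE, NodeType.TRIAL),
    "AFFECTS_RESPONSE_TO": (NodeType.VARIANT, NodeType.DRUG),
}


@dataclass(frozen=True)
class Edge:
    label: str
    src_type: NodeType
    dst_type: NodeType
    src_id: str
    dst_id: str
    source: str     # provenance: originating database (hypothetical field name)
    record_id: str  # provenance: record identifier within that database


def validate_edge(edge: Edge) -> None:
    """Raise ValueError if the edge violates the schema or lacks provenance."""
    expected = ALLOWED_EDGES.get(edge.label)
    if expected is None:
        raise ValueError(f"unknown relationship type: {edge.label}")
    if (edge.src_type, edge.dst_type) != expected:
        raise ValueError(
            f"{edge.label} must connect {expected[0].value} to {expected[1].value}"
        )
    if not edge.source or not edge.record_id:
        raise ValueError("every edge needs a source and a record id")
```

Running a check like `validate_edge` in the ingestion pipeline rejects mistyped or unsourced edges before they reach the graph, which is cheaper than discovering them through a traversal that silently returns nothing.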

The AWS Architecture

Here's the full stack, assembled from the building blocks in Part 2 and tuned for a regulated, high-stakes domain.

flowchart TB
    subgraph Ingest["Ingestion & Knowledge Construction"]
        Src["ClinVar · gnomAD · GWAS\nPharmGKB · SnpEff (VCF)"] --> ETL["AWS Glue / Step Functions\nnormalize to HGVS, GRCh38"]
        ETL --> Build["Entity + relationship extraction\n(Bedrock + GraphRAG Toolkit)"]
        Build --> Neptune["Neptune\nKnowledge Graph"]
        ETL --> Embed["Titan Embeddings V2"]
        Embed --> OS["OpenSearch Serverless\nVector index"]
    end

    subgraph Serve["Agentic Serving Layer"]
        UI["Clinician UI"] --> AC["Bedrock AgentCore\nRuntime + Memory"]
        AC -->|tool calls| SL["Semantic Layer\n(LangChain / Strands tools)"]
        SL --> T1["find_variant()"]
        SL --> T2["traverse_to_trials()"]
        SL --> T3["hybrid_search()"]
        T1 --> Neptune
        T2 --> Neptune
        T3 --> OS
        AC --> FM["Bedrock FM\n(Claude, grounded answer)"]
    end

    subgraph Gov["Governance"]
        HIL["Human-in-the-loop review"]
        Prov["Provenance / citations"]
        Audit["CloudTrail audit log"]
    end

    FM --> HIL
    FM --> Prov
    AC --> Audit
          

End-to-end clinico-genomics RAG on AWS: an ingestion pipeline builds the Neptune knowledge graph and OpenSearch vector index; an AgentCore-hosted agent answers questions through a semantic layer of tested tools; governance wraps every answer with provenance, audit logging, and human review.

Ingestion: Normalize Before You Build

The sources arrive as VCF and tabular dumps with inconsistent identifiers. Before anything touches the graph, an AWS Glue / Step Functions pipeline normalizes everything to a common reference (GRCh38) and canonical nomenclature (HGVS for variants, dbSNP rsIDs, gene symbols). This is the unglamorous 60% of the project. Skip it and your graph has three nodes for the same variant and your traversals silently miss connections.
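
To illustrate why duplicate nodes appear, here is a small sketch of one normalization step: reducing a VCF-style record to a canonical key so that differently padded or differently prefixed representations collapse to a single node. The `VariantKey` type and `canonical_key` function are hypothetical names, the coordinates in the usage note are made up, and the function only trims shared bases. Full normalization also left-aligns indels against the reference sequence, which this sketch deliberately omits.

```python
from typing import NamedTuple


class VariantKey(NamedTuple):
    chrom: str
    pos: int  # 1-based position, assumed to already be on the common build
    ref: str
    alt: str


def canonical_key(chrom: str, pos: int, ref: str, alt: str) -> VariantKey:
    """Reduce a VCF-style record to a minimal, comparable key.

    Only trims shared bases; it does not left-align indels.
    """
    chrom = chrom.strip()
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    chrom = chrom.upper()
    ref, alt = ref.upper(), alt.upper()

    # Trim shared trailing bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]

    # Trim shared leading bases, advancing the position with each one.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    return VariantKey(chrom, pos, ref, alt)
```

With this key, `canonical_key("chr17", 100, "CAG", "CG")` and `canonical_key("17", 100, "ca", "c")` produce the same tuple, so two sources describing the same deletion merge into one graph node instead of two.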

Storage: Graph + Vectors, Side by Side

Neptune holds the knowledge graph; OpenSearch Serverless holds the vector index over the free-text annotations and literature. This is the GraphRAG split from Part 2: vector search finds semantic entry points (relevant literature, condition descriptions), then graph traversal expands along the clinically meaningful relationships. For the managed path, Bedrock Knowledge Bases GraphRAG on Neptune Analytics can automate much of the extraction; for a curated clinical ontology you'll usually want more control over the schema than full automation gives.
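
The two-stage pattern can be shown with a toy, in-memory sketch: cosine similarity picks entry-point documents, and a bounded breadth-first expansion follows typed edges from the entities those documents mention. Everything here is an assumption for illustration (two-dimensional vectors, a dictionary standing in for the graph, the function names); in the architecture described here, OpenSearch and Neptune play these roles.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def entry_points(query_vec, doc_vectors, k=2):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vectors,
        key=lambda doc_id: cosine(query_vec, doc_vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


def expand(graph, seeds, max_hops=2):
    """Collect (src, label, dst) triples reachable within max_hops of the seeds."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    triples = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for label, dst in graph.get(node, []):
            triples.append((node, label, dst))
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, depth + 1))
    return triples


# Toy data: vectors, document-to-entity links, and edges are all made up.
doc_vectors = {"doc_brca1": [0.9, 0.1], "doc_other": [0.1, 0.9]}
doc_to_node = {"doc_brca1": "Gene:BRCA1", "doc_other": "Gene:OTHER"}
graph = {
    "Gene:BRCA1": [("ASSOCIATED_WITH", "Disease:HBOC")],
    "Disease:HBOC": [
        ("TREATED_BY", "Drug:PARP inhibitor"),
        ("STUDIED_IN", "Trial:placeholder"),
    ],
}

seeds = [doc_to_node[d] for d in entry_points([1.0, 0.0], doc_vectors, k=1)]
context = expand(graph, seeds, max_hops=2)
```

The hop limit matters in practice: without it, expansion from a well-connected gene can pull in far more of the graph than fits in a prompt.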

Serving: An Agent Over a Semantic Layer

This is the most important architectural choice for safety. The LLM never writes raw openCypher against the clinical graph. Instead, Amazon Bedrock AgentCore hosts the agent (built with LangChain/LangGraph or Strands), and its Gateway exposes a curated semantic layer of tested tools. The agent decides which tool to call with which arguments; the tools encapsulate the actual queries.

from pydantic import BaseModel, Field
from langchain.tools import tool

class VariantLookup(BaseModel):
    hgvs: str = Field(description="HGVS notation, e.g. NM_007294.4:c.5266dupC")

@tool(args_schema=VariantLookup)
def find_variant(hgvs: str) -> dict:
    """Look up a variant by HGVS notation. Returns gene, ACMG
    significance, gnomAD allele frequency, and provenance. Uses an
    exact property match, the right tool for precise variant strings."""
    # Deterministic, parameterized openCypher; never LLM-generated.
    # `neptune` is assumed to be a thin client wrapper defined elsewhere.
    return neptune.query(
        "MATCH (v:Variant {hgvs: $hgvs})-[:LOCATED_IN]->(g:Gene) "
        "OPTIONAL MATCH (v)-[:CLASSIFIED_AS]->(s:Significance) "
        "OPTIONAL MATCH (v)-[:HAS_FREQUENCY]->(f:AlleleFreq) "
        "RETURN v, g, s, f, v.source AS provenance",
        {"hgvs": hgvs},
    )

@tool
def traverse_to_trials(gene_symbol: str, phenotype_hpo: str) -> list:
    """Find open clinical trials reachable from a gene and phenotype
    via disease associations. Encapsulates the multi-hop traversal so
    the agent never has to construct it."""
    return neptune.query(
        "MATCH (g:Gene {symbol:$g})-[:ASSOCIATED_WITH]->(d:Disease) "
        "MATCH (p:Phenotype {hpo:$p})<-[:HAS_PHENOTYPE]-(:Patient)"
        "-[:HAS_VARIANT]->(:Variant)-[:LOCATED_IN]->(g) "
        "MATCH (d)-[:STUDIED_IN]->(t:Trial {status:'Recruiting'}) "
        "RETURN DISTINCT t, d, t.nct_id AS citation",
        {"g": gene_symbol, "p": phenotype_hpo},
    )

Every tool returns provenance (the source database and record ID) so the final answer can cite where each fact came from. This turns the brittle "hope the model writes correct Cypher" approach into tested code that "works every time exactly as scripted," and it means a malformed query can never reach the database. AgentCore Memory carries conversational state across turns so a clinician can refine a question without re-specifying the patient context.

Best Practices That Earned Their Place

1. Hybrid Retrieval Is Non-Negotiable

Variant identifiers, gene symbols, rsIDs, and NCT trial numbers are exact tokens. Pure semantic search loses them. Run BM25 alongside vector search and fuse with RRF (Part 1), and route precise-identifier lookups through the exact-match graph tool rather than vector search at all. The biggest single accuracy regression we saw came from a well-meaning "just use embeddings, they're smarter" simplification.
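
Both halves of this practice fit in a few lines. The sketch below shows a router that sends precise identifiers to exact-match lookup, plus reciprocal rank fusion for everything else. The regular expressions are simplified assumptions (the HGVS pattern in particular covers only a common subset of the nomenclature), the function names are illustrative, and `k=60` is a commonly used default rather than a tuned value.

```python
import re
from collections import defaultdict

# Simplified, assumed patterns for identifiers that should bypass vector search.
EXACT_ID_PATTERNS = {
    "rsid": re.compile(r"rs\d+", re.IGNORECASE),
    "nct": re.compile(r"NCT\d{8}", re.IGNORECASE),
    "hgvs": re.compile(r"N[CGMR]_\d+\.\d+:[cgmnr]\.\S+"),
}


def classify_query(query):
    """Return the identifier kind if the whole query is one identifier, else 'free_text'."""
    text = query.strip()
    for kind, pattern in EXACT_ID_PATTERNS.items():
        if pattern.fullmatch(text):
            return kind
    return "free_text"


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of document ids.

    Each document scores sum(1 / (k + rank)) across the lists it appears in;
    ties are broken alphabetically so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a lexical and a vector retriever.
bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "c", "a"]
fused = rrf_fuse([bm25_hits, vector_hits])
```

A query classified as anything other than `free_text` would go straight to a tool such as `find_variant`, and only free-text questions would reach the fused ranking.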

2. Encode Clinical Logic: Don't Hope the Model Infers It

The "high allele frequency → likely benign" relationship is a known failure mode: the model won't reliably infer it from retrieved facts. Encode it. Either compute it as a property at ingestion (flag variants above a population-frequency threshold) or build it into a tool that the agent calls. The same goes for ACMG classification rules. Retrieval supplies facts; your code supplies the reasoning that clinical correctness depends on.
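
The ingestion-time option could look like the sketch below. The 5% default is an assumption chosen to mirror the ACMG/AMP stand-alone benign criterion (BA1); real pipelines often use gene-specific or disease-specific cutoffs, so the threshold is a parameter. The `FrequencyFlag` type and `flag_allele_frequency` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed default mirroring the ACMG/AMP BA1 criterion (allele frequency above 5%).
COMMON_AF_THRESHOLD = 0.05


@dataclass(frozen=True)
class FrequencyFlag:
    allele_frequency: Optional[float]
    is_common: Optional[bool]  # None when no frequency is available
    rationale: str


def flag_allele_frequency(
    af: Optional[float], threshold: float = COMMON_AF_THRESHOLD
) -> FrequencyFlag:
    """Turn a raw population frequency into an explicit, explainable flag."""
    if af is None:
        return FrequencyFlag(None, None, "no population frequency available")
    if not 0.0 <= af <= 1.0:
        raise ValueError(f"allele frequency must be in [0, 1], got {af}")
    if af > threshold:
        return FrequencyFlag(
            af, True, f"AF {af:.4f} exceeds {threshold:.2f}; evidence toward benign"
        )
    return FrequencyFlag(
        af, False, f"AF {af:.4f} does not exceed {threshold:.2f}; not informative alone"
    )
```

Storing the flag and its rationale on the variant node means the agent retrieves a precomputed statement instead of a bare number it would have to interpret.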

3. Provenance and Explainability Are Features, Not Logging

Every claim in an answer must trace to a source: "Pathogenic per ClinVar (RCV000031121), reviewed by expert panel." The graph traversal path is itself an explanation, so show it. This is a major reason GraphRAG suits this domain over opaque vector RAG: the inspectable path from variant to conclusion is exactly what a clinician needs to trust (or challenge) the output.
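
As a rough sketch of showing the path, the function below turns a list of traversed edges into numbered, cited lines and refuses to render a path that is broken or unsourced. The `Step` type, the output format, and the record ids in the usage note are illustrative assumptions, not the format of any particular tool.

```python
from typing import NamedTuple


class Step(NamedTuple):
    src: str
    label: str
    dst: str
    source: str     # database the edge came from
    record_id: str  # record identifier within that database


def render_path(steps):
    """Format a traversal path as numbered, cited lines.

    Raises ValueError if a step lacks provenance or does not continue
    from the previous step's destination.
    """
    lines = []
    previous_dst = None
    for number, step in enumerate(steps, start=1):
        if not step.source or not step.record_id:
            raise ValueError(f"step {number} has no provenance")
        if previous_dst is not None and step.src != previous_dst:
            raise ValueError(f"step {number} does not continue from {previous_dst}")
        lines.append(
            f"{number}. {step.src} -[{step.label}]-> {step.dst} "
            f"({step.source}: {step.record_id})"
        )
        previous_dst = step.dst
    return "\n".join(lines)
```

For example, two steps with placeholder record ids, `Step("Variant c.5266dupC", "LOCATED_IN", "Gene BRCA1", "ClinVar", "EXAMPLE-1")` followed by `Step("Gene BRCA1", "ASSOCIATED_WITH", "Disease HBOC", "ClinVar", "EXAMPLE-2")`, render as two cited lines a reviewer can check one edge at a time.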

4. Human-in-the-Loop, Always

This system is decision support, never decision making. The output is a structured, cited briefing that a geneticist or molecular tumor board reviews. Design the UI around augmenting the expert by surfacing the evidence and the path, not replacing them.

Compliance is architecture, not an afterthought. Patient genomic data is among the most sensitive PHI there is. De-identify before ingestion where possible; keep identifiable data in a HIPAA-eligible boundary. Bedrock, Neptune, OpenSearch, and AgentCore are all HIPAA-eligible, but eligibility is a starting point, not compliance: you still need a BAA, encryption at rest and in transit, least-privilege IAM, VPC isolation, and CloudTrail audit logging of every query. Some institutions require fully on-premises or single-tenant deployment for clinical decision support; design for that constraint up front rather than retrofitting it.

5. Evaluate Relentlessly, and Separately

Build a gold-standard evaluation set with geneticists: questions paired with correct, cited answers. Measure retrieval (did the traversal reach the right trials and assertions?) separately from generation (is the answer faithful to retrieved evidence, with no fabricated citations?). In this domain, faithfulness and citation accuracy outrank fluency: a beautifully written answer with one hallucinated trial ID is worse than useless. Track hallucination rate as a first-class metric and gate releases on it.
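
A minimal sketch of the two measurements and the release gate, assuming each evaluation case records the gold trial ids, the ids the retrieval step actually returned, and the generated answer text. The case layout, function names, and default thresholds are assumptions for illustration; only trial ids are checked here, and a fuller harness would cover other citation types the same way.

```python
import re

NCT_PATTERN = re.compile(r"NCT\d{8}")


def retrieval_recall(retrieved, gold):
    """Fraction of gold ids that retrieval returned (1.0 if there is nothing to find)."""
    gold_set = set(gold)
    if not gold_set:
        return 1.0
    return len(gold_set & set(retrieved)) / len(gold_set)


def fabricated_trial_ids(answer, retrieved):
    """Trial ids cited in the answer that were never retrieved."""
    cited = set(NCT_PATTERN.findall(answer))
    return sorted(cited - set(retrieved))


def evaluate(cases, min_recall=0.9, max_fabrication_rate=0.0):
    """Score retrieval and generation separately, then apply the release gate."""
    if not cases:
        raise ValueError("evaluation set is empty")
    recalls = [retrieval_recall(c["retrieved"], c["gold"]) for c in cases]
    fabricated = [bool(fabricated_trial_ids(c["answer"], c["retrieved"])) for c in cases]
    mean_recall = sum(recalls) / len(cases)
    fabrication_rate = sum(fabricated) / len(cases)
    return {
        "mean_recall": mean_recall,
        "fabrication_rate": fabrication_rate,
        "release_ok": mean_recall >= min_recall
        and fabrication_rate <= max_fabrication_rate,
    }
```

Keeping the two scores separate shows where a regression came from: a drop in recall points at the traversal or the index, while a rise in fabrication with recall unchanged points at the prompt or the model.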

Lessons Learned, Condensed

  • The ontology is 60% of the value and most of the effort. Normalization and a clean schema beat clever retrieval every time. Spend the time here.
  • GraphRAG earns its cost here. Multi-hop clinical questions are the rare case where graph traversal isn't over-engineering: it's the only thing that answers the question, and it brings explainability for free.
  • The semantic layer is what makes it safe. Choosing tested, parameterized tools over LLM-generated queries is the difference between a demo and something a hospital will let near a tumor board.
  • RAG injects facts; your code supplies reasoning. Don't expect the model to infer clinical rules. Encode them at ingestion or in tools.
  • Keyword search limitations are real. The cited genomics study flagged that pure keyword retrieval misses documents when query phrasing varies; hybrid search and the GraphRAG layer exist precisely to close that gap.
  • Compliance shapes the architecture from day one. Retrofitting HIPAA boundaries onto a finished system is a rebuild, not a patch.

Closing the Series

Across three articles we went from first principles (RAG is a search problem, and hybrid search plus reranking is where the wins live) through the full AWS toolkit of managed and DIY RAG on Bedrock and Neptune, and finally into a domain where every one of those techniques is load-bearing. The throughline: retrieval quality determines RAG quality, relationships need graphs, and in high-stakes domains the engineering around the model (the ontology, the semantic layer, the provenance, the human in the loop) matters more than the model itself. That's the part the demos never show, and the part that actually ships.
