RAG From the Ground Up: Types, Architecture, and What Actually Moves the Needle

📚 This is Part 1 of a 3-part series on RAG

RAG From the Ground Up (you are here)
RAG on AWS: Bedrock Knowledge Bases, GraphRAG & Neptune
Building a Clinico-Genomics RAG on AWS

Retrieval-Augmented Generation has gone through the full hype cycle in about three years: from "magic trick that makes ChatGPT cite your PDFs" to "the most over-promised and under-engineered pattern in applied AI" to, finally, a mature discipline with real engineering practices. The gap between a RAG demo and a RAG system that survives contact with production users is enormous, and most of that gap is in the retrieval half — the part everyone ignores because the generation half is the fun part.

This article is the foundation. We'll cover what RAG actually is (and isn't), the taxonomy of RAG architectures from naive to agentic, and — most importantly — the specific techniques that move quality metrics versus the ones that just sound impressive in a design doc. Parts 2 and 3 take this onto AWS and into a real clinical-genomics build.

What RAG Is, Stripped to the Core

RAG is a deceptively simple idea: instead of relying on a model's parametric memory (what it learned during training), you retrieve relevant information at inference time and inject it into the prompt as context. The model then generates an answer grounded in that retrieved context rather than its frozen training knowledge.

That's it. Everything else — embeddings, vector databases, rerankers, chunking strategies — is implementation detail in service of one goal: get the right context in front of the model at the right time. When people say "RAG is failing," they almost never mean the generation failed. They mean retrieval surfaced the wrong chunks, or the right chunks in the wrong order, or no relevant chunks at all.

The mental model that fixes most RAG bugs: RAG is a search problem with a language model stapled to the end. If your search is bad, no amount of prompt engineering on the generation side will save you. Debug retrieval first, always. Measure what fraction of your retrieved contexts actually contain the answer before you touch the prompt.

RAG vs Fine-Tuning: The Question Everyone Asks Wrong

The framing "should I use RAG or fine-tuning?" is misleading because they solve different problems. They're not competitors; they're complementary tools.

Dimension	RAG	Fine-Tuning
Best for	Injecting factual, changing knowledge	Teaching behavior, format, tone, domain patterns
Knowledge freshness	Update the index, instant	Retrain to update — slow and costly
Knowledge volume	Scales to hundreds of millions of facts	Limited by what fits in training without overfitting
Provenance / citations	✅ Natural — you know which doc was retrieved	❌ The model can't cite what it absorbed into weights
Cost to add knowledge	Storage + embedding cost (cheap)	Training compute (expensive, scales poorly)
Hallucination control	Strong (grounding in retrieved text)	Weaker (model still generates from memory)

The empirical pattern across domains is consistent: RAG wins for knowledge injection, fine-tuning wins for behavior and pattern learning. A real-world genomics study (which we'll revisit in Part 3) injected ~190 million variant annotations via RAG and hit 100% field accuracy, while fine-tuning the same model on a tiny fraction of that data reached only 50–95% accuracy depending on the field, and scaling fine-tuning to the full corpus would have been prohibitively expensive. The lesson generalizes: if the knowledge is large, changing, or needs citations, reach for RAG. If you need the model to behave differently (structured output, a house style, a reasoning pattern), fine-tune. Often you want both.

The RAG Maturity Ladder: Naive → Advanced → Modular → Agentic

RAG architectures fall on a spectrum of sophistication. Understanding where you are on this ladder tells you what your next improvement should be.

flowchart TB
    subgraph Naive["1 · Naive RAG"]
        N1["Chunk → Embed → Store"]
        N2["Query → Top-k vector search"]
        N3["Stuff context → Generate"]
        N1 --> N2 --> N3
    end

    subgraph Advanced["2 · Advanced RAG"]
        A1["Query rewriting / expansion"]
        A2["Hybrid search (BM25 + vector)"]
        A3["Rerank wide candidate set"]
        A4["Generate with curated context"]
        A1 --> A2 --> A3 --> A4
    end

    subgraph Modular["3 · Modular RAG"]
        M1["Routing across sources"]
        M2["Query decomposition"]
        M3["Iterative retrieve-read loops"]
    end

    subgraph Agentic["4 · Agentic RAG"]
        G1["Agent decides WHEN to retrieve"]
        G2["Tool calls: search, SQL, graph"]
        G3["Self-reflection & re-retrieval"]
    end

    Naive --> Advanced --> Modular --> Agentic

The RAG maturity ladder. Most teams ship Naive RAG, plateau on quality, and discover the wins are all in the Advanced tier — hybrid search and reranking — before they ever need Agentic complexity.

Naive RAG

The tutorial version: split documents into fixed-size chunks, embed them, store vectors, then for each query do a top-k cosine similarity search and stuff the results into the prompt. It works for demos and simple FAQ bots. It breaks the moment your corpus has nuance, your queries are phrased differently than your documents, or the answer requires combining information from multiple chunks.
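
The whole naive pipeline fits in a few lines. The sketch below is illustrative only: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector store" is a plain list, and the document text is made up.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def chunk_fixed(text, size=8):
    """Fixed-size chunking by word count (no overlap, no structure awareness)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def naive_retrieve(query, index, k=2):
    """Top-k cosine similarity search over (chunk, vector) pairs."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

doc = ("Refunds are issued within 14 days of purchase. "
       "Shipping takes 3 to 5 business days. "
       "Support is available by email on weekdays.")
index = [(c, embed(c)) for c in chunk_fixed(doc)]             # chunk -> embed -> store
context = naive_retrieve("how long do refunds take", index)   # query -> top-k
prompt = "Answer using only this context:\n" + "\n".join(context)  # stuff context
```

Even on this tiny input the second chunk straddles two sentences ("...business days. Support"), which is the fragmentation problem fixed-size splitting creates.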

Advanced RAG

This is where the bulk of the real-world quality gains live, yet most teams skip straight past it. Advanced RAG adds pre-retrieval steps (query rewriting, expansion), better retrieval (hybrid search), and post-retrieval steps (reranking, compression). We'll spend most of this article here because it has the highest ROI.

Modular RAG

Treats retrieval as composable modules: a router that sends different query types to different sources (vector store for prose, SQL for metrics, graph for relationships), query decomposition that splits a complex question into sub-questions, and iterative loops that retrieve, read, and retrieve again based on what was found.
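
As a rough sketch of the routing and decomposition modules: the source names and keyword rules below are hypothetical, and a production router would more likely use a classifier or an LLM call than regular expressions.

```python
import re

# Hypothetical source names and keyword rules, chosen only for illustration.
ROUTES = [
    ("sql",   re.compile(r"\b(how many|average|total|count|sum)\b", re.I)),
    ("graph", re.compile(r"\b(related to|connected|depends on|reports to)\b", re.I)),
]

def route(query, default="vector"):
    """Return the name of the source that should handle this query."""
    for source, pattern in ROUTES:
        if pattern.search(query):
            return source
    return default

def decompose(question):
    """Naive decomposition: split a compound question on the word 'and'."""
    parts = [p.strip(" ?") for p in re.split(r"\band\b", question)]
    return [p + "?" for p in parts if p]
```

Each sub-question from `decompose` can then be passed through `route` on its own, so one user question may hit several sources.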

Agentic RAG

An LLM agent decides whether and what to retrieve, calls retrieval as a tool alongside other tools (SQL queries, graph traversals, web search), evaluates whether the retrieved context is sufficient, and re-retrieves if not. This is powerful and expensive — every reflection step is another model call. Use it when queries are genuinely open-ended and multi-step, not because it's the newest pattern.
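
The control flow is simple even if the components are not. In this sketch all four callables (`retrieve`, `is_sufficient`, `rewrite`, `generate`) are hypothetical hooks; in a real agent each would typically be a model or tool call, which is where the cost comes from.

```python
def agentic_answer(question, retrieve, is_sufficient, rewrite, generate, max_steps=3):
    """Retrieve, check sufficiency, and re-retrieve with a rewritten query if needed.

    Returns the generated answer and the number of retrieval steps used.
    """
    query, context, steps = question, [], 0
    while steps < max_steps:
        steps += 1
        context.extend(retrieve(query))
        if is_sufficient(question, context):
            break
        # Reflection step: the context was not enough, so try a different query.
        query = rewrite(question, context)
    return generate(question, context), steps
```

The `max_steps` cap matters: without it, a question the corpus cannot answer would loop until the budget runs out.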

The Taxonomy of Retrieval Types

"RAG" hides a dozen distinct retrieval strategies. Knowing which one fits your data is half the battle.

Type	How it retrieves	Best for
Dense	Embedding similarity (semantic)	Paraphrased queries, conceptual matches
Sparse	Keyword / BM25 (lexical)	Exact terms, codes, names, acronyms
Hybrid	Dense + sparse fused (RRF)	Almost everything — the safe default
Graph (GraphRAG)	Vector entry + graph traversal	Multi-hop, relationship-driven questions
Multimodal	Text + image/table vector indices	Documents with diagrams, scans, tables
Self-RAG	Model critiques & re-retrieves	High-stakes accuracy, hallucination control
Adaptive	Retrieve only when needed	Mixed query loads, latency/cost sensitivity

The most important insight here: dense (semantic) search alone has a blind spot for exact tokens. Ask a dense retriever about "the BRCA1 c.5266dupC variant" and it may return semantically related cancer-genetics prose while missing the document that contains that exact variant string. Sparse/BM25 nails exact tokens but misses paraphrases. Hybrid search exists precisely because real queries need both.
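
To make the lexical side concrete, here is a minimal BM25 scorer, assuming a simple tokenizer that keeps dots inside tokens so a variant string survives as one term. The three documents are made up for illustration, and the IDF uses the Lucene-style variant; a real system would use a search engine's built-in BM25 rather than this sketch.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split, keeping inner dots so 'c.5266dupC' stays one token."""
    raw = re.findall(r"[a-z0-9.]+", text.lower())
    return [t.strip(".") for t in raw if t.strip(".")]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with BM25. Assumes a non-empty corpus."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "BRCA1 mutations raise hereditary breast and ovarian cancer risk.",
    "The BRCA1 c.5266dupC frameshift variant is classified as pathogenic.",
    "Genetic counseling is recommended before hereditary cancer testing.",
]
scores = bm25_scores("BRCA1 c.5266dupC variant", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The second document wins because it is the only one containing the exact variant token. The third scores zero despite being topically related, which is the paraphrase gap that dense retrieval covers.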

What Actually Moves the Needle

Here's the honest ranking of techniques by impact-per-effort, based on what consistently shows up in production post-mortems and benchmarks.

1. Chunking — The Single Biggest Lever

Chunking is the most underrated decision in RAG. Chunk too large and you dilute the embedding (one vector trying to represent five topics) and waste context window. Chunk too small and you fragment ideas across boundaries so no single chunk contains a complete answer.

Fixed-size / sentence-aware: The default. Split on ~256–512 tokens with overlap (10–20%). Boring, robust, fine for most prose.
Semantic chunking: Split where the topic shifts (detected by embedding distance between adjacent sentences). Pays off for dense technical docs and manuals where topic boundaries are messy.
Document-aware / structural: Respect headings, tables, code blocks, and lists. Essential for structured content — never split a table across chunks.

The practical chunking recipe: Start with sentence-aware fixed-size chunks (~400 tokens, 50-token overlap) plus structural awareness so you never break tables or code blocks. Only reach for semantic or LLM-based chunking when you can measure that boundaries are hurting retrieval. Don't pre-optimize chunking before you have an evaluation set — you'll just be guessing.
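
A minimal sketch of the sentence-aware part of that recipe. It approximates tokens with whitespace-split words (a real tokenizer will count differently), uses a simple punctuation-based sentence splitter, and leaves out the structural handling of tables and code blocks.

```python
import re

def chunk_sentences(text, max_tokens=400, overlap_tokens=50):
    """Pack whole sentences into chunks of roughly max_tokens words.

    Trailing sentences (up to overlap_tokens words) are repeated at the start
    of the next chunk. A single sentence longer than max_tokens is kept whole
    as its own oversize chunk rather than being cut mid-sentence.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Seed the next chunk with trailing sentences that fit the overlap budget.
            overlap, overlap_len = [], 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if overlap_len + prev_len > overlap_tokens:
                    break
                overlap.insert(0, prev)
                overlap_len += prev_len
            # Drop the overlap if it would push the new chunk over the limit.
            if overlap_len + n > max_tokens:
                overlap, overlap_len = [], 0
            current, current_len = overlap, overlap_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Overlapping by whole sentences instead of a raw token count means a chunk never starts in the middle of a thought, at the cost of slightly uneven overlap sizes.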

2. Hybrid Search — Catch Both Meaning and Exact Tokens

Combine BM25 (lexical) with vector (semantic) search and fuse the results with Reciprocal Rank Fusion (RRF). RRF is elegant: it ignores the raw scores (which aren't comparable across methods) and uses only rank position, so a document ranked highly by either method floats to the top.

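# Reciprocal Rank Fusion: combine several ranked lists of doc IDs
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Return doc IDs ordered by their summed 1 / (k + rank) across all lists."""
    scores = {}
    for ranked in ranked_lists:           # e.g. [bm25_results, vector_results]
        # Ranks are 1-based, as in the standard RRF formula.
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# bm25_hits and vector_hits are ranked lists of doc IDs from the two retrievers
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])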

3. Reranking — "Retrieve Wide, Rerank Narrow"

This is the highest-leverage single change you can make to an existing RAG system. Instead of retrieving the top 5 and using them directly, retrieve 20–50 candidates, score each query-document pair jointly with a cross-encoder reranker, and keep the top 5. The reranked top 5 are typically much better than the naive top 5, because the reranker sees the query and document together and captures nuances that independently computed embeddings miss.

Cross-encoders (e.g. Cohere Rerank, BGE-reranker): most accurate, more expensive per pair. The standard choice.
Late interaction (ColBERT): encodes query and doc separately, then computes token-level similarity. High accuracy at lower latency than a cross-encoder.
Score-based: cheap heuristic reordering (BM25 boosts). Fast, less nuanced.
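
The "retrieve wide, rerank narrow" shape can be sketched independently of any particular reranker. Here `retrieve` and `score_pair` are hypothetical hooks, and `overlap_score` is only a toy stand-in for a cross-encoder so the example runs without a model.

```python
def retrieve_then_rerank(query, retrieve, score_pair, wide_k=30, final_k=5):
    """Fetch wide_k candidates cheaply, rescore each (query, doc) pair jointly,
    and keep the final_k best."""
    candidates = retrieve(query, wide_k)
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:final_k]]

def overlap_score(query, doc):
    """Toy stand-in for a cross-encoder: fraction of query words found in the doc."""
    q_words = set(query.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & set(doc.lower().split())) / len(q_words)
```

In practice `score_pair` would wrap a reranking model or API call. Because that call runs once per candidate, `wide_k` is the knob that trades recall against latency and cost.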

4. Query Transformation

Users phrase questions badly. Before retrieving, rewrite and expand: generate 2–3 variations of the query (expand abbreviations, try synonyms, add domain context), retrieve for each, and fuse. For complex multi-part questions, decompose into sub-questions and retrieve for each independently. A related technique, HyDE (Hypothetical Document Embeddings), generates a fake "ideal answer" and embeds that to retrieve against — often more similar to real documents than the raw question is.
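
A sketch of the multi-query pattern, assuming two hypothetical hooks: `rewrite(query)` returns alternative phrasings (usually an LLM call) and `retrieve(query)` returns a ranked list of document IDs. Results are fused by rank position using Reciprocal Rank Fusion.

```python
def multi_query_retrieve(query, rewrite, retrieve, k=60, top_n=5):
    """Retrieve for the original query plus its rewrites, then fuse by rank (RRF)."""
    variants = [query] + [v for v in rewrite(query) if v != query]
    scores = {}
    for variant in variants:
        for rank, doc_id in enumerate(retrieve(variant), start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A document that appears near the top for several phrasings outranks one that appears for a single phrasing, which is the property you want when the user's wording may not match the corpus. HyDE fits the same shape: have `rewrite` return a generated hypothetical answer instead of a rephrased question.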

5. Metadata Filtering

The cheapest accuracy win nobody uses enough. Attach metadata (date, source, document type, department, access level) to every chunk and filter before or during vector search. Scoping "Q4 2025 financial reports" to actual Q4 2025 financial documents eliminates entire categories of wrong answers and is essentially free.
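
A minimal sketch of pre-filtering on metadata. The chunk layout and the field names (`doc_type`, `quarter`) are made up for illustration; most vector stores expose an equivalent filter parameter, so you would rarely hand-roll this.

```python
def filter_chunks(chunks, **required):
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        chunk for chunk in chunks
        if all(chunk["meta"].get(key) == value for key, value in required.items())
    ]

# Made-up chunks; only the metadata matters for this example.
chunks = [
    {"id": "a", "text": "Quarterly revenue summary",
     "meta": {"doc_type": "financial_report", "quarter": "2025-Q4"}},
    {"id": "b", "text": "Quarterly revenue summary",
     "meta": {"doc_type": "financial_report", "quarter": "2024-Q4"}},
    {"id": "c", "text": "Revenue team offsite notes",
     "meta": {"doc_type": "meeting_notes", "quarter": "2025-Q4"}},
]

# Only the scoped chunks go on to vector search and reranking.
scoped = filter_chunks(chunks, doc_type="financial_report", quarter="2025-Q4")
```

Chunks "a" and "b" have identical text, so no embedding could tell them apart; only the metadata can.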

You Cannot Improve What You Don't Measure

The number one reason RAG projects stall: no evaluation harness. Teams tweak chunk sizes and prompts based on vibes from spot-checking a handful of queries. Build an eval set of representative question-answer pairs and measure the retrieval and generation halves separately.

Metric	What it measures	Half
Context Recall	Did retrieval surface the chunks containing the answer?	Retrieval
Context Precision	Are the retrieved chunks mostly relevant (not noise)?	Retrieval
Faithfulness	Is the answer grounded in retrieved context (not hallucinated)?	Generation
Answer Relevance	Does the answer actually address the question?	Generation

The split matters enormously for debugging. Low context recall → fix retrieval (chunking, hybrid search, reranking). High recall but low faithfulness → fix the prompt or the model (it has the right context but isn't using it). Frameworks like RAGAS, TruLens, and Arize Phoenix automate these measurements, several using an LLM-as-judge to score faithfulness and relevance at scale.
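
The two retrieval metrics are easy to compute yourself once each eval question is labeled with the chunk IDs that contain its answer. These are simplified set-based versions; the frameworks named above define rank-aware or LLM-judged variants, so their numbers will not match these exactly. The IDs in the eval set are made up.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the answer-bearing chunks that retrieval surfaced."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)

def context_precision(retrieved_ids, relevant_ids):
    """Fraction of the retrieved chunks that are actually relevant."""
    retrieved = set(retrieved_ids)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant_ids)) / len(retrieved)

# One entry per eval question: what retrieval returned vs. what it should have found.
eval_set = [
    {"retrieved": ["c1", "c4", "c9"], "relevant": ["c1", "c2"]},
    {"retrieved": ["c7", "c3"], "relevant": ["c7"]},
]
recalls = [context_recall(e["retrieved"], e["relevant"]) for e in eval_set]
mean_recall = sum(recalls) / len(recalls)
```

Tracking the mean across the eval set after every retrieval change (chunk size, hybrid weights, reranker) turns tuning from guesswork into a measurable loop.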

The trap that wastes the most time: optimizing the generation prompt when the real problem is retrieval. If your context recall is 60%, then for 40% of questions the model is being asked to answer without the relevant text in its context. No prompt fixes that. Always measure context recall before you touch the prompt; it's the fastest way to know which half to debug.

Where This Series Goes Next

This was the conceptual foundation: RAG is a search problem, hybrid search and reranking are where the wins live, GraphRAG handles relationships that vector search structurally cannot, and nothing improves without measurement. Part 2 takes all of this onto AWS — Amazon Bedrock Knowledge Bases as managed RAG, the GraphRAG capability backed by Neptune Analytics, and when to build your own retrieval stack instead. Part 3 puts it to work on a genuinely hard domain: a clinico-genomics RAG where wrong answers have consequences and explainability isn't optional.
