Retrieval-Augmented Generation sounds simple on paper: give the LLM some context alongside the question, and it stops making things up. In practice, a RAG system that works in a demo and a RAG system that works in production are separated by a list of painful surprises that nobody writes tutorials about. This article is mostly about the surprises.
Google Cloud has three distinct RAG paths, a growing set of components to assemble them from, and a pricing model that can silently double your inference bill if you're not watching your context window. We'll cover the full picture: what each GCP service is actually for, how to build a production pipeline, the chunking and embedding decisions that determine retrieval quality, the operational problems you'll hit at scale, and how the GCP approach compares to AWS and Azure.
The GCP RAG Spectrum
Before touching any code, understand that Google Cloud doesn't have one RAG service — it has three, sitting at different points on the control-vs-complexity tradeoff:
Vertex AI Search
When: enterprise document search
Fully managed search and grounding. You give it documents (or point it at a data store), and it handles chunking, embedding, indexing, and search. Minimal configuration, opaque internals. Best for internal knowledge bases, customer-facing search, and grounding Gemini in unstructured documents when you don't want to manage any infrastructure.
Vertex AI RAG Engine
When: custom pipeline, managed infra
The "sweet spot" service. You create a corpus, import documents from GCS or Drive, and the engine handles chunking, embedding with text-embedding-005, and stores vectors in a managed Spanner-based database. At query time, it retrieves relevant chunks and injects them into a Gemini prompt. More control than Vertex Search, less operational burden than DIY.
DIY with Vector Search / AlloyDB
When: maximum control + scale
You manage everything: Cloud Storage → embedding pipeline (Dataflow / Cloud Run) → vector database (Vertex AI Vector Search or AlloyDB with pgvector) → retrieval + prompt assembly → Gemini API. Most flexible, most operational burden, cheapest at scale for high-throughput applications. Necessary when you need SQL-based retrieval, hybrid search, or your own embedding models.
The right choice depends mostly on how custom your retrieval needs to be. If your documents are PDFs, Word files, and web pages and your queries are natural-language questions, Vertex AI RAG Engine covers 80% of use cases with a fraction of the engineering cost. If you need to combine vector similarity with relational filters ("find documents about Q3 revenue, but only from the Finance department, created after 2024-01-01"), DIY with AlloyDB AI gives you SQL WHERE clauses alongside pgvector similarity queries — something the managed services can't do.
Architecture: What a Production RAG Pipeline Looks Like
flowchart TD
subgraph Ingestion["Ingestion Pipeline (offline)"]
GCS["Cloud Storage\nPDFs, DOCX, HTML, CSVs"]
PubSub["Pub/Sub trigger\n(new file → process)"]
CRun["Cloud Run / Dataflow\nParse + chunk + embed"]
EmbAPI["Vertex AI Embeddings\ntext-embedding-005\n768 dims"]
VDB["Vector Store\n(RAG Engine Spanner\nor AlloyDB pgvector\nor Vector Search)"]
GCS --> PubSub --> CRun --> EmbAPI --> VDB
end
subgraph Query["Query Pipeline (online)"]
User["User query"]
QEmbed["Embed query\n(same model as ingestion)"]
Retrieve["Vector similarity search\ntop-k chunks (k=5–10)"]
Rerank["Optional reranker\n(Vertex Rank API)"]
PromptBuild["Prompt assembly\nSystem + context chunks + query"]
Gemini["Gemini 1.5 Pro / Flash\nGeneration"]
Resp["Response + citations"]
User --> QEmbed --> Retrieve --> Rerank --> PromptBuild --> Gemini --> Resp
end
subgraph Observability["Observability"]
CloudLog["Cloud Logging\nlatency, token counts"]
Metrics["Cloud Monitoring\nP50/P95 end-to-end latency"]
Eval["Vertex AI Evaluation\nfaithfulness, answer relevance"]
end
VDB --> Retrieve
Gemini -.-> CloudLog
CloudLog --> Metrics
Resp -.-> Eval
Production RAG on GCP has two distinct pipelines: an offline ingestion pipeline that runs when documents change, and an online query pipeline that must stay under 2s P50. They share only the vector store and the embedding model — changes to either require re-indexing the corpus.
The most important architectural decision: use the same embedding model for ingestion and retrieval. This sounds obvious but causes silent failures in practice. Switching from textembedding-gecko-003 to text-embedding-005 without re-embedding your entire corpus produces nonsense retrieval results — the vector spaces are not interchangeable. Pin your embedding model version explicitly and run corpus re-indexing as a breaking-change migration.
Vertex AI RAG Engine: Internals and Tradeoffs
RAG Engine launched GA in 2024 and received significant updates through 2025, including multi-corpus support, a retrieval filter API, and Serverless mode. The managed database backing it is Spanner, Google's globally distributed ACID database, repurposed as a vector store. You don't see this; you work with the RAG Engine API. But the Spanner choice explains why RAG Engine is expensive relative to AlloyDB pgvector for high-insert-throughput workloads (Spanner pricing is per operation, not just per storage).
Creating a Corpus and Importing Documents
import vertexai
from vertexai.preview import rag

vertexai.init(project="my-project", location="us-central1")

# Create corpus
corpus = rag.create_corpus(
    display_name="product-docs-v2",
    embedding_model_config=rag.EmbeddingModelConfig(
        publisher_model="publishers/google/models/text-embedding-005"
    ),
)

# Import from GCS; RAG Engine handles chunking automatically
rag.import_files(
    corpus_name=corpus.name,
    paths=["gs://my-bucket/docs/"],
    chunk_size=512,     # tokens per chunk
    chunk_overlap=100,  # overlap for context continuity
)

print(f"Corpus created: {corpus.name}")
A single corpus works well at small scale, but a single massive corpus retrieves poorly. If your organization has a 50,000-document knowledge base spanning Finance, Engineering, Legal, and HR, putting everything in one corpus means retrieval must search across semantically different domains simultaneously. The result is that a finance query can retrieve HR documents ranked higher than finance documents because they happen to share vocabulary. Segment into multiple specialized corpora (one per domain, team, or subject area) and route queries to the appropriate corpus at the application layer.
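Application-layer routing can start out very simple. The sketch below assumes a keyword match is good enough to pick a domain; the corpus names, keyword sets, and the `route_query` helper are all hypothetical, and a small classifier or an LLM call would likely replace the keyword match in a real system.

```python
import re
from typing import Dict, Set

# Hypothetical corpus resource names, one per domain.
DOMAIN_CORPUS: Dict[str, str] = {
    "finance": "corpora/finance-docs",
    "hr": "corpora/hr-docs",
    "legal": "corpora/legal-docs",
    "engineering": "corpora/engineering-docs",
}

# Hypothetical trigger keywords per domain.
DOMAIN_KEYWORDS: Dict[str, Set[str]] = {
    "finance": {"revenue", "budget", "invoice", "forecast"},
    "hr": {"vacation", "payroll", "benefits", "onboarding"},
    "legal": {"contract", "nda", "liability", "compliance"},
    "engineering": {"deploy", "latency", "api", "outage"},
}

DEFAULT_DOMAIN = "engineering"


def route_query(query: str) -> str:
    """Return the corpus whose keyword set overlaps the query the most."""
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    best_domain, best_hits = DEFAULT_DOMAIN, 0
    for domain, keywords in DOMAIN_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_domain, best_hits = domain, hits
    return DOMAIN_CORPUS[best_domain]
```

Queries that match no keywords fall back to the default corpus, which is a design choice worth revisiting: fanning out to every corpus and merging results is the safer fallback if latency allows.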
The context window cost cliff: Gemini 1.5 Pro charges $3.50/M input tokens for prompts under 128k tokens, and $7.00/M for prompts over 128k. A RAG pipeline that retrieves 10 chunks of 512 tokens each adds 5,120 tokens of context. That's fine. A pipeline that retrieves whole documents or uses a large-k search can add 50,000+ tokens per request; combined with accumulated conversation history, that can push requests over the 128k threshold and silently double your inference cost. Set a hard context budget: retrieve top-k chunks where k × chunk_size stays well under 30,000 tokens.
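The arithmetic is worth having as code so it can sit next to your retrieval config. This is a minimal sketch using the prices quoted above; the function names are made up for illustration, and treating a prompt of exactly 128k tokens as the lower tier is an assumption.

```python
TIER_THRESHOLD_TOKENS = 128_000   # tier boundary quoted in the article
PRICE_PER_M_BELOW_USD = 3.50      # Gemini 1.5 Pro input price under the threshold
PRICE_PER_M_ABOVE_USD = 7.00      # input price over the threshold


def input_cost_usd(prompt_tokens: int) -> float:
    """Input cost of one request; the whole prompt is billed at the tier it lands in."""
    over = prompt_tokens > TIER_THRESHOLD_TOKENS  # assumption: exactly 128k is lower tier
    rate = PRICE_PER_M_ABOVE_USD if over else PRICE_PER_M_BELOW_USD
    return prompt_tokens / 1_000_000 * rate


def max_chunks(context_budget_tokens: int, chunk_size_tokens: int) -> int:
    """Largest k such that k * chunk_size stays within the context budget."""
    return context_budget_tokens // chunk_size_tokens


print(max_chunks(30_000, 512))   # upper bound on k for a 30k budget
print(input_cost_usd(5_120))     # ten 512-token chunks
print(input_cost_usd(130_000))   # just over the cliff
```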
Chunking: The Decision That Determines Everything
Chunking is the most impactful decision in any RAG pipeline and the one most often made wrong. A 2025 Vectara study across 25 chunking configurations and 48 embedding models found that chunking strategy influences retrieval quality as much as or more than the embedding model itself. Getting chunking right matters more than chasing the latest embedding benchmark.
The Four Strategies Worth Knowing
- Fixed-size (512 tokens, 10–20% overlap): The default, and usually fine. Split recursively on paragraph and sentence boundaries before falling back to character count. Avoid fixed-size without overlap, because adjacent chunks lose cross-boundary context. This is the baseline; start here.
- Semantic chunking: Split when the embedding similarity between consecutive sentences drops below a threshold. Chunks align with topic boundaries rather than token counts. Better for documents with diverse topics in one file; slower and more expensive (one embedding API call per sentence during ingestion).
- Document-structure-aware chunking: Use the document's own structure (headings, sections, list items) as chunk boundaries. Tables become individual chunks. Works well for structured documents like API docs, policy manuals, and technical specifications. Requires a proper parser (Vertex AI Document AI, LlamaParse, or Docling).
- Hierarchical chunking: Store both a summary chunk (large, for retrieval context) and fine-grained sub-chunks (small, for precise answer extraction). Query against summaries for broad retrieval, then refine with sub-chunks. The "Parent Document Retriever" pattern in LangChain implements this. Best for long documents with complex internal structure.
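To make the baseline concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes the text is already tokenized and uses whitespace-split words as a stand-in for model tokens; it does not do the recursive paragraph/sentence splitting described above, and `chunk_fixed` is a hypothetical helper, not a library function.

```python
from typing import List


def chunk_fixed(tokens: List[str], chunk_size: int = 512, overlap: int = 100) -> List[List[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Whitespace tokens are only an approximation of the embedding model's tokenizer.
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_fixed(words, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk starts with the last `overlap` tokens of the previous one, which is what preserves context across the boundary.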
For most GCP RAG projects: use fixed-size 512 tokens with 20% overlap for RAG Engine (since it controls chunking internally anyway, just set the parameters), and document-structure-aware chunking for any DIY pipeline processing technical docs, contracts, or reports with clear section headers.
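For the DIY case, structure-aware chunking can be sketched in a few lines once a parser has produced clean text. The example below assumes the document has already been converted to Markdown with `#` headings (the parsers named above can typically produce this, but that is an assumption about your setup); `chunk_by_headings` is a hypothetical helper.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown_text: str) -> List[Tuple[str, str]]:
    """Return (section title, section body) pairs, one chunk per heading."""
    chunks: List[Tuple[str, str]] = []
    title, body = "(preamble)", []
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section, skipping sections with no content.
            if any(part.strip() for part in body):
                chunks.append((title, "\n".join(body).strip()))
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    if any(part.strip() for part in body):
        chunks.append((title, "\n".join(body).strip()))
    return chunks


doc = "Intro text\n# Auth\nUse tokens.\n## Scopes\nread, write\n# Limits\n"
for section_title, section_body in chunk_by_headings(doc):
    print(section_title, "->", section_body)
```

Keeping the title alongside the body is deliberate: prepending it to the chunk text before embedding is a cheap way to carry section context into retrieval. Sections longer than your chunk budget would still need a fixed-size split inside them.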
AlloyDB AI: SQL-Native Vector Search
If your use case needs hybrid retrieval — filtering by metadata alongside vector similarity — AlloyDB AI with pgvector and Google's ScaNN index is the most capable GCP-native option. AlloyDB's embedding() function calls Vertex AI's text embedding model in-database, eliminating the need for a separate embedding pipeline:
-- Ingest: store document chunks with auto-generated embeddings
INSERT INTO document_chunks (doc_id, chunk_text, chunk_embedding, department, created_at)
VALUES (
    'doc-001',
    'Q3 revenue increased 18% YoY driven by enterprise segment growth...',
    -- embedding() returns real[]; cast it to vector for the pgvector column
    embedding('text-embedding-005', 'Q3 revenue increased 18% YoY...')::vector,
    'Finance',
    NOW()
);
-- Hybrid retrieval: vector similarity + metadata filters
SELECT doc_id, chunk_text,
       1 - (chunk_embedding <=> embedding('text-embedding-005', $1)::vector) AS score
FROM document_chunks
WHERE department = 'Finance'
  AND created_at >= '2024-01-01'
-- Order by distance ascending so the vector index can serve the query
ORDER BY chunk_embedding <=> embedding('text-embedding-005', $1)::vector
LIMIT 5;
AlloyDB's ScaNN index (available since AlloyDB AI in 2024) runs vector queries up to 10x faster than a standard PostgreSQL IVFFlat index on the same hardware. The <=> operator computes cosine distance; smaller is more similar, so 1 - distance gives the cosine similarity, which ranges from -1 to 1 and is typically positive for text embeddings.
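If the distance-versus-similarity relationship feels abstract, it is easy to check by hand. A minimal pure-Python sketch of what the cosine distance operator computes (the function name is illustrative; pgvector does this natively in the database):

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


query = [0.2, 0.9, 0.1]
chunk = [0.25, 0.8, 0.05]
distance = cosine_distance(query, chunk)
print("distance:", distance, "similarity:", 1 - distance)
```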
The trade-off: AlloyDB is priced like a database ($0.30+/vCPU-hour), not like a managed vector service. For a team storing 1 million document chunks, AlloyDB runs ~$450/month minimum (2 vCPU, 16 GB). Vertex AI RAG Engine Serverless mode for the same corpus might cost $80/month in storage plus retrieval calls. AlloyDB wins at high-throughput hybrid workloads; RAG Engine wins at simple vector-only retrieval with moderate volume.
Common Production Problems
1. Retrieval Relevance Collapse at Scale
A RAG system that works perfectly on your 500-document test corpus often degrades badly when you ingest the full 50,000-document production corpus. The reason: embedding space density. More documents means more near-neighbors for any given query vector, and top-k retrieval starts returning semantically adjacent but contextually irrelevant chunks.
Fix: Add a reranking step. Vertex AI's Rank API (semantic reranker) takes the top-50 retrieved candidates and reranks them using a cross-encoder model — much more accurate at relevance scoring than approximate nearest-neighbor distance alone. The cost is one Rank API call per query (~$0.001); the latency addition is ~100ms. Almost always worth it for corpora over 10,000 documents.
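The shape of the retrieve-then-rerank step looks roughly like this. The sketch takes the scoring function as a parameter because the real scorer would be a call to the Rank API or another cross-encoder; the `token_overlap_score` function here is only a toy stand-in so the example runs offline, and both function names are hypothetical.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score every ANN candidate against the query and keep the best top_n."""
    scored = [(passage, score_fn(query, passage)) for passage in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def token_overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words found in the passage.
    A stand-in for a cross-encoder, not a substitute for one."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


candidates = [
    "vacation policy for new hires",
    "q3 revenue grew in the enterprise segment",
    "revenue recognition policy",
]
for passage, score in retrieve_then_rerank("q3 revenue enterprise", candidates, token_overlap_score, top_n=2):
    print(round(score, 2), passage)
```

The point of the structure is that the first stage over-fetches (top-50 in the text above) and the second stage, which is slower but more accurate, decides what actually reaches the prompt.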
2. Hallucination From Over-Retrieval
Counterintuitively, retrieving more chunks often produces worse answers. With 20 chunks in the prompt, the model struggles to distinguish which chunks are actually relevant and starts synthesizing across contradictory sources. Production hallucination rates above 10% are often caused by too many retrieved chunks, not too few.
Fix: Start with k=5 and measure faithfulness with Vertex AI Evaluation's faithfulness metric (which checks if every claim in the answer is supported by a retrieved chunk). Increase k only if you're seeing "I don't have information about..." refusals, not if you're seeing confident wrong answers.
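That tuning rule can be written down as a small helper so it is applied consistently between evaluation runs. This is a sketch under assumptions: the thresholds, bounds, and the `adjust_k` name are hypothetical, and lowering k when answers are unfaithful is an inference from the over-retrieval argument above rather than something the evaluation service prescribes.

```python
def adjust_k(
    k: int,
    refusal_rate: float,
    unfaithful_rate: float,
    max_refusal: float = 0.05,      # hypothetical tolerance for "I don't have information" answers
    max_unfaithful: float = 0.10,   # hypothetical tolerance for unsupported answers
    k_min: int = 3,
    k_max: int = 10,
) -> int:
    """Suggest the next k to try based on the latest evaluation run."""
    # Confident wrong answers take priority: more context is unlikely to help.
    if unfaithful_rate > max_unfaithful:
        return max(k_min, k - 1)
    # Refusals suggest the needed chunk is not making it into the prompt.
    if refusal_rate > max_refusal:
        return min(k_max, k + 1)
    return k


print(adjust_k(5, refusal_rate=0.01, unfaithful_rate=0.20))
print(adjust_k(5, refusal_rate=0.20, unfaithful_rate=0.02))
```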
3. Cold Start Latency on Cloud Run
A Cloud Run-hosted RAG service that scales to zero will have cold starts of 2–8 seconds — long enough that users assume the service is broken. RAG query pipelines load the embedding model client on startup, which adds to this.
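Fix: Set Cloud Run minimum instances to 1 for any user-facing RAG service. Idle minimum instances are billed at a reduced rate, so one always-warm instance is a small monthly cost; check current Cloud Run pricing for your region and instance size rather than relying on a fixed figure. Worth it. For batch or async RAG (summarization pipelines, document processing), scale-to-zero is fine.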
4. Embedding Model Version Drift
Google periodically updates embedding models (gecko-003 → text-embedding-004 → text-embedding-005). Each update produces a different vector space. Any corpus embedded with an old model returns bad results if queried with a new model.
Fix: Treat embedding model version as a versioned dependency in your infrastructure as code (Terraform variable). When you upgrade the model, trigger a full corpus re-embedding as part of the deployment. Track corpus embedding version in AlloyDB/RAG Engine metadata. Never upgrade the query embedding model without upgrading the ingestion pipeline simultaneously.
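A cheap safeguard is to fail loudly at query time when the models disagree. The sketch below assumes you record the embedding model name alongside the corpus at ingestion; `CorpusMetadata`, `EmbeddingModelMismatch`, and `assert_same_embedding_model` are hypothetical names, and where the metadata actually lives (a table row, a corpus label) is up to you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusMetadata:
    name: str
    embedding_model: str  # recorded at ingestion time


class EmbeddingModelMismatch(RuntimeError):
    """Raised when the query-side model differs from the one used to embed the corpus."""


def assert_same_embedding_model(corpus: CorpusMetadata, query_model: str) -> None:
    if corpus.embedding_model != query_model:
        raise EmbeddingModelMismatch(
            f"corpus {corpus.name!r} was embedded with {corpus.embedding_model!r}, "
            f"but the query pipeline is configured for {query_model!r}; re-embed before querying"
        )


corpus_meta = CorpusMetadata(name="product-docs-v2", embedding_model="text-embedding-005")
assert_same_embedding_model(corpus_meta, "text-embedding-005")  # passes silently
```

Running this check once at service startup, rather than on every request, is usually enough.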
5. The 128k Token Context Cliff (Gemini-Specific)
Gemini 1.5 Pro has a 2M context window, which sounds like unlimited RAG context. The reality: pricing doubles at 128k tokens ($3.50 → $7.00/M input tokens). Applications that naively pass entire documents rather than chunks, or that accumulate conversation history alongside retrieved context, can silently cross this threshold and double the inference bill without any visible error.
Fix: Instrument token counting explicitly. Use model.count_tokens() before every Gemini call in staging and alert when prompt size approaches 100k tokens. Set a hard prompt budget in your RAG assembly code.
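A hard prompt budget in the assembly code might look like the sketch below. The token counter is passed in as a function so the example runs offline with a whitespace approximation; in a real pipeline it would presumably wrap the model's own token counting. `assemble_context` and `whitespace_token_count` are hypothetical helpers.

```python
from typing import Callable, List


def assemble_context(
    chunks: List[str],
    count_tokens: Callable[[str], int],
    budget_tokens: int,
) -> List[str]:
    """Keep chunks in relevance order until the next one would exceed the budget."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:  # assumed sorted best-first by the retriever or reranker
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


def whitespace_token_count(text: str) -> int:
    """Rough stand-in for a real tokenizer; undercounts compared with model tokens."""
    return len(text.split())


ranked_chunks = ["a b c", "d e", "f g h i"]
print(assemble_context(ranked_chunks, whitespace_token_count, budget_tokens=5))
```

Stopping at the first chunk that does not fit, instead of skipping it and trying smaller ones, keeps the prompt in strict relevance order. Remember that the budget has to cover system instructions and conversation history as well, not just retrieved chunks.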
TCO Comparison: Three Approaches on GCP
| Approach | Monthly infra cost (10k docs, 5k queries/day) | Setup time | Flexibility | Best for |
|---|---|---|---|---|
| Vertex AI Search | ~$300–600 (data store + query volume) | Hours | Low | Enterprise document search, no custom logic |
| Vertex AI RAG Engine | ~$80–200 (Serverless storage + API calls) | 1–3 days | Medium | Custom chunking/embedding, moderate query volume |
| AlloyDB AI + DIY | ~$450–800 (AlloyDB min config + Cloud Run) | 1–3 weeks | Maximum | Hybrid SQL+vector, high volume, custom embedding models |
| Vector Search + DIY | ~$150–400 (Vector Search + Cloud Run) | 1–2 weeks | High | High-scale pure-vector retrieval, >100k docs |
All approaches share Gemini inference costs, which often dominate at scale: Gemini 1.5 Flash at $0.075/M input tokens is the economical choice for high-volume RAG; Pro at $3.50/M is for complex reasoning tasks where Flash's quality isn't sufficient. A 5,000 query/day RAG service with 10k input tokens per query processes 50M input tokens per day, which costs ~$3.75/day on Flash, still very manageable. The same volume on Pro costs ~$175/day. Model selection is the largest cost lever.
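The inference estimate is a single multiplication, so it is easy to recompute for your own traffic and prompt sizes. A minimal sketch using the per-million input prices quoted above; output tokens are ignored here, which is an assumption that understates the real bill, and `daily_input_cost_usd` is an illustrative name.

```python
def daily_input_cost_usd(
    queries_per_day: int,
    input_tokens_per_query: int,
    price_per_million_usd: float,
) -> float:
    """Daily input-token cost; output tokens are not included."""
    total_tokens = queries_per_day * input_tokens_per_query
    return total_tokens / 1_000_000 * price_per_million_usd


FLASH_PRICE = 0.075  # USD per million input tokens, as quoted in the article
PRO_PRICE = 3.50     # USD per million input tokens, as quoted in the article

for label, price in [("Flash", FLASH_PRICE), ("Pro", PRO_PRICE)]:
    cost = daily_input_cost_usd(5_000, 10_000, price)
    print(f"{label}: ${cost:.2f}/day")
```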
Agentic RAG: The Next Step
Classic RAG is one retrieval step. The system retrieves, then generates. This fails on multi-hop questions ("What was the revenue growth in the division that launched the most products in 2024?") that require chaining multiple retrievals based on intermediate answers.
Agentic RAG adds a reasoning step: the model decides what to retrieve, retrieves it, decides if it needs more, retrieves again, and then generates. Google's Agent Development Kit (ADK) provides the framework for this on GCP, with Vertex AI RAG Engine or AlloyDB as the retrieval backends. The ADK agent runs on Cloud Run or Vertex AI Agent Engine, with Vertex AI Evaluation measuring end-to-end task completion rather than single-turn faithfulness.
sequenceDiagram
participant U as User
participant A as Gemini Agent (ADK)
participant C as Corpus / AlloyDB
participant G as Gemini LLM
U->>A: "What drove revenue growth in the top-performing division last quarter?"
A->>G: Plan: what do I need to retrieve?
G-->>A: Step 1 - retrieve top-performing division
A->>C: vector search("top-performing division Q3 revenue")
C-->>A: Finance division +18% YoY
A->>G: Now retrieve drivers for Finance division growth
G-->>A: Step 2 - retrieve Finance growth drivers
A->>C: vector search("Finance division revenue drivers 2024")
C-->>A: Enterprise segment, new product launches, APAC expansion
A->>G: Synthesize final answer with citations
G-->>A: Grounded response with source references
A-->>U: Answer + citations [doc-042, doc-117, doc-203]
Agentic RAG uses the LLM as a retrieval planner, not just a generator. Each retrieval step is conditioned on the results of the previous one, enabling multi-hop reasoning over large document corpora that single-retrieval RAG cannot handle.
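Stripped of framework details, the loop in the diagram is short. The sketch below deliberately does not use ADK APIs: the planner and retriever are plain callables, and the two stubs stand in for a Gemini planning call and a vector search so the example runs offline. All names here are hypothetical.

```python
from typing import Callable, List, Optional


def agentic_retrieve(
    question: str,
    plan_next_query: Callable[[str, List[str]], Optional[str]],
    retrieve: Callable[[str], List[str]],
    max_steps: int = 3,
) -> List[str]:
    """Alternate planning and retrieval until the planner is satisfied or steps run out."""
    evidence: List[str] = []
    for _ in range(max_steps):
        next_query = plan_next_query(question, evidence)
        if next_query is None:  # planner signals it has enough to answer
            break
        evidence.extend(retrieve(next_query))
    return evidence


def stub_planner(question: str, evidence: List[str]) -> Optional[str]:
    """Stand-in for an LLM planning call: two hops, then stop."""
    if not evidence:
        return "top-performing division"
    if len(evidence) == 1:
        return "growth drivers " + evidence[0]
    return None


def stub_retrieve(query: str) -> List[str]:
    """Stand-in for a vector search against a corpus."""
    return ["Finance"] if query.startswith("top") else ["enterprise segment"]


print(agentic_retrieve("What drove growth in the top division?", stub_planner, stub_retrieve))
```

The `max_steps` cap matters in production: each hop adds a planning call and a retrieval round trip, so an unbounded loop is both a latency and a cost risk.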
Quick Alternative Comparison
| Feature | GCP (RAG Engine + Gemini) | AWS (Bedrock Knowledge Bases) | Azure (AI Search + OpenAI) |
|---|---|---|---|
| Managed vector store | Spanner-backed (RAG Engine) or AlloyDB | OpenSearch Serverless or Aurora pgvector | Azure AI Search (hybrid search built-in) |
| Hybrid search (vector + keyword) | AlloyDB DIY; RAG Engine: vector-only | OpenSearch hybrid out of box | Azure AI Search: best-in-class hybrid |
| LLM integration | Native Gemini grounding | Claude, Titan, Llama via Bedrock | Azure OpenAI (GPT-4o, o1) |
| Multi-corpus / namespacing | Yes — multiple RAG Engine corpora | Yes — multiple knowledge bases | Yes — index namespaces |
| Context window | 2M tokens (Gemini 1.5) | 200k (Claude 3.5) | 128k (GPT-4o) |
| Managed reranking | Vertex AI Rank API | Not native (use Cohere) | Azure AI Search semantic ranker |
| Best strength | Gemini context window + Google Search grounding | Multi-model flexibility, no LLM lock-in | Hybrid search quality, Azure ecosystem |
Azure AI Search has the most mature hybrid retrieval (BM25 + vector with semantic reranker) and is the right choice if you're in the Microsoft ecosystem. AWS Bedrock Knowledge Bases wins on LLM flexibility — you can swap between Claude, Titan, and Llama without changing your RAG pipeline. GCP wins on context window and when you're already using Google Workspace data sources (Drive, Gmail, Docs) that integrate natively with Vertex AI Search.
A Practical Starting Point
If you're starting a new RAG project on GCP in 2025, this is the stack I'd reach for before considering alternatives:
- Vertex AI RAG Engine Serverless for the vector store — no AlloyDB provisioning cost, free tier available for development.
- text-embedding-005 (768 dimensions) — Google's best general-purpose embedding model as of mid-2025. Pin the exact version in Terraform.
- Fixed-size chunking at 512 tokens, 20% overlap — the benchmark-validated default. Switch to semantic chunking only after measuring retrieval quality on your specific documents.
- Vertex AI Rank API for reranking — add this after initial retrieval, before prompting Gemini. One API call, huge quality improvement.
- Gemini 1.5 Flash for query response — cheaper and faster than Pro, sufficient quality for most RAG use cases. Use Pro only for complex multi-hop reasoning.
- Cloud Run with min-instances=1 for the query service. Scale-to-zero for anything that isn't user-facing.
- Vertex AI Evaluation running nightly on a golden question set, with faithfulness and answer relevance reviewed weekly, not just at launch.
The two things to track before anything else: your end-to-end P95 latency and your hallucination rate, starting in week one. These two metrics, checked weekly, will catch most meaningful regressions before your users do. Everything else (reranking, chunking strategy, model upgrades) should be evaluated by whether it moves those two numbers in the right direction.
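Both numbers are simple to compute from request logs and evaluation results. A minimal sketch, assuming you have a list of end-to-end latencies and a per-answer faithful/unfaithful flag from your evaluation run; the nearest-rank percentile method and the function names are choices made for this example.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def hallucination_rate(faithful_flags: Sequence[bool]) -> float:
    """Fraction of evaluated answers judged not faithful to the retrieved context."""
    if not faithful_flags:
        raise ValueError("need at least one evaluated answer")
    return sum(1 for faithful in faithful_flags if not faithful) / len(faithful_flags)


latencies_s = [0.8, 1.1, 0.9, 3.2, 1.0, 1.4, 0.7, 1.2, 1.3, 6.5]
flags = [True, True, False, True]
print("P95 latency (s):", percentile(latencies_s, 95))
print("hallucination rate:", hallucination_rate(flags))
```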
RAG on GCP isn't complicated in principle. It's complicated in the details: embedding model versions, context budgets, corpus segmentation, cold starts, and reranking. The managed services handle the infrastructure complexity well. The rest is engineering discipline: pin your versions, measure retrieval quality, set context budgets, and don't confuse "works in a demo" with "works in production at 5,000 queries per day."