Building a RAG System on GCP for a Real Estate Agency

📚 This is Part 2 of a 2-part series on RAG on Google Cloud

RAG on GCP: From First Corpus to Production
Building a RAG System on GCP for a Real Estate Agency (you are here)

Part 1 covered the GCP RAG toolbox in the abstract. This part is the opposite: one concrete domain, end to end. We'll build the retrieval system a mid-sized residential real estate agency actually needs — the kind with a few thousand active listings, a dozen agents, and a website where buyers ask questions in plain English. Real estate turns out to be a near-perfect stress test for RAG, because it breaks the naive "embed everything and search" approach in four different ways at once.

By the end you'll have a reference architecture on Google Cloud, the data model that makes hybrid retrieval work, the retrieval flow that combines hard filters with semantic search, and — the part everyone forgets until legal calls — the compliance guardrails that keep a property chatbot out of Fair Housing trouble.

What the Agency Actually Wants

Strip away the buzzwords and there are three concrete products hiding inside "we want AI for our listings":

Buyer-facing search assistant — a website chatbot that answers "show me 3-bed homes under $750k near downtown with a yard and a home office" and returns real, current listings with links and photos.
Agent copilot — an internal tool that drafts listing descriptions, summarizes comparable sales ("comps"), answers questions about disclosures and contracts, and pulls neighborhood facts on demand.
Document Q&A — grounded answers over the agency's pile of unstructured documents: HOA rules, inspection reports, seller disclosures, neighborhood guides, and market reports.

All three are RAG. But the first one is where the design decisions get interesting, because a listing is half structured data and half prose, and the query mixes hard constraints with soft preferences.

Why Naive RAG Fails Here

If you take the tutorial approach — dump every listing into a vector store, embed the query, return top-k — you'll ship something that demos well and falls apart in week one. Four reasons:

Problem	Why vector-only RAG breaks
Hard constraints are non-negotiable	"Under $750k" and "3 bedrooms" are filters, not similarity. A semantic search will happily return a beautiful $900k 2-bed because the description is similar. Price and bedroom count must be exact filters, not vectors.
Freshness is brutal	Listings change daily — price drops, status flips to "pending" or "sold." A stale index that shows a sold house as available is worse than no answer. The index must track the source of truth in near-real-time.
Geography is structured, not semantic	"Near downtown" or "in the Eastside school district" is a spatial/categorical filter. Embeddings don't understand a 5-mile radius or a district boundary.
Compliance is a hard wall	Under the Fair Housing Act, the system must not steer buyers based on protected characteristics (race, religion, family status, etc.) or answer "is this a good neighborhood for [group]." This can't be left to model judgment.

The fix is the pattern from Part 1's "Advanced RAG": structured pre-filtering plus semantic ranking, with the structured store as the source of truth. Hard constraints filter the candidate set deterministically; semantic search ranks what survives by how well it matches the soft preferences.

The Architecture

Here's the full system on GCP. Two ingestion paths feed one hybrid retrieval store; the query service combines structured filtering, semantic search, and reranking before grounding Gemini.

flowchart TB
    subgraph Ingest["① Ingestion"]
        MLS["MLS feed / listings DB"]
        DOCS["Documents:\ndisclosures, HOA, reports (GCS)"]
        MLS --> CF["Cloud Functions:\nnormalize + enrich"]
        CF --> BQ["AlloyDB:\nstructured fields + pgvector"]
        DOCS --> DF["Document AI +\nchunking"]
        DF --> EMB["Vertex AI Embeddings\n(text-embedding-005)"]
        PHOTO["Listing photos"] --> MM["Multimodal embeddings"]
        EMB --> BQ
        MM --> BQ
    end

    subgraph Serve["② Retrieval & serving (Cloud Run)"]
        Q["Buyer / agent query"] --> GUARD["Fair-Housing guardrail\n+ query parse"]
        GUARD --> FILTER["Structured pre-filter\n(price, beds, geo, status)"]
        FILTER --> VEC["Semantic search\nover survivors"]
        VEC --> RANK["Vertex Ranking API\nrerank top-k"]
        RANK --> GEM["Gemini:\nground + cite listings"]
        GEM --> A["Answer + listing cards"]
    end

    BQ -.serves.-> FILTER
    BQ -.serves.-> VEC

Real-estate RAG on GCP. AlloyDB holds both the structured listing fields and the pgvector embeddings, so a single store does deterministic filtering and semantic ranking. The guardrail runs before retrieval, not after.

The Data Model: One Store, Two Jobs

The single most important decision is keeping structured fields and embeddings in the same row. On GCP you have two good options for this: AlloyDB (or Cloud SQL) for PostgreSQL with the pgvector extension, or BigQuery with vector search. For a transactional, frequently-updated listings catalog with sub-second query needs, AlloyDB is the better fit — it gives you ACID updates for freshness and a vector index in the same table you filter on.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE listings (
    id              BIGINT PRIMARY KEY,
    status          TEXT NOT NULL,          -- active | pending | sold
    price           INTEGER NOT NULL,
    bedrooms        SMALLINT NOT NULL,
    bathrooms       NUMERIC(3,1) NOT NULL,
    sqft            INTEGER,
    city            TEXT,
    school_district TEXT,
    geo             GEOGRAPHY(Point, 4326), -- lat/long for radius queries
    description     TEXT,                   -- the prose half
    updated_at      TIMESTAMPTZ NOT NULL,
    embedding       vector(768)             -- text-embedding-005 dims
);

-- Index the things we filter and rank on
CREATE INDEX ON listings USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON listings (status, price, bedrooms);
CREATE INDEX ON listings USING gist (geo);

The embedding is generated from the prose half — description, neighborhood notes, standout features — not the structured fields. You never want to embed "$749,000" into a vector; you filter on it. Embed the language a buyer uses ("open-concept kitchen, walkable, great natural light") and filter the numbers.

Generating embeddings on ingest

from vertexai.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("text-embedding-005")

def embed_listing(listing: dict) -> list[float]:
    # Embed only the unstructured, preference-bearing text
    text = " ".join(filter(None, [
        listing["description"],
        listing.get("neighborhood_notes", ""),
        ", ".join(listing.get("features", [])),
    ]))
    return model.get_embeddings([text])[0].values  # 768-dim vector

Hybrid Retrieval: Filter Hard, Rank Soft

The query path is where real estate RAG lives or dies. A buyer asks: "3-bed under $750k near downtown with a home office and good light." That decomposes into hard constraints (beds = 3, price ≤ 750000, within radius of downtown, status = active) and soft preferences ("home office," "good light"). The hard constraints become a SQL WHERE; the soft preferences become the embedded query vector that orders the survivors.

-- :qvec = embedding of "home office, good natural light"
-- :lng/:lat = geocoded "downtown" centroid
SELECT id, price, bedrooms, description,
       embedding <=> :qvec AS distance
FROM listings
WHERE status = 'active'                         -- freshness: never show sold
  AND price <= 750000                            -- hard constraint
  AND bedrooms >= 3                              -- hard constraint
  AND ST_DWithin(geo, ST_MakePoint(:lng,:lat)::geography, 8000)  -- 8km
ORDER BY embedding <=> :qvec                     -- soft ranking
LIMIT 30;                                        -- retrieve wide

Two things make this work. First, the structured predicates run before the vector ordering, so the expensive similarity computation only touches listings that already satisfy the non-negotiables. Second, we retrieve wide (30) and rerank narrow: cosine distance gets us a good candidate set, but a cross-encoder reranker scores the query against each description jointly and surfaces the genuinely best matches.

from google.cloud import discoveryengine_v1 as de

def rerank(query: str, candidates: list[dict], top_n: int = 5):
    client = de.RankServiceClient()
    records = [
        de.RankingRecord(id=str(c["id"]), content=c["description"])
        for c in candidates
    ]
    resp = client.rank(de.RankRequest(
        ranking_config=RANKING_CONFIG,
        model="semantic-ranker-default@latest",
        top_n=top_n,
        query=query,
        records=records,
    ))
    keep = {r.id for r in resp.records}
    return [c for c in candidates if str(c["id"]) in keep]

Grounding Gemini — and Forcing Citations

The reranked top 5 listings go into the Gemini prompt as grounded context. The non-negotiable rule: the model answers only from the retrieved listings and must cite the listing ID for every property it mentions. A real estate answer that invents a listing or misquotes a price is a liability, not a bug.

PROMPT = """You are a real estate assistant. Answer using ONLY the listings
provided in CONTEXT. For every property you mention, cite its [listing_id].
If no listing matches, say so plainly — never invent properties or prices.
Do NOT comment on the demographics, religion, family makeup, or any protected
characteristic of a neighborhood or its residents.

CONTEXT:
{listings}

QUESTION: {question}
"""

from vertexai.generative_models import GenerativeModel
model = GenerativeModel("gemini-1.5-flash")
answer = model.generate_content(
    PROMPT.format(listings=format_cards(top5), question=user_q)
).text

Fair Housing: Compliance as Architecture

This is the part that gets agencies sued. The Fair Housing Act prohibits discrimination based on race, color, religion, sex, disability, familial status, and national origin. An AI assistant that answers "is this a good area for a family like mine?" or steers buyers toward or away from neighborhoods based on demographics is creating direct legal exposure. You cannot rely on the model to "know better" — guardrails must be structural.

Three structural defenses, none of which depend on the model behaving:

Pre-retrieval intent classification. Before any search runs, a lightweight classifier flags queries that ask about protected characteristics or request steering ("safe neighborhood for...", "mostly [group] area"). Flagged queries get a fixed, compliant response and never reach retrieval.
Curated, neutral data only. The neighborhood corpus contains facts — school ratings, transit, amenities, commute times — not demographic commentary. If steering content isn't in the index, it can't be retrieved.
Output filtering. A final check on the generated answer blocks responses that drifted into protected territory, with full logging for audit.

This is the real-estate instance of a general principle from this series: encode the rules in the system, don't hope the model infers them. Compliance, like a price filter, is too important to leave to similarity scores.

Keeping the Index Fresh

Listings change constantly, and a stale index is the fastest way to lose buyer trust. The pattern that works: treat the MLS/listings database as the source of truth and propagate changes incrementally.

Event-driven updates. A change in the listings DB (price drop, status flip) fires a Pub/Sub message; a Cloud Function updates the row in AlloyDB and re-embeds only if the description changed. Price and status updates skip re-embedding entirely — they're just column writes.
Status as a filter, never a delete. Sold and pending listings stay in the table with their status flag; the WHERE status = 'active' predicate keeps them out of buyer results while preserving them for comps and analytics.
Nightly reconciliation. A scheduled job diffs the index against the source of truth to catch anything the event stream missed — belt and suspenders for the one thing buyers will notice immediately.

Measuring Whether It Works

Two failure modes matter in real estate, and they map to the two halves of RAG. Retrieval failures (wrong or missing listings) and generation failures (hallucinated facts, fair-housing drift). Measure them separately:

Metric	What it catches	How
Constraint accuracy	Results that violate a hard filter (over budget, wrong bed count)	Deterministic assertion on result set — should be 100%
Retrieval recall	Relevant active listings the system missed	Golden query set with known-good listings, checked weekly
Faithfulness	Invented properties, misquoted prices	Vertex AI Evaluation / LLM-as-judge against retrieved context
Compliance violation rate	Steering or protected-class commentary	Adversarial query suite run on every deploy — target zero

The two numbers to watch from day one: constraint accuracy (are we ever showing a $900k house to a $750k buyer?) and compliance violation rate (did we ever steer?). Constraint accuracy should be exactly 100% because it's deterministic SQL — if it isn't, your filter logic is broken. Compliance violations should be zero on an adversarial test suite before you ever go live. Everything else is optimization; these two are go/no-go.

Lessons Learned

Embed the prose, filter the numbers. The single biggest quality win is refusing to put structured fields into the vector. Price, beds, geo, and status are SQL predicates; descriptions and features are embeddings.
One store beats two. Keeping structured columns and embeddings in the same AlloyDB row eliminates the consistency nightmare of syncing a separate vector database against a transactional listings DB. Freshness becomes a normal UPDATE.
Compliance is a pre-retrieval gate, not a post-hoc filter. The cheapest place to stop a fair-housing problem is before retrieval runs. By the time the model has generated text, you're cleaning up rather than preventing.
Status-as-filter, not delete. Sold listings are gold for comps and market reports. Flag them, don't drop them.
Gemini Flash is enough. For grounded listing Q&A, the cheaper, faster model is the right default. Reserve Pro for genuinely multi-hop reasoning like comparing five comps across neighborhoods.

The through-line from Part 1 holds with force in this domain: RAG is a search problem with a language model on the end, and the search half is where the engineering lives. In real estate, the structured constraints, the freshness pipeline, and the compliance gate are all search-side decisions. Get those right and Gemini's job becomes easy: read five real listings and answer honestly, with citations.

📚 The series

RAG on GCP: From First Corpus to Production
Building a RAG System on GCP for a Real Estate Agency (this article)