AI Agent Memory: The Infrastructure Layer Nobody Told You About | Dmitry Shirokov

Most agent demos work great. Your agent remembers what you said two messages ago, follows through on a multi-step task, and everyone in the room is suitably impressed. Then you deploy it to production. Your first real user comes back three days later and says: "Didn't I already tell you my name?" And you realize — the context window is not memory. It's a scratchpad. Everything on it disappears when the session ends.

For agents that need to work with humans over days, weeks, or months — learning preferences, tracking project history, accumulating domain knowledge — this is a fundamental architectural gap, and one that the "just use a bigger context window" crowd has been trying to paper over with varying success. This post is about what agent memory actually is, how the major platforms implement it, what can go wrong in production, and what the security picture looks like heading into 2026.

The Four Memory Types (Not One)

The CoALA paper (Princeton, 2023) formalized a memory taxonomy for language model agents that maps directly to cognitive science. It's more useful as an architecture decision tool than as academic theory, because each type has different storage and retrieval characteristics that affect your implementation choices.

flowchart TB
    U(["User / Tool Input"]) --> W

    subgraph W ["Working Memory — In-Context"]
        WC["Active context window\nCurrent session only · Ephemeral\nCost scales with every token"]
    end

    W <-->|"retrieve on session start\nwrite at end async"| EP
    W <-->|"semantic lookup\nfact injection"| SM
    W <-->|"skill retrieval\nexample selection"| PR

    subgraph EP ["Episodic Memory"]
        E1["Past interactions · Conversation logs\nWhat happened and when\n→ Vector store / time-series DB"]
    end

    subgraph SM ["Semantic Memory"]
        S1["World knowledge · User facts\nEntity relationships · Domain rules\n→ Knowledge graph / RAG index"]
    end

    subgraph PR ["Procedural Memory"]
        P1["Learned workflows · Tool patterns\nHow to do things well\n→ System prompts / Fine-tuning"]
    end

Four memory types, four different storage and retrieval patterns. Most agent architectures only implement one or two. Most production failures trace back to the missing ones.

In practice, most agents only implement in-context memory (the default, unavoidable) and maybe episodic memory (conversation logs). Semantic memory — the accumulation of structured facts about users, entities, and domain context — is where the interesting personalization and reasoning lives, and it's the type most commonly absent from v1 implementations. Procedural memory is usually baked in as the system prompt rather than being genuinely learned, which is why agents don't actually improve from experience unless someone updates the prompt manually.

Why Bigger Context Windows Don't Solve This

Gemini 1.5 Pro has a 2M token context window. Claude 3 handles 200K. GPT-4o supports 128K. And every few months someone argues that external memory is a short-term workaround that will become unnecessary as context windows grow. This argument has three problems.

Cost. Sending 200K tokens of history in every API call for every user interaction is genuinely expensive. An agent handling 50 daily interactions, each requiring 6 months of context, would burn through a cloud budget faster than an overeager intern with a corporate card. External memory lets you retrieve the 3–5 most relevant memories and inject only those — a fraction of the token cost.

Recall quality. Research consistently shows that LLM recall accuracy degrades for information buried in the middle of long contexts — the "lost in the middle" effect (Liu et al., 2023). Bigger context windows make the problem worse, not better: there's just more middle. Structured external retrieval with semantic search outperforms "stuff everything in the prompt" on recall benchmarks.

Session boundary. No matter how big the context window, it resets when the session ends. There is no mechanism in a standard LLM API call that persists state forward across sessions without explicit external storage. This is a hard architectural constraint, not a temporary limitation waiting to be engineered away.

The Three Retrieval Approaches

Vector store (RAG-based episodic memory)

Every conversation turn gets embedded and stored. At session start, the agent embeds the current query, runs a nearest-neighbor search, and retrieves the k most semantically similar memories to inject into context. This is the fastest path from zero to working memory and the most widely deployed approach.

It works well for unstructured, conversational memories: "user mentioned they prefer Python over Node.js," "user's deadline is end of Q3," "previous attempt at X failed because of Y." The semantic similarity search surfaces relevant memories even when the current query doesn't use the exact same words as the stored memory.

Where it breaks: vector similarity is not the same as factual relevance. A query about "the billing system" might surface five memories mentioning "billing" from different projects, different time periods, and different contexts — all semantically close, none of them the right one. Multi-hop questions ("what did we decide about the API that connects to the service the user mentioned last month?") are structurally unsolvable by pure vector similarity. And retrieval adds 200–500ms per query before the LLM even starts.

Knowledge graph memory

Entities and relationships are stored as graph nodes and edges. "User → prefers → dark mode," "Project Alpha → depends_on → billing_service," "billing_service → owned_by → Team B." Graph traversal can answer multi-hop questions by following relationship paths — something flat vector search fundamentally cannot do.

Benchmarks from 2025 production deployments consistently show graph methods at ~92% recall and ~88% precision versus vector RAG at ~85% recall and ~75% precision. That gap matters when an agent is making consequential decisions based on retrieved context. Zep's time-aware graph architecture (Rasmussen et al., 2025) adds temporal relationships, so the agent can reason about how facts changed over time — not just what's true now, but what was true when.

The cost: considerably more engineering. Entity extraction at write time, schema design for your relationship types, graph database operations. For purely conversational memory with no entity relationships, it's overkill. For agents that need to reason about interconnected entities over time — customer relationship management, project tracking, medical history — it's the right tool.

Hybrid (what mature systems actually use)

Vector store for episodic recall plus knowledge graph for entity relationships plus structured storage for explicit user facts. Mem0's architecture runs vector search and graph traversal in parallel and merges the results. All three major cloud platforms have converged on some version of this approach. It's more complex to operate but avoids the failure modes of any single approach.

The Read/Write Pipeline Under the Hood

sequenceDiagram
    participant U as User
    participant A as Agent
    participant M as Memory Store
    participant LLM as LLM

    U->>A: New message
    A->>M: retrieve(query, user_id, top_k=5)
    M-->>A: relevant memory records
    Note over A: Build context: memories + recent history + message

    A->>LLM: [system prompt + memories + message]
    LLM-->>A: response
    A-->>U: response

    A-)M: extract_and_store(conversation) — async
    Note over A,M: LLM extracts key facts from the turn.\nADD / UPDATE / DELETE / NOOP.\nDoes not block the response.

The read path is synchronous and latency-sensitive. The write path is async — extraction and storage happen in the background after the response is delivered. Most platforms (Bedrock AgentCore, Vertex Memory Bank) follow this pattern exactly.

The async write design is deliberate. Running an LLM extraction pass over every conversation turn adds latency if done synchronously. By decoupling the write from the response, the memory system can take as long as it needs — including running extraction retries, deduplication, and contradiction resolution — without affecting the user experience. The tradeoff: a user might return slightly before their last conversation has been fully processed into memory. For most use cases this is acceptable. For real-time collaborative agents, it might not be.

What Each Platform Actually Does

AWS

Amazon Bedrock AgentCore Memory

Launched at AWS Summit NYC 2025. Fully managed, two-tier architecture: short-term memory captures raw interaction context within a session (DynamoDB-backed), while long-term memory runs an async extraction pipeline that distills conversations into persistent semantic records.

Retrieval via RetrieveMemoryRecords does semantic search across stored records. Compatible with any agent framework — LangGraph, CrewAI, LlamaIndex, OpenAI Agents SDK. AWS selected Mem0 as the default memory provider for the Strands Agent SDK in May 2025.

Honest assessment: Solid managed option if you're already AWS-native. The async extraction means your first query after a session might not yet see the latest memories.

GCP

Vertex AI Memory Bank

In public preview since late 2025. Built on methodology from Google AI Research (accepted at ACL 2025). Uses Gemini to extract facts asynchronously with a topic-based indexing approach.

Distinct separation between Sessions (within-conversation state, like a shopping cart) and Memory Bank (cross-session persistence, like a user profile). Memory Bank specifically handles contradiction resolution — if you said one thing six months ago and the opposite last week, it tries to reconcile rather than silently keeping both.

Honest assessment: The contradiction resolution is genuinely useful and differentiating. ADK integration is clean if you're building Google-native. Early preview roughness in the API surface.

OpenAI

Assistants API / Responses API

OpenAI's approach is more fragmented. Threads provide persistent conversation history. Vector stores via the File Search tool provide document RAG (up to 100M files as of Nov 2025). There is no first-party long-term semantic memory service equivalent to AgentCore or Memory Bank.

The Sessions primitive in the Agents SDK handles within-session state (SQLite or Redis-backed). Cross-session memory is your problem — the ecosystem fills the gap: Mem0, Zep, LangMem, and Hindsight are all common choices for OpenAI agent deployments.

Honest assessment: Flexible but requires more assembly. If you're already using the Assistants API, the thread persistence is convenient. For production agents with real memory requirements, you'll need an external layer.

Anthropic / Claude

Memory Tool + MCP

Anthropic's approach is intentionally minimal and transparent. The Memory Tool in the API lets agents read and write a file-based memory directory — plain Markdown files that are auditable, editable, and don't require a vector database. For simpler use cases, this is genuinely sufficient and much easier to reason about than black-box vector retrieval.

For production, Claude integrates with external memory systems via MCP (Model Context Protocol). Mem0, Chroma, and other memory stores expose MCP servers. Claude calls them using standard tool use — the same mechanism it uses for any other tool. Composable by design, but you're responsible for the plumbing.

Honest assessment: The file-based approach works well for single-user or small-scale deployments where you want full visibility. Multi-tenant production systems need an external memory layer; MCP makes the integration clean but doesn't build it for you.

Platform Comparison at a Glance

Platform	Short-term memory	Long-term memory	Contradiction handling	Framework support
AWS AgentCore	Session-scoped (DynamoDB)	Managed semantic records (async extraction)	Limited	LangGraph, CrewAI, Strands, OpenAI SDK, LlamaIndex
GCP Memory Bank	Sessions (separate object)	Topic-indexed, Gemini-extracted (async)	Built-in resolution	ADK, LangGraph, CrewAI
OpenAI Assistants	Threads (persistent history)	None first-party — ecosystem (Mem0, Zep)	None first-party	Any via Responses API
Claude + MCP	Context window / Session tool	File-based or MCP-connected external store	External system dependent	Any via MCP servers

The Security Problem Everyone Underestimates

Memory persistence introduces an attack surface that most teams aren't thinking about yet. The threat is called memory poisoning, and it's more sophisticated than it sounds.

MINJA Attack (NeurIPS 2025): Researchers demonstrated >95% injection success rates against production agents using only query-level interaction — no direct access to the memory store required. The attack crafts inputs designed to be stored in memory as "plausible learnings," which then influence future agent behavior when semantically triggered. Temporally decoupled: poison planted day 1 may not execute until day 40. OWASP lists this as a top agentic risk (ASI06, 2026).

What makes memory poisoning particularly nasty is the temporal decoupling. Traditional prompt injection attacks execute immediately and are visible in the response. Memory poisoning is a sleeper: the attack gets stored, the conversation ends, and the malicious influence only activates when a future conversation matches the right semantic trigger. Between October 2025 and January 2026, honeypots captured over 91,000 attack sessions actively probing agent memory endpoints.

The defense landscape is thin. Detection-based moderation is partially effective but can be bypassed by attacks that embed plausible reasoning within contextually harmless content. More robust mitigations are architectural:

Write-time validation: Apply a separate LLM call (or rule-based filter) to proposed memory writes before storing them. Anything that looks like an instruction rather than a fact should be rejected.
Scope limits: Don't give every agent access to the full memory store. User-scoped memory, workspace-scoped memory, not global agent memory. The blast radius of a poisoned record should be bounded.
TTL on stored records: Memory records that expire naturally limit the window during which a poisoned record can cause harm.
Audit logging: Every memory write should be logged with the session, timestamp, and source. Forensics after an incident are impossible without this.
Human review for high-stakes writes: For agents operating in sensitive domains, require human approval before certain categories of memory are persisted.

Production Failure Modes

Security aside, the patterns that appear consistently in production deployments:

Memory entropy

Over time, the memory store fills with outdated, contradictory, and low-quality records. "User prefers dark mode" from 18 months ago, "user prefers light mode" from last month. Without explicit contradiction handling or TTL policies, the agent starts reasoning from stale state. This looks like hallucination to users but is actually a retrieval problem — and users almost never file a bug report that says "your agent retrieved an outdated memory."

Retrieval noise

Semantic similarity retrieval is good at finding related memories, not always the right memories. An agent helping with a Python data pipeline might retrieve memories from a previous Python project that ended two years ago — different stack, different constraints, no longer relevant. Now the context window has outdated assumptions baked in. The agent doesn't know the memories are old. It just uses them.

Over-confident memory use

Agents with memory systems start using them for everything. Instead of asking the user a clarifying question — which a memoryless agent would do — the agent retrieves something from six months ago and acts on it confidently. The memory was correct then. It's stale now. The failure looks like the agent is making things up; the actual cause is a retrieved fact that's no longer valid. Agents need uncertainty signals on memories, not just retrieval scores.

Cross-user contamination

In multi-tenant systems, memory isolation is a hard requirement. A memory store scoped at the system level rather than the user level isn't just a privacy violation waiting to happen — it means User A's preferences, projects, and history can surface in User B's session. This is both a GDPR problem and a correctness problem. Scope your memory stores at user or workspace level from day one. Retrofitting isolation into a production system is painful.

On-Premises and Hybrid Deployments

Regulated industries — healthcare, finance, defense — often can't use managed cloud memory services for long-term user data. The on-premises story is more fragmented than the cloud offerings, but it works.

Self-hosted vector store: Chroma, Weaviate, or Qdrant running in your own infrastructure. Pair with a custom extraction pipeline (LLM call on each conversation turn to extract facts, structured output, write to store). More engineering than a managed service, but full control over data residency and retention.

PostgreSQL + pgvector: If you're already running Postgres, the pgvector extension adds vector similarity search. Combined with a metadata schema for memory records (user_id, created_at, expires_at, content, embedding), this handles both episodic retrieval and structured fact storage with a single operational stack. Don't underestimate this option — it's simpler to operate than a dedicated vector database and the performance at moderate scale is entirely adequate.

Self-hosted Mem0: Mem0 supports self-hosted deployment and integrates natively with LangGraph, CrewAI, and the OpenAI Agents SDK. The Open Memory Protocol (OMP) standard they're developing aims to make memory store implementations interchangeable — useful if you expect to migrate later.

Graph database (Neo4j or Neptune): For agents with heavy entity-relationship reasoning requirements, a self-hosted Neo4j instance or Amazon Neptune gives you the graph traversal capabilities that pure vector search can't provide. Higher operational overhead, but the right tool for knowledge-graph-heavy use cases.

The honest assessment of on-prem: more engineering effort, less mature tooling than cloud offerings, no managed extraction pipeline. For regulated workloads where data residency is non-negotiable, that tradeoff is correct. For everyone else, start with a managed cloud service and move on-prem if you hit a compliance wall.

Best Practices: What the Production Deployments Agree On

Don't start with memory. Build the agent without it first. Understand exactly where it fails because of missing persistence. Then design the memory layer for those specific failure cases rather than adding a generic memory system and hoping it improves things. Memory that doesn't fix a real problem adds latency, complexity, and attack surface for no benefit.

Separate sessions from memory. Short-term within-session state and long-term cross-session memory are different objects with different lifecycle requirements. Keep them architecturally separate from the start. Session state is ephemeral. Memory is persistent. They need different storage backends, different TTL policies, and different access controls.

Write TTLs, not just inserts. Every memory record should have an expiration policy. User preferences change. Project contexts become stale. Memory stores without TTLs turn into archaeological sites — layers of outdated facts that confuse the agent and quietly degrade performance over months.

Validate before storing. Run a quality filter on memory writes. A confidence threshold, a validation prompt checking that the extracted fact is actually a fact (not an instruction, not a hallucination, not something ambiguous). Don't store raw LLM outputs as ground truth without verification.

Treat memory as an attack surface. Input validation, access controls scoped to user/workspace, audit logging on reads and writes, scope limits per agent. The same security hygiene you'd apply to a user database applies to an agent memory store. Probably more so, because the impact of poisoned memory is subtle and hard to detect.

The Bigger Picture

Memory is what separates an agent that can complete tasks from an agent that can build a working relationship. The former is genuinely useful. The latter is transformative — an agent that actually knows your codebase, your preferences, your constraints, the decisions you made three months ago and why. That's not demo territory anymore; it's production-ready in 2026 if you architect it carefully.

The technical ingredients exist: managed cloud memory services from AWS, Google, and the OpenAI ecosystem, mature third-party libraries like Mem0 and Zep, and a growing understanding in the community of what the failure modes look like and how to avoid them. The hard part isn't the technology anymore. It's deciding what your agent should remember, for how long, and who gets to see it — which turns out to be mostly a governance problem wearing a technical costume.