
AI Agent Memory Systems: How Agents Remember, Learn, and Improve

Oliver Schmidt

DevOps engineer covering AI agents for operations and deployment.

March 18, 2026 · 19 min read


Agent Memory Systems: A Technical Deep Dive Beyond the Context Window

The context window is not memory. It's a whiteboard that gets erased every conversation. Yet most agent frameworks treat it as if it's sufficient, bolting on a vector store and calling the problem solved. It isn't.

Memory is arguably the single biggest bottleneck separating useful AI agents from expensive autocomplete. An agent that can't remember what it did five minutes ago, can't learn from past mistakes, and can't build a model of its environment is fundamentally limited — no matter how capable its base model.

This article breaks down the five distinct memory types that matter for agents, how they actually work at an implementation level, and where the major frameworks get them right (and wrong).

The Memory Taxonomy

Human cognitive science gives us a useful taxonomy here. Memory isn't monolithic — it's a layered system with different stores, different access patterns, and different decay characteristics. Mapping these onto agent architectures reveals why a single "memory" abstraction always falls short.

| Memory Type | Human Analogue | Agent Implementation | Access Pattern | Persistence |
| --- | --- | --- | --- | --- |
| Short-term | Working memory | Context window | Sequential, full scan | Per-conversation |
| Long-term | Long-term storage | Vector database | Similarity search | Permanent |
| Episodic | Autobiographical memory | Event logs + retrieval | Temporal, associative | Permanent |
| Semantic | Factual knowledge | Knowledge graphs / structured stores | Query-based | Permanent, mutable |
| Procedural | Muscle memory / skills | Tool definitions, learned routines | Trigger-based | Permanent, evolvable |

Each of these solves a different problem. Let's dig into each one.

Short-Term Memory: The Context Window

Short-term memory in LLM agents is the context window itself — the sequence of tokens the model can attend to in a single forward pass. This is the most well-understood memory type because it's literally the model's input.

How It Actually Works

The context window is a fixed-size buffer measured in tokens. For GPT-4 Turbo, that's 128K tokens. For Claude 3.5 Sonnet, it's 200K. For Gemini 1.5 Pro, it's up to 2M tokens. But size isn't the whole story — attention is quadratic in cost (though linear approximations exist), and models demonstrably perform worse on information in the middle of long contexts (the "lost in the middle" phenomenon documented by Liu et al., 2023).

# The fundamental constraint: you must fit everything into one forward pass
import tiktoken

# One concrete tokenizer choice (an assumption — swap in your model's own)
_enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(_enc.encode(text))

def build_context(system_prompt: str, messages: list[dict], max_tokens: int) -> str:
    """Naive context construction — what most frameworks actually do."""
    kept: list[str] = []
    used = count_tokens(system_prompt)
    
    # Walk newest-to-oldest so it's the *oldest* messages that fall off
    for msg in reversed(messages):
        line = f"\n{msg['role']}: {msg['content']}"
        used += count_tokens(line)
        if used > max_tokens:
            break  # Drop older messages — brutal but common
        kept.append(line)
    
    return system_prompt + "".join(reversed(kept))

The Real Engineering Problem

The hard part isn't fitting tokens — it's deciding which tokens to keep. This is the summarization-and-eviction problem, and it's where most implementations diverge.

Strategy 1: Sliding Window with Summary

class SlidingWindowMemory:
    def __init__(self, window_size: int = 10, summary_threshold: int = 8):
        self.messages = []
        self.summary = ""
        self.window_size = window_size
        self.summary_threshold = summary_threshold
    
    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        
        if len(self.messages) > self.window_size:
            # Evict oldest messages into a summary
            overflow = self.messages[:len(self.messages) - self.summary_threshold]
            self.messages = self.messages[len(self.messages) - self.summary_threshold:]
            
            # Summarize the overflow (requires an LLM call)
            overflow_text = "\n".join(
                f"{m['role']}: {m['content']}" for m in overflow
            )
            self.summary = llm_summarize(
                f"Previous conversation summary: {self.summary}\n\n"
                f"New messages to incorporate:\n{overflow_text}"
            )
    
    def get_context(self, system_prompt: str) -> list[dict]:
        context = []
        if self.summary:
            context.append({
                "role": "system", 
                "content": f"Previous conversation summary:\n{self.summary}"
            })
        context.extend(self.messages)
        return context

The problem here is obvious: summarization is lossy. You're compressing information through an LLM that might drop crucial details. A user mentioning their database is PostgreSQL on line 3 of a 50-line conversation might not survive summarization.

Strategy 2: Importance-Based Eviction

A better approach scores messages by importance before evicting them:

importance_prompt = """Rate the long-term importance of this message on a scale of 1-10.
Consider: Does it contain facts, preferences, constraints, decisions, or context
that would be needed later in this conversation or in future interactions?

Message: {message}
Score:"""

import re

def score_importance(message: str) -> float:
    response = llm_call(importance_prompt.format(message=message))
    # LLMs don't always return a bare number — pull out the first numeric token
    match = re.search(r"\d+(?:\.\d+)?", response)
    return float(match.group()) if match else 5.0  # fall back to mid importance

This adds latency (one LLM call per message) but preserves information that matters. OpenAI's Assistants API takes a related approach server-side: threads are truncated opaquely while tools like file_search pull relevant stored context back in on demand.
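As a sketch, here's how those scores might drive eviction — the threshold and window size are illustrative, not tuned values:

def evict_by_importance(messages: list[dict], max_keep: int) -> list[dict]:
    """Hypothetical policy: always keep the newest messages, and keep
    older ones only if they score above an importance threshold."""
    recent = messages[-max_keep:]
    older = messages[:-max_keep]
    important = [m for m in older if score_importance(m["content"]) >= 7]
    return important + recent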

Framework Implementations

  • LangChain: ConversationBufferWindowMemory, ConversationSummaryBufferMemory — straightforward sliding window with optional summarization. The summary buffer is the most practical default.
  • OpenAI Assistants API: Manages thread state server-side. You don't control eviction directly — it happens opaquely. This is convenient until you need to understand what was lost.
  • CrewAI: Delegates to LangChain's memory abstractions under the hood. Adds a ShortTermMemory class that's essentially a buffer with configurable window size.

The honest assessment: short-term memory is a solved engineering problem but an unsolved information-theoretic one. We're always making lossy compression decisions, and we don't have good metrics for what was actually important until later.

Long-Term Memory: Vector Databases

Long-term memory is where agents persist information across conversations. The dominant implementation pattern is vector similarity search: embed information as high-dimensional vectors, store them, and retrieve relevant pieces at query time.

The Basic Pipeline

Input Text → Chunking → Embedding Model → Vector DB → Similarity Search → Retrieved Context

Every framework does this. The devil is in each step.

Chunking: Where Most Pipelines Break

The chunking strategy determines what your agent can and cannot remember. Get it wrong and you'll retrieve half-sentences or irrelevant noise.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# The default approach most people start with
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_text(document)

This is fine for documents but terrible for conversational memory. A better approach for agent memory is semantic chunking — splitting at natural semantic boundaries rather than fixed token counts:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

# Chunks are created where semantic similarity drops significantly
chunks = semantic_splitter.split_text(conversation_transcript)

For agent-specific memory, I've found event-based chunking works best — each agent action, observation, and thought becomes its own memory unit:

from dataclasses import dataclass
from datetime import datetime
from uuid import uuid4

@dataclass
class MemoryUnit:
    content: str
    memory_type: str  # "action", "observation", "thought", "user_input"
    timestamp: datetime
    metadata: dict
    embedding: list[float] | None = None

class EventMemory:
    def __init__(self, embedding_model, vector_store):
        self.embedder = embedding_model
        self.store = vector_store
    
    def store_event(self, unit: MemoryUnit):
        unit.embedding = self.embedder.embed(unit.content)
        self.store.upsert(
            id=str(uuid4()),
            vector=unit.embedding,
            metadata={
                "content": unit.content,
                "type": unit.memory_type,
                "timestamp": unit.timestamp.isoformat(),
                **unit.metadata
            }
        )
    
    def recall(self, query: str, k: int = 5, 
               type_filter: str | None = None,
               time_decay: bool = True) -> list[MemoryUnit]:
        query_embedding = self.embedder.embed(query)
        
        filters = {}
        if type_filter:
            filters["type"] = type_filter
        
        results = self.store.query(
            vector=query_embedding,
            k=k * 3 if time_decay else k,  # Over-retrieve for reranking
            filter=filters
        )
        
        if time_decay:
            # Boost recent memories
            now = datetime.now()
            for r in results:
                age_hours = (now - datetime.fromisoformat(
                    r.metadata["timestamp"]
                )).total_seconds() / 3600
                r.metadata["relevance_score"] = (
                    r.score * (1 / (1 + 0.1 * age_hours))
                )
            results.sort(key=lambda x: x.metadata["relevance_score"], reverse=True)
            results = results[:k]
        
        return results

Embedding Model Selection Matters

The embedding model isn't just a detail — it fundamentally shapes what your agent can retrieve. Models with different training data surface different kinds of semantic relationships.

| Model | Dimensions | Good At | Weak At |
| --- | --- | --- | --- |
| text-embedding-3-small (OpenAI) | 1536 | General text similarity | Technical/specialized domains |
| text-embedding-3-large (OpenAI) | 3072 | Broad semantic matching | Cost at scale |
| voyage-3 (Voyage AI) | 1024 | Code, technical docs | Multilingual |
| e5-mistral-7b-instruct | 4096 | Instruction-following queries | Latency-sensitive apps |
| nomic-embed-text-v1.5 | 768 | Long documents (8K context) | Nuanced semantic distinction |

For agent memory specifically, I've found that instruction-tuned embedding models (like E5 variants) work better because agent queries tend to be natural language questions rather than keyword searches. The query "what database did the user say they're using?" performs very differently across embedding models.
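One concrete illustration: the E5 family is trained with asymmetric prefixes, so how you phrase the embedding input matters as much as which model you pick. A minimal sketch using sentence-transformers (model name and strings are just examples):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")

# E5 models expect stored text embedded as "passage: ..." and searches
# as "query: ..." — omitting the prefixes degrades retrieval quality
doc_vec = model.encode("passage: The user's database is PostgreSQL 15.")
query_vec = model.encode("query: what database did the user say they're using?")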

Vector Store Choices

The vector store itself has more impact than most people realize:

# ChromaDB — easy local development, limited at scale
import chromadb
client = chromadb.PersistentClient(path="./agent_memory")
collection = client.get_or_create_collection(
    name="episodes",
    metadata={"hnsw:space": "cosine"}
)

# Pinecone — managed, good for production, vendor lock-in
from pinecone import Pinecone
index = Pinecone(api_key="...").Index("agent-memory")

# Qdrant — self-hosted or managed, excellent filtering capabilities
from qdrant_client import QdrantClient
client = QdrantClient(host="localhost", port=6333)

# pgvector — if you already have PostgreSQL, this is the pragmatic choice
# No separate infrastructure, SQL-based filtering, good enough performance
# for most agent workloads

My recommendation: Start with pgvector if you have PostgreSQL. The ability to combine vector similarity with SQL WHERE clauses is invaluable for memory retrieval. You can do things like "find memories similar to this query AND from the last 7 days AND related to the 'deployment' project" without maintaining a separate metadata filtering layer.

-- pgvector gives you this, which is exactly what agent memory needs
SELECT content, memory_type, created_at,
       embedding <=> $1::vector AS distance
FROM agent_memories
WHERE project = 'deployment'
  AND created_at > NOW() - INTERVAL '7 days'
ORDER BY embedding <=> $1::vector
LIMIT 5;
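From Python, the pgvector adapter for psycopg (one client option among several) registers the vector type so an embedding binds as a normal query parameter — a minimal sketch, assuming a table shaped like the query above:

import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=agents")  # hypothetical DSN
register_vector(conn)  # lets numpy arrays bind directly as vector params

# query_embedding: a numpy array from your embedding model
rows = conn.execute(
    """
    SELECT content, memory_type, created_at,
           embedding <=> %s AS distance
    FROM agent_memories
    WHERE project = %s
      AND created_at > NOW() - INTERVAL '7 days'
    ORDER BY embedding <=> %s
    LIMIT 5
    """,
    (query_embedding, "deployment", query_embedding),
).fetchall()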

Episodic Memory: The Agent's Autobiography

Episodic memory stores experiences — specific events, interactions, and their outcomes. This is distinct from generic long-term memory because episodes are structured sequences with temporal context, not isolated facts.

Why It Matters

An agent with episodic memory can answer questions like: "What happened the last time I tried to deploy this service?" or "Have I encountered this error before, and what fixed it?" This requires storing not just individual facts but coherent narratives of past experiences.

Implementation Pattern: Episode Graphs

@dataclass
class Episode:
    id: str
    start_time: datetime
    end_time: datetime | None
    goal: str
    steps: list[dict]  # Ordered list of actions and observations
    outcome: str  # "success", "failure", "partial", "abandoned"
    summary: str
    lessons: list[str]
    embedding: list[float]
    related_episodes: list[str]  # Links to other episode IDs

class EpisodicMemory:
    def __init__(self, vector_store, graph_store=None):
        self.store = vector_store
        self.graph = graph_store  # Optional: for episode relationships
        self.current_episode: Episode | None = None
    
    def start_episode(self, goal: str) -> Episode:
        self.current_episode = Episode(
            id=str(uuid4()),
            start_time=datetime.now(),
            end_time=None,
            goal=goal,
            steps=[],
            outcome="in_progress",
            summary="",
            lessons=[],
            embedding=[],
            related_episodes=[]
        )
        return self.current_episode
    
    def record_step(self, action: str, observation: str):
        if not self.current_episode:
            raise RuntimeError("No active episode")
        
        self.current_episode.steps.append({
            "timestamp": datetime.now().isoformat(),
            "action": action,
            "observation": observation
        })
    
    def end_episode(self, outcome: str, summary: str, lessons: list[str]):
        if not self.current_episode:
            raise RuntimeError("No active episode")
        
        self.current_episode.end_time = datetime.now()
        self.current_episode.outcome = outcome
        self.current_episode.summary = summary
        self.current_episode.lessons = lessons
        
        # Generate embedding from the full episode narrative
        narrative = self._episode_to_narrative(self.current_episode)
        self.current_episode.embedding = embed(narrative)
        
        # Find and link related past episodes
        related = self.store.query(
            vector=self.current_episode.embedding,
            k=5
        )
        self.current_episode.related_episodes = [r.id for r in related]
        
        # Persist
        self._store_episode(self.current_episode)
        
        # Extract and store semantic memories (facts learned)
        for lesson in lessons:
            store_as_semantic_memory(lesson, source=self.current_episode.id)
        
        self.current_episode = None
    
    def recall_similar_episodes(self, situation: str, k: int = 3) -> list[Episode]:
        """Find past episodes similar to the current situation."""
        embedding = embed(situation)
        results = self.store.query(vector=embedding, k=k)
        return [self._load_episode(r.id) for r in results]
    
    def _episode_to_narrative(self, ep: Episode) -> str:
        """Convert structured episode to natural language for embedding."""
        steps_text = "\n".join(
            f"  Action: {s['action']}\n  Result: {s['observation']}"
            for s in ep.steps
        )
        return f"""Goal: {ep.goal}
Outcome: {ep.outcome}
Steps:
{steps_text}
Summary: {ep.summary}
Lessons: {', '.join(ep.lessons)}"""
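In use, the lifecycle looks like this — the deployment scenario and store wiring are hypothetical:

memory = EpisodicMemory(vector_store=my_vector_store)

memory.start_episode(goal="Deploy the payments service to staging")
memory.record_step(
    action="kubectl apply -f payments.yaml",
    observation="Pods stuck in ImagePullBackOff — registry credentials missing"
)
memory.record_step(
    action="kubectl create secret docker-registry regcred ...",
    observation="Deployment became healthy after the secret was created"
)
memory.end_episode(
    outcome="success",
    summary="Staging deploy failed on registry auth; fixed by creating regcred",
    lessons=["The staging cluster needs the regcred secret before image pulls"]
)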

The Reflection Pattern

The most powerful use of episodic memory is reflection — periodically reviewing past episodes to extract higher-order insights. This is how agents actually learn from experience rather than just storing it.

async def reflect(memory: EpisodicMemory, recent_episodes: list[Episode]):
    """Periodic reflection over recent episodes to extract patterns."""
    
    episodes_narrative = "\n---\n".join(
        memory._episode_to_narrative(ep) for ep in recent_episodes
    )
    
    reflection = await llm_call(f"""Review these recent episodes and identify:

1. Common failure patterns (what keeps going wrong?)
2. Successful strategies (what approaches have worked?)
3. Environment facts learned (what did we discover about the system?)
4. Skill improvements (what should we do differently next time?)

Episodes:
{episodes_narrative}

Provide structured insights:""")
    
    # Store reflections as high-priority semantic memories
    for insight in parse_insights(reflection):
        store_as_semantic_memory(
            content=insight.content,
            source="reflection",
            importance="high",
            confidence=insight.confidence
        )

Generative Agents (Park et al., 2023) demonstrated this pattern effectively — their simulated agents would periodically reflect on their experiences, generating higher-level insights that influenced future behavior. The key insight was that reflection queries should be generated from the agent's recent experiences, not from user input.
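A sketch of that pattern — letting recent experience decide what gets reflected on (the prompt wording is illustrative):

async def generate_reflection_questions(recent: list[Episode]) -> list[str]:
    """Derive the questions to reflect on from recent episodes,
    rather than waiting for a user prompt to trigger reflection."""
    summaries = "\n".join(f"- {ep.summary}" for ep in recent)
    response = await llm_call(
        "Given only these recent experiences, what are the 3 most salient "
        f"high-level questions worth reflecting on?\n\n{summaries}"
    )
    return [line.lstrip("-0123456789. ") for line in response.splitlines() if line.strip()]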

Semantic Memory: Facts and Knowledge

Semantic memory stores facts, concepts, and relationships — knowledge that's been abstracted away from specific experiences. "The production database runs PostgreSQL 15" is semantic memory. "Last Tuesday I connected to the production database and ran a migration" is episodic.

Knowledge Graphs for Semantic Memory

While vector stores work for retrieval-augmented generation, structured semantic memory benefits enormously from knowledge graphs. The relationships between facts matter as much as the facts themselves.

from neo4j import GraphDatabase

class SemanticMemory:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))
    
    def store_fact(self, subject: str, predicate: str, obj: str, 
                   confidence: float = 1.0, source: str = "observation"):
        with self.driver.session() as session:
            session.run("""
                MERGE (s:Entity {name: $subject})
                MERGE (o:Entity {name: $object})
                MERGE (s)-[r:RELATION {type: $predicate}]->(o)
                SET r.confidence = $confidence,
                    r.source = $source,
                    r.updated_at = datetime()
            """, subject=subject, predicate=predicate, object=obj,
                 confidence=confidence, source=source)
    
    def query_facts(self, entity: str, relation: str | None = None) -> list[dict]:
        with self.driver.session() as session:
            if relation:
                result = session.run("""
                    MATCH (s:Entity {name: $entity})-[r:RELATION {type: $relation}]->(o)
                    RETURN o.name AS object, r.confidence AS confidence
                """, entity=entity, relation=relation)
            else:
                # type(r) would return the label "RELATION" — the predicate
                # lives in the r.type property set by store_fact
                result = session.run("""
                    MATCH (s:Entity {name: $entity})-[r:RELATION]->(o)
                    RETURN r.type AS relation, o.name AS object, 
                           r.confidence AS confidence
                """, entity=entity)
            return [dict(r) for r in result]
    
    def find_path(self, entity_a: str, entity_b: str, max_depth: int = 3):
        """Find relationships between two entities — useful for reasoning."""
        # Cypher can't parameterize variable-length bounds, so inline the depth
        query = f"""
            MATCH path = shortestPath(
                (a:Entity {{name: $a}})-[*..{max_depth}]-(b:Entity {{name: $b}})
            )
            RETURN path
        """
        with self.driver.session() as session:
            result = session.run(query, a=entity_a, b=entity_b)
            # Materialize before the session closes
            return [record["path"] for record in result]

Extraction Pipeline

The hard part is automatically extracting structured facts from unstructured agent interactions:

fact_extraction_prompt = """Extract factual statements from this interaction.
For each fact, provide:
- subject: the entity being described
- predicate: the relationship or property
- object: the value or related entity
- confidence: 0.0-1.0 based on how certain this fact is

Interaction:
{interaction}

Output as JSON array:
[{"subject": "...", "predicate": "...", "object": "...", "confidence": 0.9}]

Only extract facts that are explicitly stated or strongly implied. 
Do not infer facts from limited evidence."""

import json

async def extract_and_store_facts(interaction: str, semantic_memory: SemanticMemory):
    response = await llm_call(fact_extraction_prompt.format(interaction=interaction))
    facts = json.loads(response)
    
    for fact in facts:
        # Check if this contradicts existing knowledge
        existing = semantic_memory.query_facts(
            fact["subject"], fact["predicate"]
        )
        
        if existing and existing[0]["object"] != fact["object"]:
            # Fact has changed — update with lower confidence in the new value
            # and log the contradiction
            semantic_memory.store_fact(
                subject=fact["subject"],
                predicate=f"{fact['predicate']}_updated",
                obj=fact["object"],
                confidence=fact["confidence"] * 0.8,
                source="observation_update"
            )
        else:
            # The extracted key is "object" but store_fact takes "obj",
            # so pass the fields explicitly rather than unpacking
            semantic_memory.store_fact(
                subject=fact["subject"],
                predicate=fact["predicate"],
                obj=fact["object"],
                confidence=fact["confidence"]
            )

Framework Implementations

  • MemGPT / Letta: The most sophisticated memory architecture in open-source agents. Implements a tiered memory system inspired by operating system virtual memory — an "in-context" working memory (the context window), an "archival" memory (vector store), and a "recall" memory (conversation history with search). The key innovation is that the agent itself manages its memory through explicit memory management tools, deciding what to store, retrieve, and evict.
# MemGPT's approach: the agent has explicit memory management tools
MEMORY_TOOLS = {
    "core_memory_append": "Append to in-context working memory",
    "core_memory_replace": "Replace content in working memory",
    "archival_memory_insert": "Store information in long-term archival storage",
    "archival_memory_search": "Search archival memory with vector similarity",
    "conversation_search": "Search past conversation history",
}
  • LangGraph: Provides building blocks rather than a complete memory system. You compose BaseStore implementations with custom retrieval logic. This flexibility is powerful but means you're building the memory architecture yourself — see the sketch after this list.

  • Zep: Purpose-built memory layer for agents. Handles automatic fact extraction, entity resolution, and temporal reasoning out of the box. Uses a knowledge graph internally.
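For reference, the LangGraph building-block style mentioned above looks roughly like this — a minimal sketch; the store API is still evolving, so verify against your installed langgraph version:

from langgraph.store.memory import InMemoryStore

store = InMemoryStore()

# Memories are namespaced key-value documents; retrieval logic is yours
namespace = ("user-123", "memories")
store.put(namespace, "db-pref", {"text": "User's database is PostgreSQL 15"})

item = store.get(namespace, "db-pref")
print(item.value["text"])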

Procedural Memory: Learning How to Do Things

Procedural memory is the most underexplored memory type in agent systems. It stores how to perform tasks — not what happened (episodic) or what's true (semantic), but the patterns of action that lead to successful outcomes.

Tool Use Patterns

The simplest form of procedural memory is remembering which tools and tool sequences work for which tasks:

@dataclass
class Procedure:
    name: str
    trigger_pattern: str  # Natural language description of when to use this
    steps: list[dict]     # Ordered tool calls with parameter templates
    success_rate: float
    last_used: datetime
    embedding: list[float]

class ProceduralMemory:
    def __init__(self):
        self.procedures: dict[str, Procedure] = {}
    
    def learn_procedure(self, task_description: str, 
                        successful_steps: list[dict],
                        outcome: str):
        """Extract a reusable procedure from a successful task execution."""
        
        # Check if a similar procedure already exists
        existing = self.find_procedure(task_description)
        
        if existing and outcome == "success":
            # Reinforce existing procedure
            existing.success_rate = (
                existing.success_rate * 0.8 + 0.2
            )
            existing.last_used = datetime.now()
        elif outcome == "success":
            # Create new procedure
            proc = Procedure(
                name=f"proc_{uuid4().hex[:8]}",
                trigger_pattern=task_description,
                steps=successful_steps,
                success_rate=1.0,
                last_used=datetime.now(),
                embedding=embed(task_description)
            )
            self.procedures[proc.name] = proc
    
    def find_procedure(self, task_description: str) -> Procedure | None:
        """Find a known procedure for this type of task."""
        task_embedding = embed(task_description)
        
        best_match = None
        best_score = 0.0
        
        for proc in self.procedures.values():
            score = cosine_similarity(task_embedding, proc.embedding)
            if score > 0.85 and score > best_score:
                best_match = proc
                best_score = score
        
        return best_match
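The payoff comes at planning time: consult procedural memory before asking the base model to plan from scratch. A hypothetical flow (plan_from_scratch and the threshold are illustrative):

proc_mem = ProceduralMemory()

known = proc_mem.find_procedure("roll back a failed database migration")
if known and known.success_rate > 0.7:
    plan = known.steps          # reuse the learned tool sequence
else:
    plan = plan_from_scratch()  # hypothetical fallback to LLM planning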

Voyager: The Gold Standard for Procedural Memory

The Voyager agent (Wang et al., 2023) in Minecraft is the best example of procedural memory in practice. It maintains a skill library — a collection of executable JavaScript programs that represent learned behaviors. When it encounters a new challenge, it first searches the skill library for relevant past solutions before attempting to write new code.

# Simplified version of Voyager's skill library
class SkillLibrary:
    def __init__(self, vector_store):
        self.skills = vector_store
    
    def store_skill(self, name: str, description: str, code: str):
        self.skills.upsert(
            id=name,
            vector=embed(description),
            metadata={
                "name": name,
                "description": description,
                "code": code,
                "success_count": 0,
                "failure_count": 0
            }
        )
    
    def retrieve_skills(self, task: str, k: int = 3) -> list[dict]:
        results = self.skills.query(vector=embed(task), k=k)
        return [
            {
                "name": r.metadata["name"],
                "description": r.metadata["description"],
                "code": r.metadata["code"],
                "reliability": (
                    r.metadata["success_count"] / 
                    max(1, r.metadata["success_count"] + r.metadata["failure_count"])
                )
            }
            for r in results
        ]

The key insight from Voyager: procedural memory should be executable, not just descriptive. Storing "use a for loop to iterate through items" as text is far less useful than storing the actual code that does it.

Putting It All Together: Architecture Patterns

No production agent uses just one memory type. Here's how they compose:

┌──────────────────────────────────────────────┐
│                  Agent Loop                  │
│                                              │
│  ┌────────────────────────────────────────┐  │
│  │      Context Window (Short-term)       │  │
│  │  System prompt + recent messages +     │  │
│  │  retrieved context from other stores   │  │
│  └──────────────┬─────────────────────────┘  │
│                 │                            │
│  ┌──────────────┴─────────────────────────┐  │
│  │            Memory Manager              │  │
│  │  Decides what to store where,          │  │
│  │  what to retrieve, when to reflect     │  │
│  └──┬───────┬───────┬───────┬───────┬─────┘  │
│     │       │       │       │       │        │
│  ┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐   │
│  │ Vec  ││  KG  ││ Epi  ││ Proc ││ Conv │   │
│  │ DB   ││      ││      ││      ││ hist │   │
│  └──────┘└──────┘└──────┘└──────┘└──────┘   │
│   Long-   Seman-  Episo-  Proce-  Recall     │
│   term    tic     dic     dural   memory     │
└──────────────────────────────────────────────┘

The Retrieval Orchestrator Pattern

The most practical architecture I've seen uses a retrieval orchestrator that queries all memory stores and synthesizes results:

import asyncio
import json

class MemoryOrchestrator:
    def __init__(self, long_term, episodic, semantic, procedural):
        self.long_term = long_term
        self.episodic = episodic
        self.semantic = semantic
        self.procedural = procedural
    
    async def retrieve_context(self, current_situation: str, 
                                current_goal: str) -> str:
        """Query all memory stores and build rich context."""
        
        # Parallel retrieval — the stores sketched above are synchronous,
        # so run them in worker threads instead of assuming async variants.
        # (query_relevant_facts is an assumed wrapper over query_facts.)
        results = await asyncio.gather(
            asyncio.to_thread(self.long_term.recall, current_situation, 3),
            asyncio.to_thread(self.episodic.recall_similar_episodes, current_situation, 2),
            asyncio.to_thread(self.semantic.query_relevant_facts, current_situation),
            asyncio.to_thread(self.procedural.find_procedure, current_goal),
        )
        
        long_term_results, episodes, facts, procedure = results
        
        # Build structured context
        context_parts = []
        
        if facts:
            context_parts.append(
                "## Known Facts\n" + 
                "\n".join(f"- {f}" for f in facts)
            )
        
        if episodes:
            context_parts.append(
                "## Similar Past Experiences\n" +
                "\n".join(
                    f"- {ep.summary} (outcome: {ep.outcome}, "
                    f"lessons: {', '.join(ep.lessons)})"
                    for ep in episodes
                )
            )
        
        if procedure:
            context_parts.append(
                f"## Known Procedure: {procedure.name}\n"
                f"Steps: {json.dumps(procedure.steps, indent=2)}\n"
                f"Reliability: {procedure.reliability:.0%}"
            )
        
        if long_term_results:
            context_parts.append(
                "## Relevant Past Conversations\n" +
                "\n".join(f"- {r.content}" for r in long_term_results)
            )
        
        return "\n\n".join(context_parts)
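Wiring it together might look like this — the store constructors reuse the classes sketched earlier, and the prompt assembly is illustrative:

orchestrator = MemoryOrchestrator(
    long_term=EventMemory(embedder, vector_store),
    episodic=EpisodicMemory(vector_store),
    semantic=SemanticMemory("bolt://localhost:7687", "neo4j", "password"),
    procedural=ProceduralMemory(),
)

context = await orchestrator.retrieve_context(
    current_situation="Deploy failing with ImagePullBackOff on staging",
    current_goal="Get the payments service deployed to staging",
)
prompt = f"{system_prompt}\n\n{context}\n\nUser: {user_message}"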

Limitations and Open Problems

Being honest about what doesn't work yet:

1. Retrieval quality is the bottleneck. Vector similarity search is a blunt instrument. "Similar" doesn't mean "relevant." An agent searching for "how to fix the deployment error" might retrieve memories about deployments that have nothing to do with the current error. Hybrid search (combining vector similarity with keyword matching and metadata filtering) helps but doesn't solve this — a sketch of one common fusion method follows this list.

2. Memory consolidation is unsolved. Humans consolidate memories during sleep — strengthening important ones, discarding noise, integrating new information with existing knowledge. We don't have good automated equivalents. Periodic LLM-based reflection is a crude approximation.

3. Contradiction detection is fragile. When an agent learns that "the API endpoint changed from v1 to v2," it needs to invalidate or update the old memory. Current systems are terrible at this. You end up with contradictory memories that confuse the agent.

4. Evaluation is nearly impossible. How do you measure if an agent's memory system is "good"? There are no standard benchmarks for agent memory. You can measure retrieval precision/recall for vector stores, but that doesn't capture whether the agent actually uses its memory effectively.

5. Cost scales linearly with memory operations. Every memory store, retrieve, and reflect operation is an LLM call or an embedding computation. An agent that aggressively manages its memory might spend more on memory operations than on actual task execution.
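On the hybrid-search point from item 1: a standard way to merge a vector result list with a keyword (e.g., BM25) result list is reciprocal rank fusion, sketched here assuming each list holds ranked document IDs:

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Score each document by summing 1 / (k + rank) across result lists;
    documents that rank well in multiple lists float to the top."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([vector_hits, bm25_hits])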

What's Next

The field is moving fast. A few trends worth watching:

  • State-space models (Mamba, RWKV) offer linear-time sequence processing, potentially making very large working memories practical without quadratic attention costs.
  • Learned memory management — training models to decide what to memorize rather than using heuristics. Letta/MemGPT's approach of giving the agent explicit memory management tools is a step in this direction.
  • Structured memory representations that go beyond flat vector stores — combining the retrieval flexibility of vectors with the reasoning capabilities of knowledge graphs.
  • Memory sharing between agents — multi-agent systems where agents can access each other's memories, enabling specialization and knowledge accumulation across agent teams.

The bottom line: memory is where the real intelligence in AI agents will come from. The base model provides reasoning capability, but memory provides the raw material to reason about. Frameworks that treat memory as a solved problem — or worse, an afterthought — will produce agents that feel clever for one conversation and amnesiac by the next.

Keywords

AI agent, research-agents