The first benchmark that actually tests memory the way agents use it — not chat history, but dense machine-generated trajectories full of tool calls, API responses, and causal chains. Spoiler: everything underperforms.
Existing memory benchmarks test dialogue-centric recall — can a chatbot remember what a human said 300 turns ago? But real-world agents don't operate in dialogue. They generate a continuous stream of machine-generated interactions with environments: tool calls, structured API responses, code execution outputs, state observations.
These two regimes are fundamentally different. The paper argues we've been optimizing for the wrong evaluation target.
Dialogue memory: human-to-agent conversation. Natural language. Subjective preferences. Sparse information density. Tested by benchmarks like LoCoMo and LongMemEval.
Agentic memory: agent-to-environment interaction. Machine-generated tokens. Dense objective facts. Causal dependencies between steps. Real tool outputs.
Every entry in AMA-Bench is a recorded agent trajectory paired with factual questions about it. A memory system is given the trajectory (or a compressed version), asked a question, and scored on whether its answer is correct. Simple idea — brutal in practice.
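The scoring loop is easy to sketch. This is a minimal illustration, not the benchmark's actual harness; `BenchEntry`, `ingest`, and `answer` are hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class BenchEntry:
    # Hypothetical schema: one recorded trajectory plus one factual question.
    trajectory: list   # agent steps: tool calls, API responses, observations
    question: str
    answer: str

def evaluate(memory_system, entries) -> float:
    """Score a memory system: ingest each trajectory (possibly compressing
    it), ask the paired question, and count exact-match answers."""
    correct = 0
    for e in entries:
        memory_system.ingest(e.trajectory)       # may summarize, chunk, or index
        pred = memory_system.answer(e.question)  # retrieval + generation
        correct += int(pred.strip().lower() == e.answer.strip().lower())
    return correct / len(entries)
```

The same loop scores a long-context baseline (ingest = keep everything) and a compressing memory system side by side, which is what makes the headline comparison possible.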
The benchmark covers two complementary data sources — one for realism, one for scale:
Real trajectories: actual agent recordings from six task categories, with questions written by human experts against authentic trajectories.
Synthetic trajectories: programmatically generated at arbitrary lengths, with rule-verified QA that enables controlled scaling experiments.
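A toy generator shows why rule verification enables scaling: the answer is derived from the same rules that produced the trajectory, so it can be checked at any length without human labeling. This is a sketch under invented rules, not the benchmark's actual generator:

```python
import random

def synth_trajectory(n_steps: int, seed: int = 0):
    """Generate a synthetic agent trajectory of arbitrary length plus a
    rule-verified QA pair (hypothetical format for illustration)."""
    rng = random.Random(seed)
    state, steps = 0, []
    for i in range(n_steps):
        delta = rng.randint(1, 9)
        state += delta
        steps.append(f"step {i}: add({delta}) -> state={state}")
    question = "What is the final state after the last step?"
    answer = str(state)  # derived from the generation rule, not annotated
    return steps, question, answer
```

Because `n_steps` is a free parameter, the same recipe yields trajectories of 1K or 1M steps with ground truth that stays verifiable by construction.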
The headline finding: most existing memory systems underperform the long-context baseline. Errors from lossy compression and similarity-based retrieval compound over long trajectories. More memory machinery ≠ better recall.
The paper identifies two root causes that explain why bolting existing memory systems onto agentic workflows actually hurts performance.
Summarization and chunking discard the causal structure agents depend on.
If an agent ran search(X), got result Y, and then used Y to call filter(Y, Z), a summary might keep the final result but drop the chain of reasoning that produced it. The agent can no longer explain why it reached a conclusion, and multi-hop questions over the trajectory become unanswerable. Unlike dialogue, where a summary captures the gist, agentic trajectories contain dense, causally linked facts where every link matters.
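A toy illustration of the failure, with made-up step strings, showing how tail-keeping compression severs the chain:

```python
trajectory = [
    "search(X) -> Y",     # step 1 produces intermediate result Y
    "filter(Y, Z) -> R",  # step 2 consumes Y: a causal dependency
    "report(R)",          # final step depends on the whole chain
]

def naive_summary(steps, keep_last=1):
    # Lossy compression: keep only the tail of the trajectory.
    return steps[-keep_last:]

summary = naive_summary(trajectory)
# The final result survives, but "where did R come from?" is now
# unanswerable: the steps that produced R have been discarded.
```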
Embedding-based retrieval finds semantically related but causally irrelevant chunks.
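The mismatch is easy to reproduce with any similarity measure. Here a bag-of-words cosine stands in for embedding similarity, with invented chunks; the point is that ranking by topical similarity surfaces discussion about a query rather than the tool output that actually answers it:

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine: a crude stand-in for embedding similarity."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "discussion of database query performance and tuning",  # topically related
    "step 7: query(users) -> 312 rows",                     # causally relevant
]
query = "discussion about database query performance"
ranked = sorted(chunks, key=lambda c: cosine(query, c), reverse=True)
# Similarity search ranks the discussion first, not the tool output
# that answers "how many rows did the query return?"
```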
Summarization compresses tokens but severs causal chains. Left: the full trajectory with live causal edges. Right: the summarized version — steps merged, links gone.
The paper proposes AMA-Agent with two mechanisms that directly target the identified failure modes. Instead of lossy compression, build a causality graph. Instead of pure similarity search, use tool-augmented hybrid retrieval.
Trajectory: raw stream of actions, observations, tool calls, and state changes.
Causality graph: nodes = objective information units; edges = causal & temporal dependencies.
Graph traversal: follow causal links to find causally relevant context.
Lexical search: lexical matching for specific factual lookups.
Answer generation: merged, deduplicated context → LLM generates the answer.
The causality graph preserves what summarization destroys — the explicit causal chain between agent actions and outcomes. Hybrid retrieval solves the similarity mismatch by traversing causal edges rather than relying on embedding cosine distance. Together they address both failure modes simultaneously.
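A minimal sketch of the graph idea, with hypothetical structure and method names (the paper's actual construction is richer):

```python
from collections import defaultdict

class CausalityGraph:
    """Toy causality graph: nodes hold objective information units from the
    trajectory; edges record which earlier steps' outputs each step consumed."""
    def __init__(self):
        self.nodes = {}                   # node_id -> content
        self.parents = defaultdict(list)  # node_id -> causal antecedents

    def add(self, node_id, content, depends_on=()):
        self.nodes[node_id] = content
        self.parents[node_id].extend(depends_on)

    def causal_context(self, node_id):
        """Walk causal edges backwards to collect every step that contributed
        to this node, preserving the chain a summary would drop."""
        seen, stack = [], [node_id]
        while stack:
            nid = stack.pop()
            if nid in seen:
                continue
            seen.append(nid)
            stack.extend(self.parents[nid])
        return [self.nodes[n] for n in reversed(seen)]
```

Calling `causal_context` on the final step returns the full provenance chain in order, which is exactly what a flat summary discards.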
Three implications for practitioners building agentic systems with long-horizon memory:
Standard chunk-and-retrieve fails when the data is machine-generated trajectories.
The chain of actions→observations→decisions matters more than any individual fact.
Before adding memory infrastructure, try just… extending the context.