Zhao et al. · arXiv 2602.22769 · Mar 1, 2026

AMA-Bench: Why Your Agent's Memory Is Broken

The first benchmark that actually tests memory the way agents use it — not chat history, but dense machine-generated trajectories full of tool calls, API responses, and causal chains. Spoiler: everything underperforms.

Tags: Agent Memory · Long-Horizon · Benchmark · RAG · Causality Graphs
01 — The Gap
We've been evaluating the wrong kind of memory

Existing memory benchmarks test dialogue-centric recall — can a chatbot remember what a human said 300 turns ago? But real-world agents don't operate in dialogue. They generate a continuous stream of machine-generated interactions with environments: tool calls, structured API responses, code execution outputs, state observations.

These two regimes are fundamentally different. The paper argues we've been optimizing for the wrong evaluation target.

Existing Benchmarks

Dialogue Memory

Human-to-agent conversation. Natural language. Subjective preferences. Sparse information density. Tests like LoCoMo, LongMemEval.

AMA-Bench

Agentic Memory

Agent-to-environment interaction. Machine-generated tokens. Dense objective facts. Causal dependencies between steps. Real tool outputs.

What an agent trajectory looks like: Action → Observation → Tool Call → State Change (dense, causal, machine-generated)
02 — The Benchmark
Can your memory system answer questions about what actually happened?

Every entry in AMA-Bench is a recorded agent trajectory paired with factual questions about it. A memory system is given the trajectory (or a compressed version), asked a question, and scored on whether its answer is correct. Simple idea — brutal in practice.

1. 📝 Trajectory: the agent runs a task; every action, tool call, and observation is logged.
2. 🧠 Memory System: ingests and compresses the trajectory via RAG, summarization, or a graph.
3. Question: a factual question about what happened, requiring retrieval or recall.
4. Score: the answer is judged ✓/✗ against ground truth and averaged across all QA pairs.
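The scoring loop above can be sketched in a few lines. This is a minimal illustration with hypothetical method names (`ingest`, `answer`); the paper's actual harness, prompts, and judging protocol are not shown here.

```python
# Minimal sketch of the evaluation loop (hypothetical method names;
# the paper's actual harness, prompts, and judging protocol differ).

def evaluate(memory_system, entries):
    """Average QA accuracy of a memory system over benchmark entries."""
    correct, total = 0, 0
    for entry in entries:
        # 1. Ingest: the system compresses/indexes the raw trajectory.
        memory_system.ingest(entry["trajectory"])
        for qa in entry["questions"]:
            # 2. Answer a factual question about what happened.
            answer = memory_system.answer(qa["question"])
            # 3. Judge against ground truth (exact match here; the real
            #    benchmark uses human-written or rule-verified QA).
            correct += int(answer.strip() == qa["ground_truth"].strip())
            total += 1
    return correct / total
```

Any memory system that exposes the two methods, from full-context prompting to a causality graph, can be dropped into the same loop.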
A real benchmark question

filter(price < $80)      → 12 results
apply_coupon("SAVE10")   → -$8.00 applied  ▲ key step
checkout.init()          → subtotal: $71.99
payment.process()        → order confirmed

Question: "What was the subtotal after the coupon was applied?"

✓ Correct answer: $71.99. Full context preserved the coupon step.
✗ RAG / summarization: $79.99. The coupon step was compressed away, so the model sees the pre-discount figure.

The benchmark covers two complementary data sources — one for realism, one for scale:

🌍 Real-World

Actual agent recordings from six task categories. Questions written by human experts against authentic trajectories.

Web · QA · Text2SQL · SWE · Gaming · Embodied AI

🧪 Synthetic

Programmatically generated at arbitrary lengths. Rule-verified QA enables controlled scaling experiments.

Any length · Auto-verified · Stress-test
two subsets, one verdict
03 — The Results
Even frontier models struggle. Memory systems make it worse.

The headline finding: most existing memory systems underperform the long-context baseline. Errors from lossy compression and similarity-based retrieval compound over long trajectories. More memory machinery ≠ better recall.

GPT 5.2 accuracy (best overall): 72.3%
AMA-Agent accuracy (best memory system)
Improvement over strongest baseline
Long-context: Entire trajectory fed directly into model. Simple but expensive.
Memory systems: RAG, BM25, summarization, graph memory. Add infrastructure to manage long context.
Proposed (AMA-Agent): Causality graph + hybrid retrieval. The paper's contribution.
Avg. Accuracy on AMA-Bench
why does memory fail?
04 — Why Memory Fails
Two compounding failure modes

The paper identifies two root causes that explain why bolting existing memory systems onto agentic workflows actually hurts performance.

🚤

Lossy Memory Compression

Summarization and chunking discard the causal structure agents depend on.

When you summarize an agent trajectory, you lose the causal dependencies between steps. If the agent ran search(X), got result Y, then used Y to call filter(Y, Z), a summary might keep the final result but drop the chain of reasoning that produced it. The agent can no longer explain why it reached a conclusion, and multi-hop questions over the trajectory become unanswerable. Unlike dialogue — where a summary captures the gist — agentic trajectories contain dense, causally-linked facts where every link matters.
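A toy sketch of that loss, with an invented trajectory, a deliberately lossy "keep only the outcome" summarizer, and a crude evidence check:

```python
# Toy sketch: a keep-the-outcome summary makes a multi-hop question
# unanswerable (trajectory, summarizer, and checker are all invented).

trajectory = [
    'search("X")',            # the step that produced Y
    "obs: result Y",
    "filter(Y, Z)",           # uses Y from the previous step
    "obs: 3 items match Z",   # final outcome
]

def summarize_last(steps: list) -> list:
    """A lossy summarizer that keeps only the final outcome."""
    return steps[-1:]

def answerable(context: list, needed: str) -> bool:
    """Is the evidence for a question still present in the context?"""
    return any(needed in step for step in context)

# "Which call produced Y?" needs the search step, not the outcome.
print(answerable(trajectory, "search"))                  # -> True
print(answerable(summarize_last(trajectory), "search"))  # -> False
```

The final fact survives compression; the chain that explains it does not.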
🎯

Similarity-Based Retrieval Mismatch

Embedding-based retrieval finds semantically related but causally irrelevant chunks.

RAG finds chunks that are semantically similar to a query. But in agentic workflows, the relevant context is often causally related, not semantically similar. An error in step 14 might be caused by a state change in step 3 — but the embeddings of those two steps look nothing alike. Meanwhile, step 22 might share keywords but belong to a completely different causal chain. Traditional RAG retrieves step 22 and misses step 3. Every wrong retrieval pollutes the context window with noise.
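A toy illustration of the mismatch, using word overlap as a stand-in for embedding similarity; the step contents are invented for the example:

```python
# Toy illustration of the retrieval mismatch: lexical overlap stands
# in for embedding similarity, and the step contents are invented.

def similarity(a: str, b: str) -> float:
    """Crude word-overlap (Jaccard) as a proxy for embedding cosine."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

steps = {
    3:  "state change: session currency set to EUR",           # the root cause
    14: "error: payment.process failed, amount mismatch",      # the symptom
    22: "tool call: payment.process retried with same amount", # keyword lookalike
}

query = "why did payment.process fail with an amount mismatch"

ranked = sorted(steps, key=lambda i: similarity(query, steps[i]), reverse=True)
print(ranked)  # -> [14, 22, 3]: the causal root, step 3, ranks dead last
```

The keyword lookalike outranks the actual cause because nothing in the query's wording resembles a currency change; only a causal edge connects them.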
Before vs. After Summarization

Summarization compresses tokens but severs causal chains: first the full trajectory with live causal edges, then the summarized version with steps merged and links gone.

Before Summarization:
search("red shoes") → obs: 47 results → filter(price < 80) → obs: 12 results → compare(id:3, id:7) → cart.add(id:3)
(every arrow is a live causal edge)

After Summarization:
search("red shoes"), obs: 47 results, filter(price < 80), obs: 12 results (causal links lost), collapsed into:
Summary: agent added product #3 to cart
why? unknown.
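By contrast, a minimal sketch of storing the same trajectory as an explicit causal graph, so "why?" stays answerable by walking edges backwards. The dict encoding is an illustration, not the paper's format:

```python
# Sketch: the same trajectory kept as a causal graph instead of a
# summary (edge encoding is illustrative, not the paper's format).

causes = {  # edge: step -> the step that caused it
    "obs: 47 results":     'search("red shoes")',
    "filter(price < 80)":  "obs: 47 results",
    "obs: 12 results":     "filter(price < 80)",
    "compare(id:3, id:7)": "obs: 12 results",
    "cart.add(id:3)":      "compare(id:3, id:7)",
}

def why(step: str) -> list:
    """Trace the causal chain that led to `step`, root first."""
    chain = [step]
    while chain[-1] in causes:
        chain.append(causes[chain[-1]])
    return list(reversed(chain))

print(why("cart.add(id:3)"))
# The full chain from search to cart.add survives -- exactly the
# structure the flat summary discards.
```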
a better approach
05 — AMA-Agent
Causality graphs + hybrid retrieval

The paper proposes AMA-Agent with two mechanisms that directly target the identified failure modes. Instead of lossy compression, build a causality graph. Instead of pure similarity search, use tool-augmented hybrid retrieval.

1. Agent Trajectory Input: raw stream of actions, observations, tool calls, state changes.
2. Causality Graph Construction: nodes = objective information units; edges = causal & temporal dependencies.
3. Graph Node Search: traverse causal links to find causally relevant context.
4. BM25 Keyword Search: lexical matching for specific factual lookups.
5. Hybrid Context Assembly: merged, deduplicated context → LLM generates answer.
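A minimal sketch of the hybrid retrieval stage, assuming a prebuilt causality graph. The paper's actual node scoring, BM25 implementation, and deduplication policy are not specified here; this shows only the shape of the union.

```python
# Minimal sketch of hybrid retrieval over a prebuilt causality graph
# (node scoring, real BM25, and dedup policy are simplified away).

def graph_neighbors(graph: dict, seeds: set) -> set:
    """Expand seed nodes one hop along causal edges, both directions."""
    hits = set(seeds)
    for src, dsts in graph.items():
        for dst in dsts:
            if src in seeds:
                hits.add(dst)   # follow cause -> effect
            if dst in seeds:
                hits.add(src)   # follow effect -> cause
    return hits

def keyword_hits(nodes: dict, query: str) -> set:
    """Crude lexical match standing in for BM25 keyword search."""
    terms = set(query.lower().split())
    return {n for n, text in nodes.items()
            if terms & set(text.lower().split())}

def hybrid_retrieve(nodes: dict, graph: dict, query: str, seeds: set) -> list:
    """Union of causal expansion and keyword search, deduplicated."""
    chosen = graph_neighbors(graph, seeds) | keyword_hits(nodes, query)
    return [nodes[n] for n in sorted(chosen)]
```

Here `seeds` would be the initial query-matched nodes; the causal expansion pulls in context that keyword or embedding similarity alone would miss, which is precisely the failure mode from section 04.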

Interactive: Build a causality graph (node types: Action · Observation · Tool Call · State Change)
Key Insight

The causality graph preserves what summarization destroys — the explicit causal chain between agent actions and outcomes. Hybrid retrieval solves the similarity mismatch by traversing causal edges rather than relying on embedding cosine distance. Together they address both failure modes simultaneously.

06 — So What
What this means if you're building agents

Three implications for practitioners building agentic systems with long-horizon memory:

1

Don't assume RAG transfers to agentic workflows

Standard chunk-and-retrieve fails when the data is machine-generated trajectories.

RAG was designed for document retrieval — finding relevant passages from natural language text. Agent trajectories are structurally different: dense, objective, causally linked, and full of machine-generated tokens (JSON, SQL, API responses). If you're plugging off-the-shelf RAG into your agent's memory pipeline, you're almost certainly losing information that matters. At minimum, benchmark it against simply extending the context window.
2

Preserve causal structure, not just content

The chain of actions→observations→decisions matters more than any individual fact.

When an agent browses products, filters by price, compares reviews, and adds to cart — that's a causal chain. If your memory system stores the final selection but drops the reasoning path, the agent can't explain its choice, adapt to constraint changes, or answer "why" questions. Memory systems for agents need to be graph-structured and causality-aware, not flat key-value stores or embedding databases.
3

Long context is a stronger baseline than you think

Before adding memory infrastructure, try just… extending the context.

GPT 5.2's 400K context window at 72.3% accuracy outperforms every memory system tested. That doesn't mean context windows are the final answer — they have physical limits and cost/latency implications. But adding a memory layer that reduces accuracy below the long-context baseline is worse than doing nothing. Benchmark your memory system against the "just extend context" baseline before shipping it.