The first benchmark that actually tests memory the way agents use it — not chat history, but dense machine-generated trajectories full of tool calls, API responses, and causal chains. Spoiler: everything underperforms.
Existing memory benchmarks test dialogue-centric recall — can a chatbot remember what a human said 300 turns ago? But real-world agents don't operate in dialogue. They generate a continuous stream of machine-generated interactions with environments: tool calls, structured API responses, code execution outputs, state observations.
These two regimes are fundamentally different. The paper argues we've been optimizing for the wrong evaluation target.
Dialogue memory: human-to-agent conversation. Natural language. Subjective preferences. Sparse information density. Tested by benchmarks like LoCoMo and LongMemEval.
Agentic memory: agent-to-environment interaction. Machine-generated tokens. Dense objective facts. Causal dependencies between steps. Real tool outputs.
Every entry in AMA-Bench is a recorded agent trajectory paired with factual questions about it. A memory system is given the trajectory (or a compressed version), asked a question, and scored on whether its answer is correct. Simple idea — brutal in practice.
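The scoring loop is easy to sketch. This is a minimal illustration, not the benchmark's actual harness; `BenchEntry`, `ingest`, and `answer` are hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class BenchEntry:
    # Hypothetical schema: one recorded trajectory plus one factual question.
    trajectory: list   # agent steps: tool calls, API responses, observations
    question: str
    answer: str

def evaluate(memory_system, entries) -> float:
    """Score a memory system: ingest each trajectory (possibly compressing
    it), ask the paired question, and count exact-match answers."""
    correct = 0
    for e in entries:
        memory_system.ingest(e.trajectory)       # may summarize, chunk, or index
        pred = memory_system.answer(e.question)  # retrieval + generation
        correct += int(pred.strip().lower() == e.answer.strip().lower())
    return correct / len(entries)
```

The same loop scores a long-context baseline (ingest = keep everything) and a compressing memory system side by side, which is what makes the headline comparison possible.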
The benchmark covers two complementary data sources — one for realism, one for scale:
Real trajectories: actual agent recordings from six task categories, with questions written by human experts against authentic trajectories.
Synthetic trajectories: programmatically generated at arbitrary lengths, with rule-verified QA that enables controlled scaling experiments.
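A toy generator shows why rule verification enables scaling: the answer is derived from the same rules that produced the trajectory, so it can be checked at any length without human labeling. This is a sketch under invented rules, not the benchmark's actual generator:

```python
import random

def synth_trajectory(n_steps: int, seed: int = 0):
    """Generate a synthetic agent trajectory of arbitrary length plus a
    rule-verified QA pair (hypothetical format for illustration)."""
    rng = random.Random(seed)
    state, steps = 0, []
    for i in range(n_steps):
        delta = rng.randint(1, 9)
        state += delta
        steps.append(f"step {i}: add({delta}) -> state={state}")
    question = "What is the final state after the last step?"
    answer = str(state)  # derived from the generation rule, not annotated
    return steps, question, answer
```

Because `n_steps` is a free parameter, the same recipe yields trajectories of 1K or 1M steps with ground truth that stays verifiable by construction.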
The headline finding: most existing memory systems underperform the long-context baseline. Errors from lossy compression and similarity-based retrieval compound over long trajectories. More memory machinery ≠ better recall.
The paper identifies two root causes that explain why bolting existing memory systems onto agentic workflows actually hurts performance.
Summarization and chunking discard the causal structure agents depend on.
If an agent ran search(X), got result Y, and then used Y to call filter(Y, Z), a summary might keep the final result but drop the chain of reasoning that produced it. The agent can no longer explain why it reached a conclusion, and multi-hop questions over the trajectory become unanswerable. Unlike dialogue, where a summary captures the gist, agentic trajectories contain dense, causally linked facts where every link matters.
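A toy illustration of the failure, with made-up step strings, showing how tail-keeping compression severs the chain:

```python
trajectory = [
    "search(X) -> Y",     # step 1 produces intermediate result Y
    "filter(Y, Z) -> R",  # step 2 consumes Y: a causal dependency
    "report(R)",          # final step depends on the whole chain
]

def naive_summary(steps, keep_last=1):
    # Lossy compression: keep only the tail of the trajectory.
    return steps[-keep_last:]

summary = naive_summary(trajectory)
# The final result survives, but "where did R come from?" is now
# unanswerable: the steps that produced R have been discarded.
```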
Embedding-based retrieval finds semantically related but causally irrelevant chunks.
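The mismatch is easy to reproduce with any similarity measure. Here a bag-of-words cosine stands in for embedding similarity, with invented chunks; the point is that ranking by topical similarity surfaces discussion about a query rather than the tool output that actually answers it:

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine: a crude stand-in for embedding similarity."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "discussion of database query performance and tuning",  # topically related
    "step 7: query(users) -> 312 rows",                     # causally relevant
]
query = "discussion about database query performance"
ranked = sorted(chunks, key=lambda c: cosine(query, c), reverse=True)
# Similarity search ranks the discussion first, not the tool output
# that answers "how many rows did the query return?"
```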
Summarization compresses tokens but severs causal chains. Left: the full trajectory with live causal edges. Right: the summarized version — steps merged, links gone.
The paper proposes AMA-Agent with two mechanisms that directly target the identified failure modes. Instead of lossy compression, build a causality graph. Instead of pure similarity search, use tool-augmented hybrid retrieval.
Trajectory: raw stream of actions, observations, tool calls, and state changes.
Causality graph: nodes = objective information units; edges = causal & temporal dependencies.
Graph traversal: follow causal links to find causally relevant context.
Lexical search: lexical matching for specific factual lookups.
Answer generation: merged, deduplicated context → LLM generates the answer.
The causality graph preserves what summarization destroys — the explicit causal chain between agent actions and outcomes. Hybrid retrieval solves the similarity mismatch by traversing causal edges rather than relying on embedding cosine distance. Together they address both failure modes simultaneously.
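A minimal sketch of the graph idea, with hypothetical structure and method names (the paper's actual construction is richer):

```python
from collections import defaultdict

class CausalityGraph:
    """Toy causality graph: nodes hold objective information units from the
    trajectory; edges record which earlier steps' outputs each step consumed."""
    def __init__(self):
        self.nodes = {}                   # node_id -> content
        self.parents = defaultdict(list)  # node_id -> causal antecedents

    def add(self, node_id, content, depends_on=()):
        self.nodes[node_id] = content
        self.parents[node_id].extend(depends_on)

    def causal_context(self, node_id):
        """Walk causal edges backwards to collect every step that contributed
        to this node, preserving the chain a summary would drop."""
        seen, stack = [], [node_id]
        while stack:
            nid = stack.pop()
            if nid in seen:
                continue
            seen.append(nid)
            stack.extend(self.parents[nid])
        return [self.nodes[n] for n in reversed(seen)]
```

Calling `causal_context` on the final step returns the full provenance chain in order, which is exactly what a flat summary discards.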
Three implications for practitioners building agentic systems with long-horizon memory:
Standard chunk-and-retrieve fails when the data is machine-generated trajectories.
The chain of actions→observations→decisions matters more than any individual fact.
Before adding memory infrastructure, try just… extending the context.