EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Summary

Most AI agent benchmarks test performance in a frozen snapshot of the world, but the real world keeps changing: software gets updated, command-line interfaces evolve, and people's preferences shift over time. EvoArena is a benchmark suite that confronts this problem by modeling each environment as a sequence of progressive updates, asking agents to adapt continuously rather than solve one-shot tasks. It is paired with EvoMem, a patch-based memory paradigm that lets agents remember not just what they know but how and why that knowledge changed, so they can reason about the evolution itself.

How It Works

EvoArena organizes each evaluation environment into a chain of progressively evolving releases. It includes Terminal-Bench-Evo for evolving terminal workflows, SWE-Chain-Evo for evolving codebases, and PersonaMem-Evo for evolving user preferences, evaluating both forward adaptation to new changes and version compatibility with still-valid prior states. This formulation turns environmental change into a measurable capability: an agent must solve the current task, identify which updates matter, and avoid reusing behaviors tied to obsolete versions.
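
As a rough illustration of the chain-of-releases formulation, the sketch below models an environment as an ordered list of releases and labels each task as forward adaptation (it targets the newest release) or version compatibility (it targets an earlier release that is still valid). All names here (`Release`, `Task`, `task_kind`, the `still_valid` flag) and the example changes are hypothetical; they are not taken from the EvoArena codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    version: int               # position in the chain, starting at 0
    changes: tuple             # human-readable descriptions of what changed
    still_valid: bool = True   # hypothetical flag: can tasks still target it?


@dataclass(frozen=True)
class Task:
    name: str
    target_version: int        # the release this task must be solved against


def task_kind(task: Task, chain: list) -> str:
    """Classify a task relative to the newest release in the chain.

    Assumes chain[i].version == i, i.e. versions double as list indices.
    """
    latest = chain[-1].version
    if task.target_version == latest:
        return "forward_adaptation"
    if chain[task.target_version].still_valid:
        return "version_compatibility"
    return "obsolete"


# Made-up example chain: each release records what changed since the last one.
chain = [
    Release(0, ("initial CLI",)),
    Release(1, ("flag --out renamed to --output",)),
    Release(2, ("config file moved to a new location",)),
]
tasks = [Task("export report", 2), Task("export report (legacy)", 1)]
kinds = [task_kind(t, chain) for t in tasks]
print(kinds)
```

Under these assumptions, an agent that reuses the `--out` flag on the first task is applying behavior tied to an obsolete version, while the second task is only solvable if the agent still remembers the release-1 state.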

EvoMem addresses the core memory challenge by augmenting a standard memory system with an append-only patch history that records meaningful memory changes. Each patch stores the pre-update memory, the post-update memory, the update rationale, and supporting evidence from the triggering context, which makes memory evolution traceable. Rather than overwriting old facts with new ones, which can erase prior states that are still relevant, EvoMem preserves the full trail of changes. The agent can then reason about why something changed and whether an older version of that knowledge still applies in a given context.
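
The patch idea can be sketched as a small data structure. The following is a minimal illustration, assuming a simple key-value store as the "standard memory system"; the class names (`Patch`, `PatchMemory`), the methods, and the example facts are all hypothetical and do not reflect the authors' actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Patch:
    key: str
    before: Optional[str]   # pre-update memory (None if the key is new)
    after: str              # post-update memory
    rationale: str          # why the memory changed
    evidence: str           # supporting snippet from the triggering context


@dataclass
class PatchMemory:
    current: dict = field(default_factory=dict)   # latest value per key
    history: list = field(default_factory=list)   # append-only patch log

    def update(self, key, value, rationale, evidence):
        before = self.current.get(key)
        if before == value:
            return None   # nothing changed, so no patch is recorded
        patch = Patch(key, before, value, rationale, evidence)
        self.history.append(patch)   # patches are only ever appended
        self.current[key] = value
        return patch

    def trace(self, key):
        """All patches that touched `key`, oldest first."""
        return [p for p in self.history if p.key == key]

    def value_at(self, key, step):
        """Value of `key` after the first `step` patches in the history."""
        value = None
        for p in self.history[:step]:
            if p.key == key:
                value = p.after
        return value


mem = PatchMemory()
mem.update("export_flag", "--out", "initial observation",
           "help text lists --out")
mem.update("export_flag", "--output", "flag renamed in a later release",
           "changelog: --out renamed to --output")
print(mem.current["export_flag"], [p.before for p in mem.trace("export_flag")])
```

Because the history is never rewritten, the earlier value remains recoverable through `value_at`, which is the property that lets an agent answer questions about a prior release after the memory has moved on.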

Why It Matters

Current agents struggle on EvoArena, achieving an average accuracy of 39.6% across the three evolving domains, which shows how poorly today's systems handle non-stationary conditions. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and improving the standard benchmarks GAIA and LoCoMo by 6.1% and 4.8%, respectively. It also improves chain-level accuracy by 3.7%, indicating that it helps agents complete sequences of related tasks under continuous environment evolution. The mechanistic analysis points the same way: beyond raising scores, EvoMem improves the quality of evidence captured in memory, which suggests more complete preservation of evolving environment states rather than shallow performance gains.
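
This summary does not define chain-level accuracy. One plausible reading, assumed here, is that a chain counts as solved only when every task in it is solved, which makes it a stricter measure than per-task accuracy. The sketch below contrasts the two on made-up outcomes; the function names and the data are illustrative only and are not results from the paper.

```python
def task_accuracy(results):
    """Fraction of individual tasks solved, pooled over all chains."""
    outcomes = [ok for chain in results.values() for ok in chain]
    return sum(outcomes) / len(outcomes)


def chain_accuracy(results):
    """Fraction of chains in which every task was solved (assumed definition)."""
    return sum(all(chain) for chain in results.values()) / len(results)


# Made-up outcomes: one list of per-task pass/fail flags per release chain.
results = {
    "chain_a": [True, True, True],
    "chain_b": [True, False, True],
}
print(task_accuracy(results), chain_accuracy(results))
```

Under this reading, a single failure in the middle of a chain lowers chain-level accuracy much more than task-level accuracy, so a gain on the chain-level measure reflects consistency across consecutive releases.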

Implementations

The authors have released an official open-source implementation at github.com/Aiden0526/EvoArena, the dataset collection on Hugging Face at huggingface.co/collections/Aiden0526/evoarena, and a project page at aiden0526.github.io/EvoArena. The related A-MEM framework is also openly available at github.com/agiresearch/A-mem for those building on agentic memory systems more broadly.

Sources

  1. EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments (arXiv)
  2. EvoArena – Full HTML paper (arXiv)
  3. EvoArena – Hugging Face Paper Page
  4. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory (arXiv)
  5. EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective (arXiv)
  6. GAIA: The LLM Agent Benchmark (Towards Data Science)
  7. LoCoMo: Long-Term Conversational Benchmark (Emergent Mind)
  8. A-MEM: Agentic Memory for LLM Agents (GitHub)