EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Summary

Most AI agent benchmarks test performance against a frozen snapshot of the world, but real environments keep changing: software is updated, command-line interfaces evolve, and people's preferences shift. EvoArena is a benchmark suite that addresses this gap by modeling each environment as a sequence of progressive updates, so agents must adapt continuously rather than solve one-shot tasks. It is paired with EvoMem, a patch-based memory paradigm that records not only what an agent knows but also how and why that knowledge changed, so the agent can reason about the evolution itself.

How It Works

EvoArena organizes each evaluation environment into a chain of progressively evolving releases. It includes Terminal-Bench-Evo for evolving terminal workflows, SWE-Chain-Evo for evolving codebases, and PersonaMem-Evo for evolving user preferences, evaluating both forward adaptation to new changes and version compatibility with still-valid prior states. This formulation turns environmental change into a measurable capability: an agent must solve the current task, identify which updates matter, and avoid reusing behaviors tied to obsolete versions.
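As a rough illustration of the release-chain idea, the Python sketch below replays a chain of releases to recover the environment state at any version. The `Release` structure, the `state_at` helper, and the toy command names are all assumptions made for this example; they are not taken from the EvoArena code or datasets.

```python
from dataclasses import dataclass


@dataclass
class Release:
    """One step in an evolving environment (hypothetical structure)."""
    version: int
    # Changes introduced by this release: key -> new value, or None if removed.
    changes: dict


def state_at(chain, version):
    """Replay releases in order, up to and including `version`."""
    state = {}
    for release in chain:
        if release.version > version:
            break
        for key, value in release.changes.items():
            if value is None:
                state.pop(key, None)  # this release removed the behavior
            else:
                state[key] = value    # this release added or changed it
    return state


# Toy chain for an imaginary CLI; every command name here is invented.
chain = [
    Release(1, {"list": "tool ls", "deploy": "tool deploy --env"}),
    Release(2, {"deploy": "tool release --target"}),  # deploy command changed
    Release(3, {"list": None, "show": "tool show"}),  # list removed, show added
]

latest = state_at(chain, 3)  # forward adaptation: what is valid now
legacy = state_at(chain, 1)  # version compatibility: what was valid at v1
```

In this toy setup, a forward-adaptation task would be checked against `latest`, while a version-compatibility task pinned to version 1 would be checked against `legacy`, where the older `deploy` command is still the correct answer.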

EvoMem targets the memory side of this problem. It augments a standard memory system with an append-only patch history that records meaningful memory changes. Each patch stores the pre-update memory, the post-update memory, the update rationale, and supporting evidence from the triggering context, which makes memory evolution traceable. Rather than overwriting old facts with new ones, which can erase prior states that are still relevant, EvoMem preserves the trail of changes. The agent can then reason about why something changed and whether an older version of that knowledge still applies in a given context.
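The patch record described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Patch` and `PatchedMemory` names, the key-value memory layout, and the `as_of` lookup are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Patch:
    """One recorded memory change (field names are assumptions)."""
    key: str
    before: Optional[str]  # pre-update memory, None if the entry is new
    after: Optional[str]   # post-update memory
    rationale: str         # why the memory changed
    evidence: str          # supporting text from the triggering context


class PatchedMemory:
    """A key-value memory plus an append-only list of patches."""

    def __init__(self):
        self._current = {}
        self._patches = []  # only ever appended to, never edited

    def update(self, key, value, rationale, evidence):
        before = self._current.get(key)
        if before == value:
            return None  # nothing changed, so no patch is recorded
        patch = Patch(key, before, value, rationale, evidence)
        self._patches.append(patch)
        self._current[key] = value
        return patch

    def current(self, key):
        return self._current.get(key)

    def history(self, key):
        """All patches that touched `key`, oldest first."""
        return [p for p in self._patches if p.key == key]

    def as_of(self, key, patch_count):
        """Value of `key` after only the first `patch_count` patches."""
        value = None
        for p in self._patches[:patch_count]:
            if p.key == key:
                value = p.after
        return value


memory = PatchedMemory()
memory.update("deploy_cmd", "tool deploy --env",
              "initial observation", "README for v1")
memory.update("deploy_cmd", "tool release --target",
              "v2 renamed the command", "changelog entry for v2")
```

Because the history is never overwritten, `memory.current("deploy_cmd")` returns the latest value while `memory.as_of("deploy_cmd", 1)` still recovers the earlier one, and `memory.history("deploy_cmd")` exposes the rationale and evidence behind each change.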

Why It Matters

Current agents struggle on EvoArena, achieving an average accuracy of 39.6% across the three evolving domains, which shows how poorly today's systems handle non-stationary conditions. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and improving the standard benchmarks GAIA and LoCoMo by 6.1% and 4.8%, respectively. It also improves chain-level accuracy by 3.7%, which indicates that it helps agents complete sequences of related tasks under continuous environment evolution. The mechanistic analysis points the same way: beyond raising scores, EvoMem improves the quality of evidence captured in memory, which suggests more complete preservation of evolving environment states rather than shallow performance gains.
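The summary does not define chain-level accuracy, so the sketch below rests on an assumption: a chain counts as solved only if every task in it is solved. Under that reading, the Python example contrasts chain-level accuracy with plain per-task accuracy. The function names and the toy results are hypothetical and are not drawn from the paper.

```python
def task_accuracy(chains):
    """Fraction of individual tasks solved, pooled over all chains."""
    results = [ok for chain in chains for ok in chain]
    return sum(results) / len(results)


def chain_accuracy(chains):
    """Fraction of chains in which every task was solved (assumed definition)."""
    return sum(all(chain) for chain in chains) / len(chains)


# Toy outcomes: each inner list is one chain of related tasks, in order.
chains = [
    [True, True, True],
    [True, False, True],
    [True, True, False, True],
]

per_task = task_accuracy(chains)    # 8 of 10 tasks solved
per_chain = chain_accuracy(chains)  # 1 of 3 chains fully solved
```

Under this assumed definition a single failure anywhere in a chain zeroes out that chain, so chain-level accuracy can sit well below per-task accuracy, which is why it is a stricter signal of sustained adaptation.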

Related Work

GAIA Benchmark: GAIA (General AI Assistants), developed by Meta and Hugging Face, evaluates fundamental agent capabilities through multi-step, tool-mediated tasks such as web navigation, retrieval, and synthesis, skills essential for coordinated agent workflows. The EvoArena authors use GAIA as cross-benchmark validation for EvoMem.
LoCoMo: LoCoMo is a multimodal benchmark assessing long-term conversational memory in language models, featuring detailed event grounding, multi-session dialogues, and human verification. EvoMem's gains on LoCoMo show its value extends beyond explicitly evolving settings.
Evo-Memory (arXiv 2511.20857): A complementary streaming benchmark and framework for evaluating self-evolving memory in LLM agents, focusing on test-time learning where agents must retrieve, update, and refine knowledge across continuous task streams.
EvoMemBench (arXiv 2605.18421): Another recent benchmark evaluating agent memory from a self-evolving perspective, noting that backbone LLMs remain essentially stateless by default and cannot natively maintain persistent internal knowledge.
A-MEM: A-MEM (Agentic Memory for LLM Agents) is an open-source framework supporting continuous memory evolution and refinement through agent-driven, adaptive memory management.

Implementations

An official open-source implementation is available. The authors have released the code at github.com/Aiden0526/EvoArena, the dataset collection on HuggingFace at huggingface.co/collections/Aiden0526/evoarena, and a project page at aiden0526.github.io/EvoArena. The related A-MEM framework is also openly available at github.com/agiresearch/A-mem for those building on agentic memory systems more broadly.

Applications

DevOps and SRE agents: Agents managing ever-changing terminal commands, scripts, and infrastructure workflows where API surfaces and tooling evolve continuously.
AI software engineering assistants: Coding agents (in the spirit of SWE-bench) that must track codebase refactors, API deprecations, and library updates across project lifetimes.
Personalized AI companions: Conversational agents that must track shifting user preferences, relationship context, and evolving personal situations over months or years.
Enterprise knowledge management: Agents grounded in organizational policies, compliance rules, or product specifications that are regularly updated — where knowing what changed and when is as important as knowing the current state.