1. Introduction

Large language model-based agents have demonstrated remarkable capabilities on short-horizon tasks, yet their utility in production environments is constrained by the finite nature of the transformer context window. Enterprise workflows routinely involve thousands of sequential decisions, cross-session continuity, and the need to reference observations made hours or days prior. Existing approaches, such as summarisation-based compression or retrieval-augmented generation, partially address this limitation but introduce information loss and retrieval latency that compound over long task horizons.

We present a memory-augmented agent architecture in which the agent maintains a structured external memory store alongside its active context. Memory operations (write, read, search, forget) are exposed as first-class tools available to the agent at each step, allowing the model to manage its own memory budget and retrieval strategy. This design separates the concerns of working memory (the active context) from long-term memory (the persistent store), mirroring how humans offload information to external aids during complex tasks.

2. Architecture

The core of our system is a memory controller that sits between the agent runtime and the underlying LLM. At each inference step, the controller injects a compact memory summary into the system prompt and makes four memory tools available: mem_write(key, value, ttl?), mem_read(key), mem_search(query, top_k), and mem_forget(key).
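One plausible encoding of the four tools, sketched here in the JSON function-calling format used by several LLM APIs. The tool names and parameter names follow the signatures above; the descriptions and schema layout are illustrative assumptions, not the system's actual wire format.

```python
# Illustrative tool schemas for the four memory operations. Only the tool
# and parameter names come from the text; everything else is assumed.
MEMORY_TOOLS = [
    {
        "name": "mem_write",
        "description": "Persist a value under a key; optional ttl in seconds.",
        "parameters": {
            "type": "object",
            "properties": {
                "key": {"type": "string"},
                "value": {"type": "string"},
                "ttl": {"type": "number"},  # optional, per the ttl? signature
            },
            "required": ["key", "value"],
        },
    },
    {
        "name": "mem_read",
        "description": "Return the value stored under a key, if any.",
        "parameters": {
            "type": "object",
            "properties": {"key": {"type": "string"}},
            "required": ["key"],
        },
    },
    {
        "name": "mem_search",
        "description": "Semantic search over stored memory entries.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "top_k": {"type": "integer"},
            },
            "required": ["query"],
        },
    },
    {
        "name": "mem_forget",
        "description": "Delete the entry stored under a key.",
        "parameters": {
            "type": "object",
            "properties": {"key": {"type": "string"}},
            "required": ["key"],
        },
    },
]
```

Because the tools are ordinary function-calling declarations, the memory controller only needs to intercept the resulting tool calls; the model itself never sees the store's internals.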

The memory store is a hybrid structure combining a key-value index for deterministic lookups with a dense vector index for semantic retrieval. Writes are synchronous and durable; reads are cached at the session level to avoid redundant vector queries. The agent learns to use memory tools through standard instruction following — no fine-tuning is required, making the system compatible with any frontier model exposed via API.

Memory entries carry optional time-to-live (TTL) values, enabling the agent to express the expected relevance horizon of a given piece of information. Entries without TTLs persist indefinitely and are subject to a background compaction process that merges semantically similar records.
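One pass of the compaction process can be sketched as follows. This is a simplification under stated assumptions: `difflib`'s string similarity stands in for the semantic similarity the system computes over embeddings, the merge policy (keep the longer record) is illustrative, and the function name is hypothetical.

```python
import difflib


def compact(entries: dict[str, str], threshold: float = 0.8) -> dict[str, str]:
    """One background-compaction pass: drop near-duplicate records,
    keeping the longer of each similar pair. String similarity is a
    stand-in for embedding-based semantic similarity."""
    keys = list(entries)
    dropped: set[str] = set()
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if a in dropped or b in dropped:
                continue
            sim = difflib.SequenceMatcher(None, entries[a], entries[b]).ratio()
            if sim >= threshold:
                # Keep the longer record on the assumption it subsumes the other.
                drop = a if len(entries[a]) < len(entries[b]) else b
                dropped.add(drop)
    return {k: v for k, v in entries.items() if k not in dropped}
```

A real implementation would also merge rather than discard, and would run incrementally against the vector index instead of pairwise over all entries.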

3. Experiments

We evaluated memory-augmented agents on three benchmarks: AgentBench-Extended (a 500-step multi-tool task suite), CrossSession-QA (a question-answering task requiring information from prior sessions), and EnterpriseWorkflow-1K (a proprietary internal benchmark of 1,000-step business process automations).

On AgentBench-Extended, our memory-augmented agent achieved a task completion rate of 78.4%, compared to 51.2% for a context-compression baseline and 34.7% for a vanilla agent with no memory mechanism. On CrossSession-QA, the memory-augmented agent answered 91.3% of questions correctly, versus 23.1% for the context-compression baseline (which cannot access prior sessions at all).

Token expenditure per completed task decreased by 62% on average relative to naive context accumulation, as agents learned to offload stable facts to memory rather than carrying them forward in the active window.

4. Emergent Planning Behaviours

A surprising finding of our study was the emergence of proactive memory management strategies that were not explicitly trained for. Agents operating on long-horizon tasks began pre-emptively writing intermediate results to memory before they were needed downstream, a form of anticipatory offloading. In 34% of observed task traces, the agent wrote a memory entry that was later retrieved in a step where, without that entry, the task would have failed.

We also observed agents using mem_search to cross-reference current observations against historical patterns, identifying recurring error modes and self-correcting earlier than context-only counterparts. These behaviours suggest that access to persistent memory changes the agent’s effective problem-solving strategy, not merely its recall capacity.

5. Limitations and Future Work

The current system relies on the agent’s own judgment for memory management, which introduces the risk of unbounded memory growth in open-ended deployments. Future work will explore learned compaction policies and memory budget constraints that the agent must optimise within.

We also note that retrieval quality is sensitive to the agent’s query formulation — poorly specified mem_search queries can return irrelevant results, leading to context pollution. We are investigating retrieval-aware training signals that reward precise memory access patterns.

Finally, multi-agent settings, where multiple agents share a common memory store, present coordination challenges that remain unaddressed. We plan to extend this work to shared-memory multi-agent architectures in a follow-up paper.

6. Conclusion

We have presented a memory-augmented agent architecture that decouples working memory from model context, enabling coherent long-horizon task completion at substantially reduced token cost. The system requires no model fine-tuning and is compatible with any API-accessible frontier model. Emergent planning behaviours observed in our experiments suggest that persistent memory is not merely a storage mechanism but a qualitative shift in how agents approach complex, multi-step problems. We release our benchmark suite and evaluation harness to support further research in this direction.