ReasoningBank memory framework for AI agents: Agent memory and workflow automation
Most AI agents today suffer from a fundamental amnesia problem that breaks long-horizon workflows. Because agents forget past failures and strategies, they repeat mistakes across tasks. The ReasoningBank memory framework for AI agents tackles this directly and turns forgetting into a learning signal.
ReasoningBank stores distilled reasoning as compact memory items. It uses memory retrieval, memory extraction, and memory consolidation to compress experience. As a result, agents work with human-readable title, description, and content entries rather than raw action logs. This structure enables fast embedding-based similarity search with cosine ranking at test time.
Moreover, ReasoningBank treats failure as a teacher. An LLM-as-a-Judge labels trajectories as Success or Failure. The system then extracts reusable strategies and checklists and consolidates them back into the JSON memory store. Agents therefore refine their exploration without weight updates, which yields emergent test-time learning dynamics.
In short, ReasoningBank provides workflow memory and trajectory memory that scale test-time performance. For automation engineers and researchers, it offers a practical path to robust agents, because memory quality matters more than quantity.
ReasoningBank memory framework for AI agents: three-stage loop
ReasoningBank implements a concise three-stage memory loop. First, agents retrieve relevant experiences. Next, the system extracts reusable strategies. Finally, it consolidates distilled lessons back into storage.
Memory retrieval
Retrieval uses embedding-based similarity search and top-k retrieval. By default, k equals one to prioritize precision. The system searches a JSON store with precomputed embeddings and ranks items by cosine similarity. As a result, agents receive a single, high-quality memory item that fits the task context.
Key retrieval steps
- Compute embedding for current trajectory or prompt
- Query JSON store with vector similarity search using cosine distance
- Return top-k results, typically k=1 for optimal signal
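The retrieval steps above can be sketched in a few lines. This is a minimal illustration, not the framework's actual implementation: the store schema (`title`, `embedding` fields) and the `retrieve` helper are assumptions for the example.

```python
import json
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(store, query_embedding, k=1):
    # Rank memory items by cosine similarity to the query embedding
    # and return the top-k; k=1 by default to prioritize precision.
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(item["embedding"], query_embedding),
        reverse=True,
    )
    return ranked[:k]

# Toy JSON store with precomputed embeddings (hypothetical schema).
store = json.loads("""[
  {"title": "Check login state first", "embedding": [0.9, 0.1, 0.0]},
  {"title": "Paginate before filtering", "embedding": [0.1, 0.8, 0.3]}
]""")

best = retrieve(store, [0.85, 0.15, 0.05], k=1)
```

A production system would swap the toy vectors for real embedding-model outputs, but the ranking logic stays the same.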
Memory extraction
An LLM-as-a-Judge reviews each trajectory and emits a Success or Failure label. The system then extracts why the trajectory succeeded or failed, compressing raw traces into human-interpretable strategies and checklists. These outputs reduce noise and enable recomposition across tasks.
Extraction produces
- A concise title capturing the strategy
- A short description highlighting the intent
- Content with procedural steps or reflective rules
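A memory item with these three fields might look like the sketch below. The `MemoryItem` class, the `judge` interface, and the stub verdict logic are illustrative assumptions; a real system would prompt an LLM for the judgment and the distilled strategy.

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    # The three human-readable fields described above.
    title: str        # concise strategy name
    description: str  # intent of the strategy
    content: str      # procedural steps or reflective rules
    outcome: str = "Success"  # judge verdict: "Success" or "Failure"

def extract_item(trajectory, judge):
    # `judge` stands in for an LLM-as-a-Judge call (hypothetical interface):
    # it returns a verdict plus a distilled strategy for the trajectory.
    verdict, title, description, content = judge(trajectory)
    return MemoryItem(title=title, description=description,
                      content=content, outcome=verdict)

# Stub judge for illustration; a real system would call an LLM here.
def stub_judge(trajectory):
    verdict = "Success" if trajectory["reached_goal"] else "Failure"
    return (verdict,
            "Verify form fields",
            "Avoid submitting incomplete forms",
            "1. Fill all required fields. 2. Re-check before submit.")

item = extract_item({"reached_goal": False}, stub_judge)
```

Note that failed trajectories still produce items: the Failure label is exactly what lets later tasks avoid the same mistake.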
Memory consolidation
Consolidation merges new items into the JSON store. It deduplicates, updates embeddings, and organizes items by metadata. Consequently, the memory bank evolves from simple checklists to adaptive self-reflections. Importantly, consolidation happens without model weight updates, enabling emergent test-time learning dynamics.
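The consolidation step can be sketched as a merge keyed on item identity, with embeddings recomputed as items land in the store. The dedup-by-title rule and the `toy_embed` function are simplifying assumptions for the example, not the framework's actual policy.

```python
def consolidate(store, new_items, embed):
    # Merge new memory items into the store, deduplicating by title
    # and (re)computing embeddings so retrieval stays fast.
    # `embed` is a stand-in for a real embedding model.
    by_title = {item["title"]: item for item in store}
    for item in new_items:
        item = dict(item)
        item["embedding"] = embed(item["content"])
        by_title[item["title"]] = item  # newer item replaces a duplicate
    return list(by_title.values())

# Toy embedding: vowel-frequency vector (illustration only).
def toy_embed(text):
    return [text.count(c) / max(len(text), 1) for c in "aeiou"]

store = [{"title": "A", "content": "retry on timeout",
          "embedding": toy_embed("retry on timeout")}]
new = [{"title": "A", "content": "retry once on timeout"},
       {"title": "B", "content": "cache results"}]
store = consolidate(store, new, toy_embed)
```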
The table below summarizes the reported results, comparing no-memory baselines with ReasoningBank and MaTTS across benchmarks:
| Dataset | Model | No-memory (Success / Resolve) | With ReasoningBank / MaTTS (Success / Resolve) | Step efficiency (fewer steps) | Notes |
|---|---|---|---|---|---|
| WebArena (general) | Gemini-2.5-Pro | 46.7% SR (no-memory) | 56.3% SR (MaTTS + ReasoningBank) | N/A | MaTTS uses diverse exploration and test-time scaling |
| WebArena | Gemini-2.5-Flash | 40.5% SR (no-memory) | 48.8% SR (ReasoningBank) | up to 1.6 fewer steps | Clear improvement in both SR and efficiency |
| WebArena-Shopping | Gemini-2.5-Pro (scaling) | 54.5% SR (sequential) | 55.1% SR (parallel, k=5) | N/A | Parallel scaling slightly outperforms sequential |
| Mind2Web | Various | Baseline varies; no aggregate given | Notable gains in cross-domain settings | N/A | AWM can degrade in some settings; ReasoningBank improves cross-domain |
| SWE-Bench-Verified | Gemini-2.5-Pro | 54.0% resolve (no-memory) | 57.4% resolve (ReasoningBank) | N/A | Robust gains without model weight updates |
| SWE-Bench-Verified | Gemini-2.5-Flash | Baseline not specified | 38.8% resolve | 2.8 fewer steps | Efficiency gains observed with ReasoningBank |
MaTTS and the ReasoningBank memory framework for AI agents: test-time scaling
MaTTS integrates with ReasoningBank to enable adaptive test-time compute scaling. It orchestrates diverse exploration trajectories and uses their outcomes as contrastive signals. As a result, ReasoningBank gains stronger memories that guide future exploration.
MaTTS supports parallel scaling and sequential scaling modes for compute. In parallel scaling, the system launches multiple diverse trajectories concurrently. For example, parallel scaling with k=5 yields 55.1% success rate on WebArena-Shopping. Sequential scaling runs trajectories one after another and achieved 54.5% in the same setup.
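The two scaling modes differ only in how trajectories see each other, which a short sketch can make concrete. Everything here is an illustrative assumption: `rollout` stands in for an agent run, and the stub's "later attempts do better" behavior is a toy model, not a claim about MaTTS internals.

```python
def parallel_scale(task, rollout, k=5):
    # Launch k diverse trajectories for the same task; a real system
    # would run these concurrently with different sampling seeds.
    # Successes and failures together serve as contrastive signals.
    return [rollout(task, seed=i) for i in range(k)]

def sequential_scale(task, rollout, k=5):
    # Run trajectories one after another, letting each attempt
    # condition on the previous ones (hypothetical refinement loop).
    history = []
    for i in range(k):
        history.append(rollout(task, seed=i, history=list(history)))
    return history

def stub_rollout(task, seed=0, history=None):
    # Toy model: attempts improve with more prior context
    # (sequential mode) or with higher seed diversity (parallel mode).
    signal = len(history) if history is not None else seed
    return {"task": task, "success": signal >= 2}

par = parallel_scale("checkout", stub_rollout, k=5)
seq = sequential_scale("checkout", stub_rollout, k=5)
```

Either way, the resulting mix of successful and failed trajectories feeds the extraction stage as contrastive examples.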
The pipeline creates a feedback loop of self-contrast and self-refinement. First, diverse trajectories provide negative and positive examples for memory extraction. Then, the system consolidates distilled strategies into the JSON store via embedding updates. Consequently agents reuse higher-quality memory items without changing model weights.
This loop yields emergent behaviors similar to reinforcement learning. Crucially, these behaviors appear entirely at test time and require no weight updates. For instance, MaTTS pushes WebArena success rates from 46.7% to 56.3% with Gemini-2.5-Pro. Therefore teams can improve agents through memory and compute orchestration instead of retraining.
Practically, MaTTS and ReasoningBank reduce step counts and increase resolve percentages. Moreover, they prioritize memory quality via top-k retrieval and embedding-based similarity search. As a result, automation engineers see faster, more robust workflows across domains.
Conclusion
ReasoningBank memory framework for AI agents closes the loop on agent amnesia. It extracts why actions succeed or fail, stores distilled strategies in a JSON store, and retrieves them via embedding-based similarity search. As a result, agents improve task resolve and step efficiency without retraining.
Crucially, failure becomes a learning signal. Because the system compresses experience into title, description, and content items, memory quality outperforms memory quantity. Therefore precise top-k retrieval and careful consolidation drive measurable gains across WebArena, Mind2Web, and SWE-Bench-Verified.
AI Generated Apps leads in delivering pragmatic AI automation and learning solutions. Explore their tools for workflow automation, memory-driven agents, and AI-powered learning systems. They help teams boost productivity and accelerate model-driven processes.
Call to action: Explore AI Generated Apps to prototype memory-enabled agents and automation pipelines. Website: aigeneratedapps.com Twitter/X: @aigeneratedapps Facebook: https://www.facebook.com/aigeneratedapps Instagram: aigeneratedapps
Frequently Asked Questions (FAQs)
What is ReasoningBank memory framework for AI agents and how does it solve agent amnesia?
ReasoningBank stores distilled reasoning as structured memory items. Each item contains a title, a description, and content, so agents avoid raw action logs. The system places items in a JSON store with precomputed embeddings, then uses embedding-based similarity search to return the most relevant memory. Agents therefore reuse prior solutions and avoid repeating the same mistakes.
How do the retrieval extraction consolidation stages work in practice?
- Retrieval uses embedding-based similarity search and top-k retrieval to find relevant items.
- Extraction runs an LLM-as-a-Judge to label trajectories with Success or Failure. It then distills why outcomes happened into human-readable strategies.
- Consolidation merges new items into the JSON store, deduplicates entries, and recomputes embeddings to keep search fast.
What role does MaTTS play and what are parallel and sequential scaling modes?
MaTTS drives test-time scaling and orchestrates diverse trajectories as contrastive signals. In parallel scaling, the system runs multiple trajectories concurrently to explore diversity. In sequential scaling, it runs trajectories one after another to refine the search. MaTTS produces a feedback loop of self-contrast and self-refinement that strengthens memory quality. For example, MaTTS raises WebArena success rates from 46.7% to 56.3% with Gemini-2.5-Pro.
Does ReasoningBank require retraining the model to improve performance?
No. The framework improves agent behavior at test time without any model weight updates. Instead, it leverages memory quality and retrieval precision to produce emergent, learning-like dynamics. Consequently, teams can boost resolve rates and reduce step counts without expensive retraining.
How can teams adopt ReasoningBank for workflow automation?
Start by logging trajectories and running an LLM-as-a-Judge to label outcomes. Store distilled items in a JSON store with embeddings. Then apply top-k retrieval and iterate with MaTTS to improve memory coverage. Finally, measure success rate, resolve percentage, and step efficiency to validate gains.
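The final measurement step can be as simple as the sketch below. The run-record schema (`success`, `steps` fields) and the `summarize_runs` helper are assumptions for illustration.

```python
def summarize_runs(runs):
    # Compute success rate and average step count from run records
    # (hypothetical schema: {"success": bool, "steps": int}).
    n = len(runs)
    return {
        "success_rate": sum(r["success"] for r in runs) / n,
        "avg_steps": sum(r["steps"] for r in runs) / n,
    }

# Compare a no-memory baseline against memory-enabled runs (toy data).
baseline = [{"success": False, "steps": 12}, {"success": True, "steps": 10}]
with_memory = [{"success": True, "steps": 9}, {"success": True, "steps": 8}]
report = {"baseline": summarize_runs(baseline),
          "memory": summarize_runs(with_memory)}
```

Tracking both metrics matters: the reported gains show up as higher resolve rates and fewer steps per task.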