Emergence is The *New* New State of the Art in Agent Memory

Insights
June 19, 2025
Marc Pickett

Paul Haley

Prakhar Dixit

Jeremy Hartman

In a post, Calvin Ku explains why LongMemEval is the best benchmark for long-term memory. We recommend reading his post in full, but the upshot is that LongMemEval goes far beyond simple needle-in-the-haystack tasks (which can be easily hacked) to tasks that require true long-term memory. Meanwhile, the folks at Zep and Letta have written convincingly that there is far more to memory than RAG.


We strongly agree that there's more to memory than RAG, and that LongMemEval is the best long-term memory benchmark out there. However, we've developed RAG-like methods that achieve state of the art on LongMemEval_S. At a median latency comparable to Zep's (3.2 s/item for Zep vs. 5 s/item for us), we achieved 86% accuracy on LongMemEval, compared to Zep's 71.2%. Note that this is even better than Oracle GPT-4o's 82.4%, which is given only the relevant sessions. Since we agree there's more to memory than RAG, yet RAG-like methods have largely solved the best LTM benchmark available, our conclusion is that even the best current benchmark still isn't capturing important aspects of memory.

How did we do it?

As proof of concept, we describe two simple models: EmergenceMem Simple and EmergenceMem Simple Fast. These achieve 82.4% and 79% accuracy, with median latencies of 7.12 and 3.59 seconds per item, respectively. Our internal model EmergenceMem Internal, about which we will release a high-level description in an upcoming post, achieves 86% accuracy at a median latency of 5 seconds.

To understand our methods, note that LongMemEval_S has 500 haystack-question pairs.  Each haystack has roughly 50 dated conversations or sessions, each with a sequence of turns alternating between an assistant and a user.  Given a haystack, a memory system can process it agnostic to the question.  After processing, the system is given the question and uses the processed haystack to answer the question.  Latency is measured for this latter process.  (In practice, one would process many questions off a single haystack, amortizing the cost of processing the haystack.)
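To make the setup concrete, here is a minimal sketch (in Python, with hypothetical field names of our own choosing) of how a single LongMemEval_S item might be represented; the actual benchmark distributes JSON with its own schema.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str   # text of the message

@dataclass
class Session:
    session_id: str
    date: str          # each session is dated, e.g. "2023-05-14"
    turns: list[Turn]  # alternating user/assistant messages

@dataclass
class HaystackItem:
    sessions: list[Session]  # roughly 50 dated sessions per haystack
    question: str            # revealed only after the haystack is processed
    answer: str              # gold answer used for scoring

# A memory system first ingests item.sessions (question-agnostic), then is
# given item.question; only this second, question-answering step counts
# toward the reported latency.
```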

To answer questions, our 82.4% model, EmergenceMem Simple, builds a basic RAG funnel with two subtle improvements. Neither is earth-shattering, but both yield substantial gains:

  1. Match on turns, but retrieve entire sessions, where a session's retrieval score is a Normalized Discounted Cumulative Gain (NDCG)-style sum over the ranks of its turns after cross-encoder reranking. For example, if the top-ranked turns, in order, came from sessions 37, 52, 37, 37, 29, 52, and 12, then session 37 gets a score based on its turns' ranks of 1, 3, and 4: 1/log2(2) + 1/log2(4) + 1/log2(5). (This scoring is sketched after the list.)
  2. Given these conversations, prompt the model (gpt-4o-2024-08-06) to generate structured output relevant to the question, including thoughts, facts, turns, and (finally) the answer. This prompting encourages the model to reason through its inputs to make the best use of the information. (Also sketched after the list.)
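The session-scoring step in item 1 can be sketched as follows. This is our reading of the description above (a DCG-style sum over the ranks of a session's turns after reranking), using the standard 1/log2(rank + 1) discount, not the exact production code:

```python
import math
from collections import defaultdict

def score_sessions(ranked_turns: list[tuple[str, int]]) -> dict[int, float]:
    """Aggregate reranked turn hits into per-session retrieval scores.

    ranked_turns: turns in descending cross-encoder order, as
                  (turn_text, session_id) pairs.
    Each of a session's turns contributes 1 / log2(rank + 1), where rank is
    its 1-based position in the reranked list.
    """
    scores: dict[int, float] = defaultdict(float)
    for rank, (_turn, session_id) in enumerate(ranked_turns, start=1):
        scores[session_id] += 1.0 / math.log2(rank + 1)
    return dict(scores)

# Example from the text: the top-ranked turns come from sessions
# 37, 52, 37, 37, 29, 52, 12, so session 37 scores
# 1/log2(2) + 1/log2(4) + 1/log2(5).
ranked = [("...", s) for s in (37, 52, 37, 37, 29, 52, 12)]
top_sessions = sorted(score_sessions(ranked).items(), key=lambda kv: -kv[1])
```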
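Item 2's structured generation could look like the sketch below, which assumes the OpenAI Python SDK's structured-output parse helper; the schema fields and prompts here are hypothetical illustrations, not necessarily the ones used in EmergenceMem Simple.

```python
from pydantic import BaseModel
from openai import OpenAI

class MemoryAnswer(BaseModel):
    thoughts: list[str]        # free-form reasoning over the retrieved sessions
    facts: list[str]           # extracted facts that bear on the question
    relevant_turns: list[str]  # turns the model judged relevant
    answer: str                # final answer, produced last

client = OpenAI()

def answer_question(question: str, retrieved_sessions: list[str]) -> MemoryAnswer:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system",
             "content": "Reason over the retrieved sessions, then answer the question."},
            {"role": "user",
             "content": "\n\n".join(retrieved_sessions) + "\n\nQuestion: " + question},
        ],
        response_format=MemoryAnswer,  # structured output following the schema above
    )
    return completion.choices[0].message.parsed
```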

EmergenceMem Simple Fast sacrifices some performance for latency improvements. In particular:

  1. Match on turns and retrieve turns. No reranker.
  2. Make two calls to the LLM. The first extracts key events, facts, etc., from the retrieved turns with respect to the question; the second answers the question given the output of the first call. (See the sketch after this list.)
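A minimal sketch of that two-call pipeline, with our own illustrative prompts rather than the ones used in EmergenceMem Simple Fast:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-2024-08-06"

def chat(system: str, user: str) -> str:
    """Single plain chat completion; no structured output, no reranker."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def answer_fast(question: str, retrieved_turns: list[str]) -> str:
    # Call 1: extract key events/facts from the retrieved turns w.r.t. the question.
    notes = chat(
        "Extract key events, facts, dates, and preferences relevant to the question.",
        "Question: " + question + "\n\nTurns:\n" + "\n".join(retrieved_turns),
    )
    # Call 2: answer the question using only the extracted notes.
    return chat(
        "Answer the question using the provided notes.",
        "Question: " + question + "\n\nNotes:\n" + notes,
    )
```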

If we retrieve 20 turns, we achieve both marginally better median latency (2.96 s/item) and better accuracy (76.75%) than Zep.

The Path to SOTA

We'd like to put our solution in the context of other approaches. (An overview of the results is given in the Results Table below.)

  • First off, we have a baseline that simply answers, "I don't know", which achieves 5.8% accuracy, probably being "correct" for the abstention questions where LongMemEval intentionally asks questions that are unanswerable from the context.
  • After this, we have Best Guess (18.8%), which ignores history and simply attempts to guess the most likely answer, Family-Feud style. For example, for the question "Who graduated first, second and third among Emma, Rachel and Alex?", this method correctly guesses that "Emma graduated first, Rachel second, and Alex third."
  • Naive RAG (52%) is our initial attempt at a turn-based RAG system. We mention this to point out that, although we ultimately achieved SOTA with a RAG-like system, not just any RAG will crack this benchmark. The LongMemEval paper itself offers an exploration of different RAG configurations, the highest achieving 72%.
  • We would also like to point out that a longer context window is not all you need. The context window of GPT-4o can easily fit all the sessions for a question in LongMemEval_S, yet Full context GPT-4o achieves only 60%-64%, depending on implementation details. (We assume differences in prompting account for the 3.6% difference between our version of Full context GPT-4o and Zep's.)
  • Zep (71.2%) is the prior state of the art.
  • While GPT o3, a strong reasoning model, can outperform Zep given the full context, its latency is markedly worse.
  • EmergenceMem Simple Fast has comparable latency to Zep, but with significantly better accuracy.
  • Next, we have the "Accumulator", a variation of Chain of Note in which the model is given the question beforehand, processes the sessions one at a time, and accumulates evidence relevant to answering the question. Note that the Accumulator's latency is linear in the number of sessions and is a miserable 111 seconds. (A sketch appears after this list.)
  • Oracle GPT-4o (82.4%) is the same as our Full-context GPT-4o, but is given only the "oracle" sessions, which are exactly those sessions necessary to answer the question. (Oracle methods rely on oracle information and are disqualified from top rankings.)
  • Next is our simple RAG-like method described above, EmergenceMem Simple, right on par with Oracle GPT-4o.
  • Finally, we come to our best internal method, EmergenceMem Internal. Note that we outperform even the Oracle GPT-4o and Full context GPT o3.
  • After this are two more oracle methods, both identical to their non-oracle counterparts with the exception that their input is only the oracle sessions.
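For concreteness, the Accumulator can be sketched as a simple loop over sessions. This is our paraphrase of the Chain-of-Note-style variation described above, reusing the hypothetical `chat` helper from the earlier sketch:

```python
def accumulate_answer(question: str, sessions: list[str]) -> str:
    # Latency is linear in the number of sessions: one LLM call per session,
    # plus a final call to produce the answer.
    evidence = ""
    for session in sessions:
        evidence = chat(
            "Update the running notes with any evidence relevant to the question.",
            f"Question: {question}\n\nNotes so far:\n{evidence}\n\nNew session:\n{session}",
        )
    return chat(
        "Answer the question from the accumulated notes.",
        f"Question: {question}\n\nNotes:\n{evidence}",
    )
```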

Results Table

(Unless stated otherwise, the LLM is gpt-4o-2024-08-06.)

| Method | Accuracy | Latency (s/item) | Knowledge Update | Multi Session | Single Session Assistant | Single Session Preference | Single Session User | Temporal Reasoning |
|---|---|---|---|---|---|---|---|---|
| "I don't know" | 5.80% | 0.00 | 6.41% | 9.02% | 0.00% | 0.00% | 8.57% | 4.51% |
| Best Guess | 18.80% | 1.73 | 19.23% | 10.53% | 23.21% | 13.33% | 20.00% | 25.56% |
| Naive RAG | 52.00% | 1.48 | 61.54% | 36.84% | 73.21% | 23.33% | 81.43% | 51.88% |
| Full context GPT-4o (Zep) | 60.20% | 31.30 | 78.20% | 44.30% | 94.60% | 20.00% | 81.40% | 45.10% |
| Full context GPT-4o (us) | 63.80% | 10.43 | 75.64% | 47.37% | 98.21% | 13.33% | 82.86% | 60.15% |
| Zep | 71.20% | 3.20 | 83.30% | 57.90% | 80.40% | 56.70% | 92.90% | 62.40% |
| Full context GPT o3 | 76.00% | 19.23 | 78.21% | 64.66% | 98.21% | 30.00% | 92.86% | 78.20% |
| EmergenceMem Simple Fast | 79.00% | 3.59 | 80.77% | 70.68% | 100.00% | 46.67% | 94.29% | 76.69% |
| Accumulator | 81.80% | 111.82 | 87.18% | 78.20% | 94.64% | 36.67% | 97.14% | 78.95% |
| Oracle GPT-4o | 82.40% | 1.35 | 85.90% | 80.45% | 100.00% | 43.33% | 95.71% | 76.69% |
| EmergenceMem Simple | 82.40% | 7.12 | 79.49% | 73.68% | 100.00% | 70.00% | 95.71% | 81.20% |
| EmergenceMem Internal | 86.00% | 5.00 | 83.33% | 81.20% | 100.00% | 60.00% | 98.57% | 85.71% |
| Oracle Accumulator | 88.60% | 4.41 | 94.87% | 81.95% | 98.21% | 70.00% | 97.14% | 87.22% |
| Oracle GPT o3 | 92.00% | 5.50 | 92.31% | 91.73% | 100.00% | 60.00% | 98.57% | 92.48% |

Conclusion

We agree with Zep's intuition that structured memory, such as the knowledge graphs constructed by Graphiti, will be vital for really cracking memory. Calvin suggests that maybe Zep and Letta are RAG after all. Whether or not you agree with this assessment, our simple method is certainly RAGgedy. While LongMemEval did a great job steering memory evaluation away from needle-in-the-haystack problems, we suspect further benchmark development will be necessary.

We've shown that heavy memory architectures might be overkill for LongMemEval, but we believe they will be important when you have a large mix of documents, conversations, and tool results, especially where applications are not as easy as extractive QA (whether or not temporal aspects are involved). Like Zep, we have a flexible memory architecture designed to address a wide variety of agentic use cases, including the obvious cases in document-centric RAG, multimedia, and conversations. This architecture is described further in a concurrent post. It supports various embeddings, n-spaces, relationships, and more, combining embedding and kNN techniques with relational databases and graph techniques.

Our next objective is a more enterprise-focused benchmark requiring memory, especially memory for crafting agents from enterprise documents and conversations with and among team members spanning the agentic lifecycle (i.e., years).  Importantly, this includes semantic and process memory, which are not strongly evaluated in LongMemEval.
