In a post, Calvin Ku explains why LongMemEval is the best benchmark for long-term memory. We recommend reading his post in full, but the upshot is that LongMemEval goes far beyond simple needle-in-the-haystack tasks (which can be easily hacked) to tasks that require true long-term memory. Meanwhile, the folks at Zep and Letta have written convincingly that there is far more to memory than RAG.
We strongly agree that there's more to memory than RAG, and that LongMemEval is the best long-term memory benchmark out there. However, we've developed RAG-like methods that achieve state-of-the-art results on LongMemEval_S. At a median latency comparable to Zep's (3.2 s/item for Zep vs. 5 s/item for us), we achieved 86% accuracy on LongMemEval, compared to Zep's 71.2%. Note that this even beats Oracle GPT-4o's 82.4%, where the model is given only the relevant sessions. Since we agree there's more to memory than RAG, and we show that RAG-like methods have largely solved the best LTM benchmark currently available, our conclusion is that even this benchmark still isn't capturing important aspects of memory.
How did we do it?
As proof of concept, we describe two simple models: EmergenceMem Simple and EmergenceMem Simple Fast. These achieve 82.4% and 79% accuracy, with median latencies of 7.12 and 3.59 seconds per item, respectively. Our internal model, EmergenceMem Internal, about which we will release a high-level description in an upcoming post, achieves 86% accuracy at a median latency of 5 seconds per item.
To understand our methods, note that LongMemEval_S has 500 haystack-question pairs. Each haystack has roughly 50 dated conversations, or sessions, each consisting of a sequence of turns alternating between an assistant and a user. Given a haystack, a memory system processes it without seeing the question. After processing, the system is given the question and uses the processed haystack to answer it. Latency is measured only for this latter step. (In practice, one would process many questions off a single haystack, amortizing the cost of processing it.)
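To make this two-phase setup concrete, here is a minimal sketch of the protocol; the class and method names are our own illustration, not LongMemEval's actual interface.

```python
# Illustrative sketch of the LongMemEval_S evaluation protocol described above.
# Class and method names are hypothetical, not LongMemEval's actual API.
import time
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str

@dataclass
class Session:
    date: str      # each session in the haystack is dated
    turns: list[Turn]

class MemorySystem:
    def ingest(self, haystack: list[Session]) -> None:
        """Question-agnostic processing of the haystack (not counted toward latency)."""
        raise NotImplementedError

    def answer(self, question: str) -> str:
        """Answer using the processed haystack; this is the timed step."""
        raise NotImplementedError

def evaluate(system: MemorySystem, haystack: list[Session], question: str) -> tuple[str, float]:
    system.ingest(haystack)                  # amortized across many questions in practice
    start = time.perf_counter()
    prediction = system.answer(question)
    latency = time.perf_counter() - start    # latency is measured only for answering
    return prediction, latency
```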
To answer questions, our 82.4% model, EmergenceMem Simple, builds a basic RAG funnel (a rough sketch of such a funnel appears below) with two subtle improvements. Neither improvement is earth-shattering, but together they yield substantial gains.
EmergenceMem Simple Fast trades some accuracy for lower latency. In particular:
If we set the number of retrieved turns to 20, we achieve both a marginally better median latency (2.96 s/item) and higher accuracy (76.75%) than Zep.
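For readers who want something concrete, here is a minimal sketch of the kind of top-k turn-retrieval funnel described above. The embedding model, prompt, and helper names are placeholders chosen for illustration; they are not the exact components used in EmergenceMem.

```python
# Minimal top-k RAG funnel over conversation turns (illustrative only).
# The embedding model and prompt are placeholder choices, not EmergenceMem's components.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve_turns(question: str, turns: list[str], k: int = 20) -> list[str]:
    # Cosine similarity between the question and every turn; keep the top k.
    turn_vecs = embed(turns)            # in practice, precomputed when ingesting the haystack
    q_vec = embed([question])[0]
    sims = turn_vecs @ q_vec / (np.linalg.norm(turn_vecs, axis=1) * np.linalg.norm(q_vec))
    top = np.argsort(-sims)[:k]
    return [turns[i] for i in top]

def answer(question: str, turns: list[str], k: int = 20) -> str:
    context = "\n".join(retrieve_turns(question, turns, k))
    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Answer using only the retrieved conversation turns."},
            {"role": "user", "content": f"Turns:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

The parameter k is the latency/accuracy knob discussed above: a larger k retrieves more context at the cost of a slower, more expensive answering call.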
The Path to SOTA
We'd like to put our solution in the context of other approaches. (An overview of the results is given in the Results Table below.)
Results Table
(Unless stated otherwise, the LLM is gpt-4o-2024-08-06.)
Conclusion
We agree with Zep's intuition that structured memory, such as the knowledge graphs constructed by Graphiti, will be vital for really cracking memory. Calvin suggests that maybe Zep and Letta are RAG after all. Whether or not you agree with this assessment, our simple method is certainly RAGgedy. While LongMemEval did a great job steering memory evaluation away from needle-in-the-haystack problems, we suspect further benchmark development will be necessary.
We've shown that heavy memory architectures may be overkill for LongMemEval, but we believe they will matter when there is a large mix of documents, conversations, and tool results, especially in applications that go beyond extractive QA (whether or not temporal reasoning is involved). Like Zep, we have designed a flexible memory architecture to address a wide variety of agentic use cases, including the obvious ones in document-centric RAG, multimedia, and conversations. This architecture, described further in a concurrent post, provides various embeddings, n-spaces, relationships, and related structures, combining embedding and kNN techniques with relational databases and graph techniques.
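Purely as an illustration of this combination (not a description of our actual architecture, which the concurrent post covers), one way to pair embedding-based kNN retrieval with relational and graph-style metadata is to store each memory item with both a vector and typed relationships; all names below are hypothetical.

```python
# Illustrative only: a toy memory record combining an embedding vector with
# relational/graph-style metadata. Field names are hypothetical, not our schema.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MemoryItem:
    item_id: str
    kind: str                      # "document", "conversation_turn", "tool_result", ...
    text: str
    embedding: np.ndarray          # vector used for kNN retrieval
    relations: dict[str, list[str]] = field(default_factory=dict)  # e.g. {"mentions": [...], "follows": [...]}

class MemoryStore:
    def __init__(self):
        self.items: dict[str, MemoryItem] = {}

    def add(self, item: MemoryItem) -> None:
        self.items[item.item_id] = item

    def knn(self, query_vec: np.ndarray, k: int = 10) -> list[MemoryItem]:
        # Brute-force cosine kNN; a real system would use an ANN index.
        scored = sorted(
            self.items.values(),
            key=lambda it: -float(query_vec @ it.embedding /
                                  (np.linalg.norm(query_vec) * np.linalg.norm(it.embedding))),
        )
        return scored[:k]

    def neighbors(self, item_id: str, relation: str) -> list[MemoryItem]:
        # Graph-style expansion: follow a typed relationship from a retrieved item.
        ids = self.items[item_id].relations.get(relation, [])
        return [self.items[i] for i in ids if i in self.items]
```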
Our next objective is a more enterprise-focused benchmark that requires memory, especially memory for crafting agents from enterprise documents and from conversations with and among team members spanning the agentic lifecycle (i.e., years). Importantly, this includes semantic and process memory, which LongMemEval does not strongly evaluate.