Emergence is The *New* New State of the Art in Agent Memory

Insights
June 19, 2025
Marc Pickett

Paul Haley

Prakhar Dixit

Jeremy Hartman

In a post, Calvin Ku explains why LongMemEval is the best benchmark for long-term memory. We recommend reading his post in full, but the upshot is that LongMemEval goes far beyond simple needle-in-the-haystack tasks (which can be easily hacked) to tasks that require true long-term memory. Meanwhile, the folks at Zep and Letta have written convincingly that there is far more to memory than RAG.


We strongly agree that there's more to memory than RAG, and that LongMemEval is the best long-term memory benchmark out there. However, we've developed RAG-like methods that achieve state of the art on LongMemEval_S. At a median latency comparable to Zep's (3.2 s/item for Zep vs. 5 s/item for us), we achieved 86% accuracy on LongMemEval, compared to Zep's 71.2%. Note that this is even better than Oracle GPT-4o's 82.4%, which is given only the relevant sessions. Since we agree there's more to memory than RAG, yet RAG-like methods have largely solved the best LTM benchmark available, our conclusion is that even the best current benchmark still isn't capturing important aspects of memory.

How did we do it?

As proof of concept, we describe two simple models: EmergenceMem Simple and EmergenceMem Simple Fast. These achieve 82.4% and 79% accuracy, with median latencies of 7.12 and 3.59 seconds per item, respectively. Our internal model EmergenceMem Internal, about which we will release a high-level description in an upcoming post, achieves 86% accuracy at a median latency of 5 seconds.

To understand our methods, note that LongMemEval_S has 500 haystack-question pairs.  Each haystack has roughly 50 dated conversations or sessions, each with a sequence of turns alternating between an assistant and a user.  Given a haystack, a memory system can process it agnostic to the question.  After processing, the system is given the question and uses the processed haystack to answer the question.  Latency is measured for this latter process.  (In practice, one would process many questions off a single haystack, amortizing the cost of processing the haystack.)
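To make the setup concrete, here is a minimal sketch (in Python, with hypothetical field names of our own choosing) of how a single LongMemEval_S item might be represented; the actual benchmark distributes JSON with its own schema.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str   # text of the message

@dataclass
class Session:
    session_id: str
    date: str          # each session is dated, e.g. "2023-05-14"
    turns: list[Turn]  # alternating user/assistant messages

@dataclass
class HaystackItem:
    sessions: list[Session]  # roughly 50 dated sessions per haystack
    question: str            # revealed only after the haystack is processed
    answer: str              # gold answer used for scoring

# A memory system first ingests item.sessions (question-agnostic), then is
# given item.question; only this second, question-answering step counts
# toward the reported latency.
```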

To answer questions, our 82.4% model, EmergenceMem Simple, builds a basic RAG funnel with two subtle improvements. Neither is earth-shattering, but both yield substantial gains:

  1. Match on turns, but retrieve entire sessions, where a session's retrieval score is a Normalized Discounted Cumulative Gain (NDCG)-style sum over the ranks of its turns after cross-encoder reranking. For example, if the top-ranked turns, in order, came from sessions 37, 52, 37, 37, 29, 52, and 12, then session 37 gets a score based on its turns' ranks of 1, 3, and 4: 1/log2(2) + 1/log2(4) + 1/log2(5). (This scoring is sketched after the list.)
  2. Given these conversations, prompt the model (gpt-4o-2024-08-06) to generate structured output relevant to the question, including thoughts, facts, turns, and (finally) the answer. This prompting encourages the model to reason through its inputs to make the best use of the information. (Also sketched after the list.)
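The session-scoring step in item 1 can be sketched as follows. This is our reading of the description above (a DCG-style sum over the ranks of a session's turns after reranking), using the standard 1/log2(rank + 1) discount, not the exact production code:

```python
import math
from collections import defaultdict

def score_sessions(ranked_turns: list[tuple[str, int]]) -> dict[int, float]:
    """Aggregate reranked turn hits into per-session retrieval scores.

    ranked_turns: turns in descending cross-encoder order, as
                  (turn_text, session_id) pairs.
    Each of a session's turns contributes 1 / log2(rank + 1), where rank is
    its 1-based position in the reranked list.
    """
    scores: dict[int, float] = defaultdict(float)
    for rank, (_turn, session_id) in enumerate(ranked_turns, start=1):
        scores[session_id] += 1.0 / math.log2(rank + 1)
    return dict(scores)

# Example from the text: the top-ranked turns come from sessions
# 37, 52, 37, 37, 29, 52, 12, so session 37 scores
# 1/log2(2) + 1/log2(4) + 1/log2(5).
ranked = [("...", s) for s in (37, 52, 37, 37, 29, 52, 12)]
top_sessions = sorted(score_sessions(ranked).items(), key=lambda kv: -kv[1])
```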
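Item 2's structured generation could look like the sketch below, which assumes the OpenAI Python SDK's structured-output parse helper; the schema fields and prompts here are hypothetical illustrations, not necessarily the ones used in EmergenceMem Simple.

```python
from pydantic import BaseModel
from openai import OpenAI

class MemoryAnswer(BaseModel):
    thoughts: list[str]        # free-form reasoning over the retrieved sessions
    facts: list[str]           # extracted facts that bear on the question
    relevant_turns: list[str]  # turns the model judged relevant
    answer: str                # final answer, produced last

client = OpenAI()

def answer_question(question: str, retrieved_sessions: list[str]) -> MemoryAnswer:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system",
             "content": "Reason over the retrieved sessions, then answer the question."},
            {"role": "user",
             "content": "\n\n".join(retrieved_sessions) + "\n\nQuestion: " + question},
        ],
        response_format=MemoryAnswer,  # structured output following the schema above
    )
    return completion.choices[0].message.parsed
```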

EmergenceMem Simple Fast sacrifices some performance for latency improvements. In particular:

  1. Match on turns and retrieve turns. No reranker.
  2. Make two calls to the LLM. The first extracts key events, facts, etc., from the retrieved turns with respect to the question; the second answers the question given the output of the first call. (See the sketch after this list.)
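A minimal sketch of that two-call pipeline, with our own illustrative prompts rather than the ones used in EmergenceMem Simple Fast:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-2024-08-06"

def chat(system: str, user: str) -> str:
    """Single plain chat completion; no structured output, no reranker."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def answer_fast(question: str, retrieved_turns: list[str]) -> str:
    # Call 1: extract key events/facts from the retrieved turns w.r.t. the question.
    notes = chat(
        "Extract key events, facts, dates, and preferences relevant to the question.",
        "Question: " + question + "\n\nTurns:\n" + "\n".join(retrieved_turns),
    )
    # Call 2: answer the question using only the extracted notes.
    return chat(
        "Answer the question using the provided notes.",
        "Question: " + question + "\n\nNotes:\n" + notes,
    )
```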

If we retrieve 20 turns, we achieve both marginally better median latency (2.96 s/item) and better accuracy (76.75%) than Zep.

The Path to SOTA

We'd like to put our solution in the context of other approaches. (An overview of the results is given in the Results Table below.)

  • First off, we have a baseline that simply answers, "I don't know", which achieves 5.8% accuracy, probably being "correct" for the abstention questions where LongMemEval intentionally asks questions that are unanswerable from the context.
  • After this, we have Best Guess (18.8%), which ignores history and simply attempts to guess the most likely answer, Family-Feud style. For example, for the question "Who graduated first, second and third among Emma, Rachel and Alex?", this method correctly guesses that "Emma graduated first, Rachel second, and Alex third."
  • Naive RAG (52%) is our initial attempt at a turn-based RAG system. We mention this to point out that, although we ultimately achieved SOTA with a RAG-like system, not just any RAG will crack this benchmark. The LongMemEval paper itself offers an exploration of different RAG configurations, the highest achieving 72%.
  • We would also like to point out that a longer context window is not all you need. The context window of GPT-4o can easily fit all the sessions for a question in LongMemEval_S, yet Full context GPT-4o achieves only 60%-64%, depending on implementation details. (We assume differences in prompting account for the 3.6% difference between our version of Full context GPT-4o and Zep's.)
  • Zep (71.2%) is the prior state of the art.
  • While GPT o3, a strong reasoning model, can outperform Zep given the full context, its latency is markedly worse.
  • EmergenceMem Simple Fast has comparable latency to Zep, but with significantly better accuracy.
  • Next, we have the "Accumulator", a variation of Chain of Note in which the model is given the question beforehand, processes the sessions one at a time, and accumulates evidence relevant to answering the question. Note that the Accumulator's latency is linear in the number of sessions and is a miserable 111 seconds. (A sketch appears after this list.)
  • Oracle GPT-4o (82.4%) is the same as our Full-context GPT-4o, but is given only the "oracle" sessions, which are exactly those sessions necessary to answer the question. (Oracle methods rely on oracle information and are disqualified from top rankings.)
  • Next is our simple RAG-like method described above, EmergenceMem Simple, right on par with Oracle GPT-4o.
  • Finally, we come to our best internal method, EmergenceMem Internal. Note that we outperform even the Oracle GPT-4o and Full context GPT o3.
  • After this are two more oracle methods, both identical to their non-oracle counterparts with the exception that their input is only the oracle sessions.
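For concreteness, the Accumulator can be sketched as a simple loop over sessions. This is our paraphrase of the Chain-of-Note-style variation described above, reusing the hypothetical `chat` helper from the earlier sketch:

```python
def accumulate_answer(question: str, sessions: list[str]) -> str:
    # Latency is linear in the number of sessions: one LLM call per session,
    # plus a final call to produce the answer.
    evidence = ""
    for session in sessions:
        evidence = chat(
            "Update the running notes with any evidence relevant to the question.",
            f"Question: {question}\n\nNotes so far:\n{evidence}\n\nNew session:\n{session}",
        )
    return chat(
        "Answer the question from the accumulated notes.",
        f"Question: {question}\n\nNotes:\n{evidence}",
    )
```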

Results Table

(Unless stated otherwise, the LLM is gpt-4o-2024-08-06.)

| Method | Accuracy | Latency (s/item) | Knowledge Update | Multi Session | Single Session Assistant | Single Session Preference | Single Session User | Temporal Reasoning |
|---|---|---|---|---|---|---|---|---|
| "I don't know" | 5.80% | 0.00 | 6.41% | 9.02% | 0.00% | 0.00% | 8.57% | 4.51% |
| Best Guess | 18.80% | 1.73 | 19.23% | 10.53% | 23.21% | 13.33% | 20.00% | 25.56% |
| Naive RAG | 52.00% | 1.48 | 61.54% | 36.84% | 73.21% | 23.33% | 81.43% | 51.88% |
| Full context GPT-4o (Zep) | 60.20% | 31.30 | 78.20% | 44.30% | 94.60% | 20.00% | 81.40% | 45.10% |
| Full context GPT-4o (us) | 63.80% | 10.43 | 75.64% | 47.37% | 98.21% | 13.33% | 82.86% | 60.15% |
| Zep | 71.20% | 3.20 | 83.30% | 57.90% | 80.40% | 56.70% | 92.90% | 62.40% |
| Full context GPT o3 | 76.00% | 19.23 | 78.21% | 64.66% | 98.21% | 30.00% | 92.86% | 78.20% |
| EmergenceMem Simple Fast | 79.00% | 3.59 | 80.77% | 70.68% | 100.00% | 46.67% | 94.29% | 76.69% |
| Accumulator | 81.80% | 111.82 | 87.18% | 78.20% | 94.64% | 36.67% | 97.14% | 78.95% |
| Oracle GPT-4o | 82.40% | 1.35 | 85.90% | 80.45% | 100.00% | 43.33% | 95.71% | 76.69% |
| EmergenceMem Simple | 82.40% | 7.12 | 79.49% | 73.68% | 100.00% | 70.00% | 95.71% | 81.20% |
| EmergenceMem Internal | 86.00% | 5.00 | 83.33% | 81.20% | 100.00% | 60.00% | 98.57% | 85.71% |
| Oracle Accumulator | 88.60% | 4.41 | 94.87% | 81.95% | 98.21% | 70.00% | 97.14% | 87.22% |
| Oracle GPT o3 | 92.00% | 5.50 | 92.31% | 91.73% | 100.00% | 60.00% | 98.57% | 92.48% |

Conclusion

We agree with Zep's intuition that structured memory, such as the knowledge graphs constructed by Graphiti, will be vital for really cracking memory. Calvin suggests that maybe Zep and Letta are RAG after all. Whether or not you agree with this assessment, our simple method is certainly RAGgedy. While LongMemEval did a great job steering memory evaluation away from needle-in-the-haystack problems, we suspect further benchmark development will be necessary.

We've shown that heavy memory architectures might be overkill for LongMemEval, but we believe they will be important when you have a large mix of documents, conversations, and tool results, especially where applications are not as easy as extractive QA (whether or not temporal aspects are involved). Like Zep, we have a flexible memory architecture designed to address a wide variety of agentic use cases, including the obvious cases in document-centric RAG, multimedia, and conversations. This architecture is described further in a concurrent post. It supports various embeddings, n-spaces, relationships, and more, combining embedding and kNN techniques with relational databases and graph techniques.

Our next objective is a more enterprise-focused benchmark requiring memory, especially memory for crafting agents from enterprise documents and conversations with and among team members spanning the agentic lifecycle (i.e., years).  Importantly, this includes semantic and process memory, which are not strongly evaluated in LongMemEval.
