After reviewing some observations and lessons from our recent state-of-the-art results in conversational memory, we’ll take a close look at the category of questions on which those results, although state of the art, most lagged our performance on other question types. In particular, we’ll employ a memory for personalization, distinct from the memory of conversations, to lift performance significantly.
We recently demonstrated that a conversational memory is effectively necessary for reasonable performance of a personal assistant over moderately long conversational histories. In particular, we showed that even 500 turns of conversation are sufficient to overwhelm large-context-window language models such as OpenAI’s GPT-4o and o3. In doing so, we established several state-of-the-art results on the LongMemEval benchmark.
This shows that giving GPT-4o the full “haystack” from the LongMemEval benchmark produced mediocre results compared with our simplest, fastest use of memory, which retrieves only the pertinent turns of conversation. Our 79% here was a new state-of-the-art result on the benchmark.
The above also shows that a not-quite-so-simple approach matched GPT-4o given the “oracle” subset (that is, “perfect” information, which establishes an upper limit on performance). We matched that oracle level of performance straightforwardly, establishing another state-of-the-art result. (More details are in this blog post.)
Our even better result (which involved a slightly different memory and a simpler chain of thought than suggested by the benchmark authors) approaches the performance of o3, a thinking model, given the oracle subset. As you can see, performance in all cases was worst for the preference category of questions. That’s what we’ll pursue below...
Note the difference between o3 given the full haystack and o3 given the oracle subset. The closer memory can get to providing the oracle subset, the better, in terms of performance and tokens (i.e., latency and cost).
We see this pattern in many applications: the more you put into context, the more you reduce effective performance. Anyone who has long, complex conversations with language models knows that after a while the model gets confused, forgets things, and starts apologizing for doing so.
Leveraging memory about conversations is similar to - but distinct from - simple retrieval-augmented generation (RAG). In RAG, there may be no conversation at all or, typically, the conversation is quite limited, often explicitly so, as in the approach introduced in MemGPT, where older turns roll off into memory while a rolling summary of the prior conversation remains in context.
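To make that pattern concrete, here is a minimal sketch of a rolling-buffer conversational memory in Python; the class, its fields, and the `summarize` helper are illustrative stand-ins, not MemGPT’s actual API:

```python
from dataclasses import dataclass, field

@dataclass
class RollingConversationMemory:
    """Sketch of a rolling buffer: recent turns stay in context, older turns
    roll off into an external store, and a running summary of the rolled-off
    turns remains in the prompt."""
    max_in_context: int = 20
    in_context: list = field(default_factory=list)  # most recent (speaker, text) turns
    archived: list = field(default_factory=list)    # rolled-off turns (external memory)
    summary: str = ""                               # rolling summary of archived turns

    def add_turn(self, speaker: str, text: str, summarize) -> None:
        self.in_context.append((speaker, text))
        while len(self.in_context) > self.max_in_context:
            old = self.in_context.pop(0)
            self.archived.append(old)
            # `summarize` is a hypothetical helper (e.g., an LLM call) that folds
            # the rolled-off turn into the running summary.
            self.summary = summarize(self.summary, old)

    def context_window(self) -> str:
        recent = "\n".join(f"{s}: {t}" for s, t in self.in_context)
        return f"Summary of earlier conversation:\n{self.summary}\n\nRecent turns:\n{recent}"
```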
In recent years, RAG has been applied where there is reference material (e.g., passages of documents). Conversational memory is distinct in several ways, such as each turn being taken by some actor at a specific time. Obviously, in many applications, you want both. Generally, you want even more.
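For concreteness, here is a deliberately naive sketch of turn-level recall, in which each stored turn carries its speaker and timestamp and only the turns most similar to a question are retrieved. The `embed` function and the turn schema are assumptions for illustration; a real system would precompute and index the turn embeddings rather than embedding at query time:

```python
import numpy as np

def retrieve_pertinent_turns(question: str, turns: list[dict], embed, k: int = 10) -> list[dict]:
    """Return the k turns most similar to the question. Each turn is assumed to be
    a dict with 'speaker', 'time', and 'text' keys; `embed` maps text to a vector."""
    q = embed(question)
    scored = []
    for turn in turns:
        v = embed(turn["text"])  # in practice, precomputed and stored in an index
        score = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
        scored.append((score, turn))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [turn for _, turn in scored[:k]]
```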
On LongMemEval, we employed only conversational memory, as there are no documents involved in the benchmark. We found that robust recall at the appropriate granularity was sufficient to set the state of the art. Our weakest performance was on the category of questions involving user preferences. The primary reason for this weakness is that preferences are learned across conversations and are not readily indexed by terms occurring in questions. Preference questions beg for some knowledge of the user that can be extracted and abstracted from conversations.
The “preference” category of LongMemEval involves a mix of queries for which there are no specific answers, only rubrics by which language model responses are evaluated. For example:
Here the rubric reflects inferences drawn from the 500 or so turns of conversation in a haystack. One of our models, by which we mean a combination of memory and language model prompting using that memory, produced the following response:
This was rightfully judged inappropriate for several reasons. Primarily, it didn’t recommend anything despite the explicit request to do so. Secondarily, although it did well at recognizing the interest in cultural diversity, it missed the nuance of intersecting cultural diversity with the language skills expected by the rubric.
Satisfying such rubrics may require significant inference from the haystack of conversations. Asking the language model to do that from scratch (i.e., from raw turns) is a tall order, particularly given the evidence above that language models perform poorly when handed the full haystack. Furthermore, each of our results used a simple prompt that does not reference preferences, interests, recommendations, and so on. Our best state-of-the-art results involved a chain of thought, but it did not encourage thinking about any such things in the course of responding.
As an experiment, we decided to add an additional form of memory. In general, memory may involve different types of memories: for example, the memory of passages of documents and the memory of conversations and the turns thereof. Here, we introduced a third memory, about people. This memory is populated by extracting a biography, as well as likes, dislikes, preferences, and interests, from conversations.
There was a bit of experimentation involved, but I settled on extractions from full haystacks, such as the following for this case:
The user appears to be a well-rounded individual with diverse interests and a keen curiosity about various topics. They have a strong interest in entrepreneurship and business, as evidenced by their engagement with podcasts like 'How I Built This' and 'Masters of Scale.' They are also environmentally conscious, seeking eco-friendly products and exploring renewable energy sources. The user is health-conscious, focusing on organizing medical expenses and exploring strength training exercises. They have a penchant for luxury fashion, as seen in their interest in high-end brands like Gucci and Chanel. The user is also culturally engaged, enjoying travel, cooking, and exploring different cuisines, particularly in Asia and Colombia. They are interested in language learning, particularly French and Spanish, and are keen on improving their social media engagement. The user values local businesses and has a strong sense of social responsibility, as seen in their views on prostitution and bullying. They are also interested in art, both as a consumer and a creator, and have a passion for gardening. The user is tech-savvy, utilizing tools like Garmin for cycling and exploring machine learning applications. They are also interested in gaming, particularly first-person shooters, and enjoy exploring new recipes and culinary experiences.
- Interested in eco-friendly products and renewable energy,
- Health-conscious and organized,
- Appreciates luxury fashion,
- Culturally engaged and enjoys travel,
- Interested in language learning,
- Values local businesses,
- Has a strong sense of social responsibility,
- Interested in art and gardening,
- Tech-savvy and interested in machine learning,
- Enjoys gaming and culinary experiences
Such extractions are a form of memory in that they are performed offline over prior conversations, then stored and used when appropriate. In this case, when a user asks a question, we add the asking user’s profile to the prompt.
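In rough terms, the offline step looks something like the following sketch; the prompt text, the `llm` completion helper, and the in-memory dictionary are illustrative stand-ins for our actual extraction and storage:

```python
# Hypothetical extraction prompt; our production wording differs.
PROFILE_PROMPT = (
    "From the conversations below, write a short biography of the user, then "
    "enumerate their likes, dislikes, preferences, and interests."
)

person_memory: dict[str, str] = {}  # user_id -> extracted profile

def extract_profile(user_id: str, conversations: list[str], llm) -> None:
    """Run offline over a user's prior conversations and store the resulting profile
    so it can be added to the prompt whenever that user asks a question."""
    haystack = "\n\n".join(conversations)
    person_memory[user_id] = llm(f"{PROFILE_PROMPT}\n\n{haystack}")
```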
We tried this out on the preference items in several phases: first just the paragraph, and subsequently both the paragraph and the more specific enumeration. The paragraph alone had little effect on performance. Indeed, its impact was slightly negative, perhaps due to diminished (overwhelmed) attention. Adding the enumeration had a slightly favorable impact.
Although we had intuition about the necessary next steps, I decided to “consult” one of the leading language models. The essence of the conversation involved:
Here are some redacted extracts from that conversation:
The language model got very confused by the judgement here, disagreeing strongly and inappropriately. So I brought it up to speed on the dataset and the difference between the answering model and the judging model. Once that was cleared up, I provided several items judged as failing.
As you will see, we end up going back to placing the personal information after the conversational information and before the concluding chain of thought instruction. Here we strive to avoid unfairly focusing the prompt on preference questions from LongMemEval.1
The additional instruction, placed between the question and the personal information, led to degradation, but it exposed the key issue once we moved the personal information to sit between the conversational data and the final chain of thought instruction.
At this point the prompt is as follows: question, classification, conversation, personal information, chain of thought instruction.
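Sketched in code, with illustrative section labels rather than our exact wording:

```python
def build_prompt(question: str, classification: str, conversation: str,
                 profile: str, cot_instruction: str) -> str:
    """Assemble the prompt in the order described above: question, classification,
    conversational data, personal information, then the concluding chain of
    thought instruction."""
    return "\n\n".join([
        f"Question:\n{question}",
        f"Question type:\n{classification}",
        f"Relevant conversation turns:\n{conversation}",
        f"What we know about the user:\n{profile}",
        cot_instruction,
    ])
```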
The essence of the change was to elaborate “an array of your thoughts” into explicit thinking steps, such as the classification step, an explicit scan of the conversational data, response planning, and so on.
The language model became unhelpful at making further suggestions on its own here. We discussed how the chain of thought was not producing plans specific to, and grounded in, the question, which resulted in generic plans that missed the detailed aspects of personalization.
Eventually, I suggested that we add a bit more granularity to the chain of thought, which had been to output thoughts, pertinent parts of turns, inferences pertinent to responding, and, finally, the response.
It made some suggestions for doing so that were off base, and I found it difficult to steer the conversation away from several poor suggestions, so I modified the chain of thought manually.
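Here is a sketch, not our exact wording, of what the more granular chain of thought instruction might look like with the steps made explicit; the field names are illustrative:

```python
# Hypothetical elaboration of the chain of thought: explicit steps instead of a
# single "array of your thoughts".
COT_INSTRUCTION = """Think before answering. Respond with JSON containing:
  "question_type": your classification of the question,
  "evidence": pertinent parts of the conversation turns above,
  "user_knowledge": inferences from the personal information that bear on the question,
  "plan": a specific plan for the response, grounded in the question and the evidence,
  "response": the final response to the user."""
```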
At this point the model's accuracy on preference questions had improved from 63% to the mid-70s. With a bit more manual experimentation, avoiding any instruction that might be specific to LongMemEval or to queries like those in its preference category, we reached 80% accuracy.
For example, the item whose response was shown above now received a response judged appropriate, as follows:
The moral of this story is that there are different types of memories, used differently in various contexts. Without a new memory of personal information, which involves extraction, storage, indexing, and retrieval, and without proper use of that information in context, it would have been difficult to lift our results on LongMemEval preference questions above those reported in our state-of-the-art results using only conversational memory.