After reviewing some observations and lessons from our recent state-of-the-art results in conversational memory, we’ll take a close look at the category of questions on which those results, although state of the art, most lagged our performance on other question types. In particular, we’ll employ a memory for personalization, distinct from the memory of conversations, to lift performance significantly.
We recently demonstrated that a conversational memory is effectively necessary for reasonable performance of a personal assistant over moderately long conversational histories. In particular, we showed that even 500 turns of conversation are sufficient to overwhelm large-context-window language models such as OpenAI’s GPT-4o and o3. In doing so, we established several state-of-the-art results on the LongMemEval benchmark.
This shows that giving GPT-4o the full “haystack” from the LongMemEval benchmark produced mediocre results compared with our simplest, fastest use of memory, which retrieves only the pertinent turns of conversation. Our 79% here was a new state-of-the-art result on the benchmark.
The above also shows that a not-quite-so-simple approach matched GPT-4o given the “oracle” subset (that is, “perfect” information, which establishes an upper limit on performance). We matched that oracle level of performance straightforwardly, establishing another state-of-the-art result. (More details are in this blog post.)
Our even better result (which involved a slightly different memory and a simpler chain of thought than suggested by the benchmark authors) approaches the performance of o3, a thinking model, given the oracle subset. As you can see, performance in all cases was worst for the preference category of questions. That’s what we’ll pursue below...
Note the difference between o3 given the full haystack and o3 given the oracle subset. The closer memory can get to providing the oracle subset, the better, in terms of performance and tokens (i.e., latency and cost).
We see this pattern in many applications: the more you put into context, the more you reduce effective performance. Anyone who has long, complex conversations with language models knows that after a while the model gets confused, forgets things, and starts apologizing for doing so.
Leveraging memory about conversations is similar to - but distinct from - simple retrieval-augmented generation (RAG). In RAG, there may be no conversation at all or, typically, the conversation is quite limited, often explicitly so, as in the approach introduced in MemGPT, where older turns roll off into memory while a rolling summary of the prior conversation remains in context.
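To make that pattern concrete, here is a minimal sketch of a rolling-buffer conversational memory in Python; the class, its fields, and the `summarize` helper are illustrative stand-ins, not MemGPT’s actual API:

```python
from dataclasses import dataclass, field

@dataclass
class RollingConversationMemory:
    """Sketch of a rolling buffer: recent turns stay in context, older turns
    roll off into an external store, and a running summary of the rolled-off
    turns remains in the prompt."""
    max_in_context: int = 20
    in_context: list = field(default_factory=list)  # most recent (speaker, text) turns
    archived: list = field(default_factory=list)    # rolled-off turns (external memory)
    summary: str = ""                               # rolling summary of archived turns

    def add_turn(self, speaker: str, text: str, summarize) -> None:
        self.in_context.append((speaker, text))
        while len(self.in_context) > self.max_in_context:
            old = self.in_context.pop(0)
            self.archived.append(old)
            # `summarize` is a hypothetical helper (e.g., an LLM call) that folds
            # the rolled-off turn into the running summary.
            self.summary = summarize(self.summary, old)

    def context_window(self) -> str:
        recent = "\n".join(f"{s}: {t}" for s, t in self.in_context)
        return f"Summary of earlier conversation:\n{self.summary}\n\nRecent turns:\n{recent}"
```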
In recent years, RAG has been applied where there is reference material (e.g., passages of documents). Conversational memory is distinct in several ways, such as each turn being taken by some actor at a specific time. Obviously, in many applications, you want both. Generally, you want even more.
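For concreteness, here is a deliberately naive sketch of turn-level recall, in which each stored turn carries its speaker and timestamp and only the turns most similar to a question are retrieved. The `embed` function and the turn schema are assumptions for illustration; a real system would precompute and index the turn embeddings rather than embedding at query time:

```python
import numpy as np

def retrieve_pertinent_turns(question: str, turns: list[dict], embed, k: int = 10) -> list[dict]:
    """Return the k turns most similar to the question. Each turn is assumed to be
    a dict with 'speaker', 'time', and 'text' keys; `embed` maps text to a vector."""
    q = embed(question)
    scored = []
    for turn in turns:
        v = embed(turn["text"])  # in practice, precomputed and stored in an index
        score = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
        scored.append((score, turn))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [turn for _, turn in scored[:k]]
```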
On LongMemEval, we employed only conversational memory, as there are no documents involved in the benchmark. We found that robust recall at the appropriate granularity was sufficient to set the state of the art. Our weakest performance was on the category of questions involving user preferences. The primary reason for this weakness is that preferences are learned across conversations and are not readily indexed by terms occurring in questions. Preference questions beg for some knowledge of the user that can be extracted and abstracted from conversations.
The “preference” category of LongMemEval involves a mix of queries for which there are no specific answers, only rubrics by which language model responses are evaluated. For example:
Here the rubric reflects inferences drawn from the 500 or so turns of conversation in a haystack. One of our models, by which we mean a combination of memory and language model prompting using that memory, produced the following response:
This was rightfully judged inappropriate for several reasons. Primarily, it didn’t recommend anything despite the explicit request to do so. Secondarily, although it did well at recognizing the interest in cultural diversity, it missed the nuance of intersecting cultural diversity with the language skills expected by the rubric.
Satisfying such rubrics may require significant inference from the haystack of conversations. Asking the language model to do that from scratch (i.e., from raw turns) is a tall order, particularly given the evidence above that language models perform poorly when handed the full haystack. Furthermore, each of our results used a simple prompt that does not reference preferences, interests, recommendations, and so on. Our best state-of-the-art results involved a chain of thought, but it did not encourage thinking about any such things in the course of responding.
As an experiment, we decided to add an additional form of memory. In general, memory may involve different types of memories: for example, the memory of passages of documents and the memory of conversations and the turns thereof. Here, we introduced a third memory, about people. This memory is populated by extracting a biography, as well as likes, dislikes, preferences, and interests, from conversations.
There was a bit of experimentation involved, but I settled on extractions from full haystacks, such as the following for this case:
The user appears to be a well-rounded individual with diverse interests and a keen curiosity about various topics. They have a strong interest in entrepreneurship and business, as evidenced by their engagement with podcasts like 'How I Built This' and 'Masters of Scale.' They are also environmentally conscious, seeking eco-friendly products and exploring renewable energy sources. The user is health-conscious, focusing on organizing medical expenses and exploring strength training exercises. They have a penchant for luxury fashion, as seen in their interest in high-end brands like Gucci and Chanel. The user is also culturally engaged, enjoying travel, cooking, and exploring different cuisines, particularly in Asia and Colombia. They are interested in language learning, particularly French and Spanish, and are keen on improving their social media engagement. The user values local businesses and has a strong sense of social responsibility, as seen in their views on prostitution and bullying. They are also interested in art, both as a consumer and a creator, and have a passion for gardening. The user is tech-savvy, utilizing tools like Garmin for cycling and exploring machine learning applications. They are also interested in gaming, particularly first-person shooters, and enjoy exploring new recipes and culinary experiences.
- Interested in eco-friendly products and renewable energy,
- Health-conscious and organized,
- Appreciates luxury fashion,
- Culturally engaged and enjoys travel,
- Interested in language learning,
- Values local businesses,
- Has a strong sense of social responsibility,
- Interested in art and gardening,
- Tech-savvy and interested in machine learning,
- Enjoys gaming and culinary experiences
Such extractions are a form of memory in that they are performed offline over prior conversations, then stored and used when appropriate. In this case, when a user asks a question, we add the asking user’s profile to the prompt.
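In rough terms, the offline step looks something like the following sketch; the prompt text, the `llm` completion helper, and the in-memory dictionary are illustrative stand-ins for our actual extraction and storage:

```python
# Hypothetical extraction prompt; our production wording differs.
PROFILE_PROMPT = (
    "From the conversations below, write a short biography of the user, then "
    "enumerate their likes, dislikes, preferences, and interests."
)

person_memory: dict[str, str] = {}  # user_id -> extracted profile

def extract_profile(user_id: str, conversations: list[str], llm) -> None:
    """Run offline over a user's prior conversations and store the resulting profile
    so it can be added to the prompt whenever that user asks a question."""
    haystack = "\n\n".join(conversations)
    person_memory[user_id] = llm(f"{PROFILE_PROMPT}\n\n{haystack}")
```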
We tried this out on the preference items in several phases: first just the paragraph, and subsequently both the paragraph and the more specific enumeration. The paragraph alone had little effect on performance. Indeed, its impact was slightly negative, perhaps due to diminished (overwhelmed) attention. Adding the enumeration had a slightly favorable impact.
Although we had intuition about the necessary next steps, I decided to “consult” one of the leading language models. The essence of the conversation involved:
Here are some redacted extracts from that conversation:
The language model got very confused by the judgement here, disagreeing strongly and inappropriately. So I brought it up to speed on the dataset and the difference between the answering model and the judging model. Once that was cleared up, I provided several items judged as failing.
As you will see, we end up going back to placing the personal information after the conversational information and before the concluding chain of thought instruction. Here we strive to avoid unfairly focusing the prompt on preference questions from LongMemEval.1
The additional instruction, placed between the question and the personal information, led to degradation, but it exposed the key issue once we moved the personal information to sit between the conversational data and the final chain of thought instruction.
At this point the prompt is as follows: question, classification, conversation, personal information, chain of thought instruction.
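Sketched in code, with illustrative section labels rather than our exact wording:

```python
def build_prompt(question: str, classification: str, conversation: str,
                 profile: str, cot_instruction: str) -> str:
    """Assemble the prompt in the order described above: question, classification,
    conversational data, personal information, then the concluding chain of
    thought instruction."""
    return "\n\n".join([
        f"Question:\n{question}",
        f"Question type:\n{classification}",
        f"Relevant conversation turns:\n{conversation}",
        f"What we know about the user:\n{profile}",
        cot_instruction,
    ])
```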
The essence of the change was to elaborate “an array of your thoughts” into explicit thinking steps, such as the classification step, an explicit scan of the conversational data, response planning, and so on.
The language model became unhelpful at making further suggestions on its own here. We discussed how the chain of thought was not producing plans specific to, and grounded in, the question, which resulted in generic plans that missed the detailed aspects of personalization.
Eventually, I suggested that we add a bit more granularity to the chain of thought, which had been to output thoughts, pertinent parts of turns, inferences pertinent to responding, and, finally, the response.
It made some suggestions for doing so that were off base, and I found it difficult to steer the conversation away from several poor suggestions, so I modified the chain of thought manually.
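Here is a sketch, not our exact wording, of what the more granular chain of thought instruction might look like with the steps made explicit; the field names are illustrative:

```python
# Hypothetical elaboration of the chain of thought: explicit steps instead of a
# single "array of your thoughts".
COT_INSTRUCTION = """Think before answering. Respond with JSON containing:
  "question_type": your classification of the question,
  "evidence": pertinent parts of the conversation turns above,
  "user_knowledge": inferences from the personal information that bear on the question,
  "plan": a specific plan for the response, grounded in the question and the evidence,
  "response": the final response to the user."""
```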
At this point the model's accuracy on preference questions had improved from 63% to the mid-70s. With a bit more manual experimentation, avoiding any instruction that might be specific to LongMemEval or to queries like those in its preference category, we reached 80% accuracy.
For example, the item whose response was shown above now received a response judged appropriate, as follows:
The moral of this story is that there are different types of memories, used differently in various contexts. Without a new memory of personal information, which involves extraction, storage, indexing, and retrieval, and without proper use of that information in context, it would have been difficult to lift our results on LongMemEval preference questions above those reported in our state-of-the-art results using only conversational memory.