Agentics: Memorizing Session Transcripts Isn't Useful
Keep track of artifacts, not scratch. Alt title: Claude, please stop trying to memorize random crap
We have found zero performance benefit on SWE tasks when agents have search access to their previous transcript sessions, provided they have access to other forms of context. We also have not found much benefit in trying to automatically trawl through session transcripts to improve agent context, unless there is a human in the loop.
This was pretty surprising.
Intuitively it feels like there's a lot of valuable information in a transcript between an agent and an engineer. Maybe it would have information about why the code exists, about user intent. Or it might have the other approaches that a user tried and discarded. At the least, it would have some amount of additional context that the agent could use to augment its understanding. I believed this so strongly that my company built an entire product around this concept. I used to tell folks that "session transcripts were the new oil," that they were more valuable than the code itself.
Other people have clearly had similar thoughts, which is why there are so many different tools to do session backed memory, including (of course) Claude Code itself.
I think the most common architecture is to do something like:
Store all transcripts across an organization in a DB
Put a vector search, an elastic search, or a SQL search layer in front of it. Ambitious teams will use all three. Maybe graphs will be involved.
Make this available to the agent using an MCP, or by exposing a cli with skills.
For us, this additional work doesn't seem to make a bit of difference. If anything, based on many months of testing with and without session search access, it may make the models worse.
Why might this be true?
One thing our team cares a lot about is coding artifacts. We don't really write code by hand anymore. In order to make PRs legible, we emphasize good commit messages, good pr messages, and comprehensive documentation. Every code change comes with extensive metadata that is committed alongside the code. When our agents do work on a piece of code, they are instructed to go look at the docs and the previous PRs.
In other words, the agent is already distilling all of the information that is valuable about a transcript, and storing it where it is needed and easily accessible. So when the agent uses a transcript search server, it ends up spending tokens reading things it already knows, while picking up all the stuff that the agent decided not to write down in the first place. Maybe, every now and then, there's some useful nugget of information in there. But most of the time, the agent is just looking at a pseudo nonsensical scratch pad and wasting precious tokens to do so.
The agents are also terrible at actually removing context, which is a critical capability for maintaining long term memory. I mean, across literal thousands of sessions, I've never seen it happen even once.
This is not a trait that can be removed with some clever prompt engineering. Agents don't have state, so they have to assume everything in their input context window is the ground truth. Every line of code, every existing bit of memory, every token is treated as an expression of intent -- even if that code or that memory was generated from a random decision made by some previous agent session, never reviewed or even understood by a human. This intent drift compounds the more the agent tries to autonomously build up a memory base.
As far as I am aware, there are zero coding benchmarks that assume the input data is corrupt. In fact, the models are penalized for assuming that the input data is wrong. This is partially an alignment issue too -- we don't want to have agents doing unintended things, and there isn't an easy way to thread the needle of “don't delete the codebase” and “do delete some of the input context.”
Since models can't actually garden their own memory, automatic memorization ends up in the same place: a load of garbage eating tokens, bloating bills, and degrading model quality.
Net net, I've become really bearish on tools that index and store and surface in session transcripts to an agent. The session transcript may be useful for team observability, but it won't make your agents better.
That doesn't mean that agents have no role in learning context over time. We use our internal nori bots to review everything that happened at the company each week across PRs, slack, drive, etc. And they then propose a set of changes to our built in nori skillsets, tagging the team in slack. These are all default rejected. In order to accept a change, you have to go in and actually look at the diff and make sure it fits the intent.
We accept less than 20% of these. Which means 80% of these “automatic” updates would've made the model worse. I can't imagine how much more unsustainable that would be if a multi-hundred person org were all saving these “updates” automatically all the time.



In this way it is most human.