The Architecture Problem

Agent Memory: The Constraint Nobody Talks About

Context windows and vector databases are solving the wrong problem — persistent, reliable agent memory remains genuinely unsolved

By Jakub Jirák | Jan 7, 2027 | 8 min read

Tags: ai-agents, agent-architecture, context-window, memory-systems, agentic-ai

The memory problem in large language models became a minor cultural touchstone in 2023 and 2024, when users discovered that the AI they were talking to had no recollection of their last conversation. The discourse was framed around “ChatGPT doesn’t remember me,” and the industry’s response was persistent memory features that store user preferences, past conversations, and accumulated context across sessions. This is a useful product feature, and it completely misses the deeper problem.

The memory problem for autonomous agents is not about whether the agent remembers your preferred name. It is about how an agent maintains a coherent, reliable model of the world it is operating in, across the full duration of a complex task, without the information it is working from going stale, being corrupted or contradicted, or simply falling outside the window it can attend to.

Start with the context window as it actually works. A frontier model in 2027 can attend to hundreds of thousands of tokens simultaneously — an enormous improvement over the 4,000-token window of 2022. This feels like it should solve the memory problem. It does not, for several reasons.

First, attending is not the same as remembering reliably. Even within the context window, information is not attended to uniformly. This is the “lost in the middle” phenomenon: when relevant information is positioned in the middle of a very long context, models tend to underweight it compared to information at the beginning and end. For an agent operating over a long task, constraints established early in the run drift toward the middle of the context as tool output and intermediate results pile up after them, which makes this a functional failure mode regardless of the nominal context length.

Second, context windows are expensive. A hundred-thousand-token context, filled on every inference call, adds up quickly across a multi-step agent task. Agents making fifty tool calls per task, each with a full context reload, consume context-window tokens at a rate that makes the economics of production deployment challenging. The industry response — selective context loading, retrieval-augmented architectures that pull in relevant information as needed — partially addresses the cost problem but introduces a new problem: the agent must decide what information is relevant to load, which requires knowing what it does not know, which is precisely the capability that the memory system is supposed to provide.
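To make the cost argument concrete, here is a back-of-the-envelope sketch. The hundred-thousand-token context and the fifty tool calls come from the paragraph above; the price per million input tokens is a placeholder assumption, not a quoted rate from any provider, and the sketch ignores output tokens and any prompt-caching discount.

```python
def input_tokens_per_task(context_tokens: int, tool_calls: int) -> int:
    """Total input tokens if the full context is re-sent on every call."""
    return context_tokens * tool_calls


def input_cost_usd(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-side cost only; output tokens and caching discounts are ignored."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Figures from the text: 100,000-token context, 50 tool calls per task.
total = input_tokens_per_task(context_tokens=100_000, tool_calls=50)

# ASSUMPTION: placeholder price, chosen only to make the arithmetic visible.
ASSUMED_USD_PER_MILLION = 3.00
cost = input_cost_usd(total, ASSUMED_USD_PER_MILLION)

print(f"{total:,} input tokens per task")
print(f"${cost:.2f} per task at the assumed rate")
```

Whatever the real price is, the structure of the calculation is the point: cost scales with context size multiplied by the number of calls, so trimming either one pays off linearly.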

The vector database approach, which has become the de facto standard for agent memory in production systems, treats memory as a retrieval problem. Important information is embedded and stored; at inference time, the agent queries the memory store based on the current context and retrieves semantically similar information to include in its working context. This works well for factual recall of specific information that can be retrieved by similarity to a query. It works poorly for several important memory operations that agents need.
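The retrieval loop described above can be sketched in a few lines. This is a toy under stated assumptions: `embed` is a word-count stand-in for a real embedding model, `MemoryStore` is a hypothetical in-memory class rather than any particular vector database, and the stored sentences are invented for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryStore:
    """Hypothetical pull-based memory: nothing surfaces unless queried for."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def query(self, text: str, k: int = 1) -> list[str]:
        q = embed(text)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [stored for stored, _ in ranked[:k]]


store = MemoryStore()
store.add("the staging database runs on port 5433")
store.add("do not send emails to customers without human approval")
store.add("the billing service owns the invoices table")

# Factual recall works when the query resembles the stored text.
top = store.query("which port does the staging database use")
print(top)
```

Note what determines the result: only similarity between the query and the stored text. Anything the agent does not think to ask about stays in the store.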

It works poorly for constraint enforcement. If an agent was told “do not send emails to customers without human approval,” that constraint needs to be active throughout the task, not retrieved when the agent happens to be thinking about email-related matters. Semantic retrieval is pull-based — the agent fetches what is relevant to its current operation. Constraints need to be push-based — surfaced proactively whenever relevant, regardless of whether the agent is currently querying in that direction. Building push-based constraint memory on top of pull-based retrieval architectures is non-trivial and most current deployments have not done it adequately.
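One way to approximate push-based constraints is to pin them outside the retrieval path entirely. The sketch below is a minimal illustration, not a description of any specific deployment: `PinnedConstraints`, the tool name `send_email`, and the prompt layout are all hypothetical. It combines two mechanisms: every prompt gets the rules prepended regardless of what retrieval returned, and a hard check outside the model gates the sensitive tool.

```python
from dataclasses import dataclass, field


@dataclass
class PinnedConstraints:
    """Constraints that are pushed into every step instead of being retrieved."""

    rules: list[str] = field(default_factory=list)
    # Hypothetical tool names that must never run without a human sign-off.
    approval_required_tools: set[str] = field(default_factory=set)

    def build_context(self, retrieved: list[str], instruction: str) -> str:
        """Prepend every rule, whatever the retrieval step happened to return."""
        sections = ["CONSTRAINTS (always active):"]
        sections += [f"- {rule}" for rule in self.rules]
        sections.append("RETRIEVED MEMORY:")
        sections += [f"- {item}" for item in retrieved] or ["- (none)"]
        sections.append(f"CURRENT STEP: {instruction}")
        return "\n".join(sections)

    def allows(self, tool_name: str, human_approved: bool = False) -> bool:
        """Hard check outside the model: block gated tools unless approved."""
        return human_approved or tool_name not in self.approval_required_tools


constraints = PinnedConstraints(
    rules=["Do not send emails to customers without human approval."],
    approval_required_tools={"send_email"},
)

# The current step has nothing to do with email, yet the rule is still present.
context = constraints.build_context(
    retrieved=["The staging database runs on port 5433."],
    instruction="Summarise open invoices for the account.",
)
print(context)
print(constraints.allows("send_email"))
```

The prompt-level injection keeps the rule in front of the model; the `allows` check is the part that does not depend on the model paying attention to it.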

Vector stores also handle time poorly. A fact that was true six months ago and has since changed is stored with full confidence. Retrieval returns it without indicating that it might be stale. For agents operating in dynamic environments — live financial data, current inventory levels, evolving regulatory requirements — relying on a memory store without explicit temporal tracking produces confident action on outdated information.
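Explicit temporal tracking can be as simple as storing an observation time and a freshness budget next to each fact, then labelling stale entries before they reach the agent. The sketch below assumes a hypothetical `Fact` record and a per-fact `max_age`; the facts and dates are invented for illustration, and a real system would need to decide those budgets per data source.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Fact:
    text: str
    observed_at: datetime
    max_age: timedelta  # ASSUMPTION: each kind of fact gets a freshness budget


def annotate(fact: Fact, now: datetime) -> str:
    """Label a retrieved fact so staleness is visible instead of silent."""
    age = now - fact.observed_at
    if age > fact.max_age:
        return f"[STALE, observed {age.days} days ago, re-verify] {fact.text}"
    return f"[fresh] {fact.text}"


now = datetime(2027, 1, 7, tzinfo=timezone.utc)

# Illustrative facts: inventory changes hourly, policy changes rarely.
stock = Fact(
    text="Warehouse A holds 120 units of SKU-42",
    observed_at=datetime(2026, 7, 7, tzinfo=timezone.utc),
    max_age=timedelta(hours=1),
)
policy = Fact(
    text="Refund window is 30 days",
    observed_at=datetime(2026, 12, 20, tzinfo=timezone.utc),
    max_age=timedelta(days=90),
)

for fact in (stock, policy):
    print(annotate(fact, now))
```

The label does not make the fact true again; it only tells the agent which retrieved items need a live check before acting on them.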

The more fundamental problem is what memory researchers call “episodic continuity” — the ability to maintain a coherent model of “what happened so far in this task” that is more than a collection of stored facts. Human task execution depends on a continuous narrative thread: I started here, I found this out, I made this decision, I hit this obstacle, I adjusted my approach, and this is where I am now. That narrative provides the context for interpreting new information and making current decisions.

LLM agents do not naturally have episodic continuity. Each inference call, even with a memory system, reconstructs the agent’s operational context from stored components rather than maintaining it as a continuous experience. The reconstruction is usually good enough for short tasks. For long tasks, the reconstruction tends to produce subtle inconsistencies — the agent “remembers” the outcomes of steps it took but not the reasoning behind them, which means it cannot adjust its approach intelligently when those earlier decisions prove wrong.

Several teams building long-running autonomous agents have addressed this by maintaining an explicit “agent journal” — a natural-language running summary of task progress that is injected into the agent’s context at each step. This is essentially an external episodic memory. It works reasonably well and is much simpler to implement than more sophisticated approaches. Its limitations are that the summary must be maintained by either the agent itself (which introduces errors when the agent’s self-account diverges from what it actually did) or by a separate monitoring system (which adds cost and complexity).
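A journal of this kind might look like the following minimal sketch. The class and field names are hypothetical. Two design choices are assumptions on my part rather than something the teams above are known to do: each entry records the reasoning alongside the outcome, and `record` is meant to be called by the harness from actual tool results rather than from the agent's own account.

```python
from dataclasses import dataclass, field


@dataclass
class JournalEntry:
    step: int
    action: str
    outcome: str
    reasoning: str  # why the step was taken, not only what happened


@dataclass
class AgentJournal:
    """Hypothetical external episodic memory, rendered into each prompt."""

    entries: list[JournalEntry] = field(default_factory=list)

    def record(self, action: str, outcome: str, reasoning: str) -> None:
        step = len(self.entries) + 1
        self.entries.append(JournalEntry(step, action, outcome, reasoning))

    def render(self, last_n: int = 10) -> str:
        """Natural-language summary of the most recent steps."""
        lines = ["TASK JOURNAL (most recent last):"]
        for e in self.entries[-last_n:]:
            lines.append(f"{e.step}. {e.action} -> {e.outcome} (why: {e.reasoning})")
        return "\n".join(lines)


journal = AgentJournal()
journal.record(
    action="ran the test suite",
    outcome="3 failures in the billing module",
    reasoning="needed a baseline before changing anything",
)
journal.record(
    action="pinned the date library to the previous major version",
    outcome="2 of 3 failures resolved",
    reasoning="failures started after the dependency upgrade",
)

print(journal.render())
```

Keeping the "why" in each entry is what lets a later step revisit an earlier decision instead of only seeing its result.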

There is also the question of what I would call “world model coherence” — the agent’s internal representation of the state of the environment it is operating in. A well-functioning agent engaged in a task that involves multiple external systems needs to maintain a consistent model of those systems’ states: what changes it has already made, what the current state is, where there are known inconsistencies. When this model is incomplete or inconsistent, the agent takes actions based on a false picture of the world, producing errors that look like reasoning failures but are actually memory failures.

This problem is acute in coding agents. An agent modifying a codebase needs to maintain an accurate model of what it has already changed: which files have been modified, which tests are currently passing, which dependencies have been updated. If that model is incomplete, the agent will make contradictory changes, repeat changes it already made, or introduce bugs by not accounting for earlier modifications. The production engineering response has been to give agents explicit access to git diff and test output as ground-truth state queries rather than relying on the agent’s internal model, effectively replacing the agent’s memory with live system state. This works, but only for the subset of world state that can be queried in real time, which in most agentic tasks is a minority of the state that matters.
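As an illustration of a ground-truth state query, the sketch below asks git which files have changed instead of trusting the agent's recollection. It uses `git status --porcelain`, whose short format puts a two-character status code, a space, and then the path on each line. The function names are hypothetical, and the parser is deliberately simple: renamed entries (shown as `old -> new`) and quoted paths are stored as-is rather than split.

```python
import subprocess


def parse_porcelain(output: str) -> dict[str, str]:
    """Map path -> two-character status code from `git status --porcelain`."""
    changes: dict[str, str] = {}
    for line in output.splitlines():
        if len(line) < 4:
            continue  # skip blank or malformed lines
        # Columns 0-1 are the status code, column 2 is a space, the rest is the path.
        changes[line[3:]] = line[:2]
    return changes


def working_tree_changes(repo_dir: str) -> dict[str, str]:
    """Query the live repository state; raises if git fails or is missing."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    # Do not strip stdout: a leading space is part of the status code.
    return parse_porcelain(result.stdout)


# Parsing a captured sample, so the example runs without a repository.
sample = " M src/app.py\n?? tests/test_new.py\nA  README.md\n"
parsed = parse_porcelain(sample)
print(parsed)
```

In an agent loop, `working_tree_changes(repo_dir)` would be called before each edit step and its result injected into the context, so the model's picture of "what I already changed" is rebuilt from the repository each time.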

The honest assessment of agent memory in 2027 is that we have several partial solutions, each covering a different part of the problem, none of them adequate for the full range of tasks that production deployments require. Long context windows help but have retrieval reliability and cost problems. Vector stores handle factual retrieval but not constraints, temporal dynamics, or episodic continuity. External journal systems help with episodic continuity but require careful engineering to maintain accurately. Live state queries provide reliable ground truth for queryable environment state but cannot cover the full operational context.

The deeper reason these solutions remain partial is that they all rest on a database-style analogy to human memory (storage, retrieval, recall) applied to a fundamentally different computational substrate, and the analogy itself is loose. Human memory is not a database. It is a dynamic reconstruction system, constantly revising and recontextualizing past information in light of present context, maintaining narrative continuity not by storing a complete record but by maintaining schematic structures that guide retrieval and interpretation. Building an equivalent system for LLM agents would require architectural changes that go well below the level of prompting strategies and vector stores.

The teams making the most progress on this problem are working at the model architecture level, building agents that maintain persistent state representations across inference calls rather than reconstructing context from storage on each call. This is promising work, genuinely different from the retrieval-augmented approaches that dominate current practice, and probably the direction from which a real solution will come. It is also several years from production maturity.

In the meantime, the practical answer for production agent deployments is: design tasks to be short enough that context management is not a primary constraint, use explicit constraint injection rather than relying on the agent to remember its constraints, provide live-query access to system state wherever possible, and treat the agent’s self-reported model of task progress with appropriate skepticism. None of this is elegant. All of it works.

There is an additional memory failure mode specific to multi-session tasks — work that spans multiple agent sessions rather than running continuously in a single context window. A task that takes a week of real time, with the agent running for a few hours each day, must reconstruct its operational context at the start of each session from stored information. The quality of that reconstruction determines whether the agent can continue coherently from where it left off or whether it effectively starts fresh each session with only the information that was explicitly stored.

Most current session-boundary memory architectures store facts but not reasoning momentum. The agent can recall what it found out, but not the cognitive posture it had developed — the specific frame through which it was interpreting new information, the provisional hypotheses it was testing, the implicit prioritizations that had been guiding its choices. Restarting a complex analysis task with a fresh agent session that has access to all the stored facts but lacks the accumulated interpretive frame often produces work that is technically correct but incoherent at the level of strategic direction.

The most thoughtful practitioners working on long-running agent tasks have started treating memory management as a first-class engineering discipline within agent design — not a problem to be solved by better retrieval systems but a design constraint to be actively managed throughout the task lifecycle. This includes explicitly checkpointing the agent’s “reasoning state” at regular intervals (not just its factual findings), designing tasks with natural pause points where context can be refreshed without losing coherence, and building human review into the context-reconstruction process rather than relying on the agent to do it unassisted.
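A reasoning-state checkpoint could be sketched as a small serialisable record that sits alongside the stored facts. Everything here is an assumption for illustration: the `ReasoningCheckpoint` name, the choice of fields, and JSON as the storage format. The `reviewed_by_human` flag is one simple way to represent the human review step mentioned above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ReasoningCheckpoint:
    """Hypothetical session-boundary record: findings plus interpretive frame."""

    findings: list[str] = field(default_factory=list)
    working_hypotheses: list[str] = field(default_factory=list)
    priorities: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    reviewed_by_human: bool = False

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, raw: str) -> "ReasoningCheckpoint":
        return cls(**json.loads(raw))


# End of one session: save what was found and how it was being interpreted.
checkpoint = ReasoningCheckpoint(
    findings=["Churn is concentrated in accounts onboarded in Q3"],
    working_hypotheses=["The Q3 onboarding flow change caused the churn"],
    priorities=["Compare Q3 cohorts before looking at pricing"],
    open_questions=["Did support ticket volume change in the same period?"],
)
saved = checkpoint.to_json()

# Start of the next session: restore, then gate on human review before resuming.
restored = ReasoningCheckpoint.from_json(saved)
print(restored.working_hypotheses)
print(restored.reviewed_by_human)
```

The fields other than `findings` are the ones most session stores drop, and they are what the next session needs in order to continue the same line of analysis rather than start a new one.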

The comparison to distributed systems engineering is apt. A distributed system that does not carefully manage state across node failures and restarts will produce corrupted or inconsistent results. An agent system that does not carefully manage memory across context boundaries will produce work that drifts from its intended trajectory in ways that are hard to detect and expensive to correct. The engineering discipline for managing distributed system state took a decade to mature. The equivalent discipline for agent memory is beginning to develop and will probably take a similar amount of time to reach a reliable body of practice. In the meantime, the teams that treat it as a serious engineering problem rather than a product feature are producing significantly better outcomes.
