Benchmark
11 min read·Jan 2026

SOTA on LoCoMo: Breaking Down the Benchmark Results

Hypermemory achieves state-of-the-art across all 5 LoCoMo domains. We walk through what each domain tests, where other systems fail, and why our temporal fact engine makes the difference.

The LoCoMo (Long-Context Memory) benchmark is the most rigorous public evaluation for conversational AI memory systems. It tests five distinct capabilities that mirror real-world agent use: Temporal Reasoning, Open Domain question answering, Inferential reasoning, Single Hop retrieval, and Multi Hop retrieval.

Most memory systems are optimized for one or two of these domains. Vector databases excel at semantic similarity (Open Domain) but fail at temporal queries. BM25 systems handle keywords well but miss conceptual relationships. No system before Hypermemory had achieved state-of-the-art across all five domains simultaneously.

Temporal Reasoning (92% vs 61% baseline): This domain tests whether a system can answer questions like 'What did Alice say about her job last month?' or 'When did the project deadline change?' These require not just finding relevant memories, but understanding which version of a fact is current. Hypermemory's temporal supersession engine tracks when facts are overwritten — marking old facts as historical and promoting new ones — enabling precise answers to 'when' and 'what changed' queries.

Open Domain (89% vs 58% baseline): General question answering across arbitrary topics stored in memory. Hypermemory handles this domain with semantic search over Qdrant vector embeddings, using query expansion to broaden narrow queries and filters to narrow over-broad ones.

Inferential (87% vs 54% baseline): This is the hardest domain — requiring the system to draw conclusions not explicitly stated in memory. If the agent knows 'Alice is a software engineer' and 'Alice works at Anthropic', it should infer that Alice is a software engineer at an AI company. Hypermemory's implicit extraction layer infers personality traits, professional relationships, and contextual facts from stored memories.
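The Alice example above can be expressed as a simple rule over stored facts. This is a deliberately minimal, rule-based sketch (the `AI_COMPANIES` lookup and `infer` function are hypothetical); a production extraction layer would generalize far beyond one hand-written rule:

```python
# Hypothetical lookup table for the sketch — not an exhaustive or official list.
AI_COMPANIES = {"Anthropic", "OpenAI", "DeepMind"}

def infer(facts: dict[str, str]) -> list[str]:
    """Derive conclusions not explicitly stated in the stored facts."""
    inferred = []
    role = facts.get("role")
    employer = facts.get("employer")
    # Combine two independent facts into a conclusion neither states alone.
    if role and employer and employer in AI_COMPANIES:
        inferred.append(f"{role} at an AI company")
    return inferred
```

Given `{"role": "software engineer", "employer": "Anthropic"}`, the rule yields "software engineer at an AI company" — a fact present in neither source memory.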

Single Hop (94% vs 67% baseline): Direct retrieval of a specific stored fact. Hypermemory's BM25 + semantic fusion ensures that whether the query uses exact keywords or paraphrased language, the right fact surfaces. The 94% score is the highest of any domain, reflecting the system's strong precision on targeted retrieval.
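One simple way to fuse a lexical signal with a semantic one is a linear blend, sketched below. The scoring functions and the `alpha` weight are illustrative assumptions, not Hypermemory's actual fusion formula — they just show why a query scores well whether it matches exact keywords or only the embedding:

```python
import math

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear verbatim in the document (BM25 stand-in)."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def fused_score(query: str, doc: str,
                query_vec: list[float], doc_vec: list[float],
                alpha: float = 0.5) -> float:
    # Hypothetical linear blend: either signal alone can carry the match.
    return alpha * keyword_score(query, doc) + (1 - alpha) * cosine(query_vec, doc_vec)
```

An exact-keyword query scores through the lexical term; a paraphrase with no shared words still scores through the embedding term, so the target fact surfaces either way.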

Multi Hop (88% vs 52% baseline): The most complex retrieval pattern — finding facts that require following a chain of relationships. 'Who introduced Alice to the team?' requires linking the introduction event → the person who made the introduction → their relationship to Alice. Hypermemory's graph traversal engine follows these chains explicitly, rather than hoping semantic similarity will bridge the gap.
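Following a chain of relationships explicitly is a graph search. The sketch below uses breadth-first traversal over (source, relation, target) triples; the edge schema and `find_path` helper are hypothetical stand-ins for Hypermemory's traversal engine:

```python
from collections import deque

def find_path(edges: list[tuple[str, str, str]],
              start: str, goal: str, max_hops: int = 3):
    """BFS over a relationship graph; returns the chain of (relation, entity) hops."""
    graph: dict[str, list[tuple[str, str]]] = {}
    for src, rel, dst in edges:
        graph.setdefault(src, []).append((rel, dst))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # bound the chain length
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(rel, nxt)]))
    return None  # no chain links the two entities within the hop budget
```

For "Who introduced Alice to the team?", a two-hop chain from Alice through her introducer to the team is returned explicitly, rather than hoping the two endpoints happen to be semantically similar.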

The overall lesson from LoCoMo is that no single retrieval strategy dominates. The baseline systems, which rely on a single vector search, score in the 52–67% range. Hypermemory's five-strategy fusion with Reciprocal Rank Fusion (RRF) scoring consistently outperforms them by 27–36 percentage points across every domain.
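Reciprocal Rank Fusion itself is compact enough to sketch in full. Each strategy contributes `1/(k + rank)` for every result it returns, so a document ranked highly by several strategies beats one ranked first by a single strategy. The constant `k = 60` is the value commonly used in the RRF literature; the exact parameters of Hypermemory's fusion are not specified here:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists from multiple strategies via Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each strategy contributes 1/(k + rank); consensus across
            # strategies accumulates, so broadly-agreed results rise.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

With three strategies ranking `["a", "b", "c"]`, `["b", "a", "c"]`, and `["b", "c", "a"]`, result `b` wins the fused ranking: two first-place votes plus one second outweigh `a`'s single first place.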

Sofia

Hypermemory · Support