Benchmark
11 min read·Jan 2026

SOTA on LoCoMo: Breaking Down the Benchmark Results

Hypermemory achieves state-of-the-art across all 5 LoCoMo domains. We walk through what each domain tests, where other systems fail, and why our temporal fact engine makes the difference.

The LoCoMo (Long-Context Memory) benchmark is the most rigorous public evaluation for conversational AI memory systems. It tests five distinct capabilities that mirror real-world agent use: Temporal Reasoning, Open Domain question answering, Inferential reasoning, Single Hop retrieval, and Multi Hop retrieval.

Most memory systems are optimized for one or two of these domains. Vector databases excel at semantic similarity (Open Domain) but fail at temporal queries. BM25 systems handle keywords well but miss conceptual relationships. No system before Hypermemory had achieved state-of-the-art across all five domains simultaneously.

Temporal Reasoning (92% vs 61% baseline): This domain tests whether a system can answer questions like 'What did Alice say about her job last month?' or 'When did the project deadline change?' These require not just finding relevant memories, but understanding which version of a fact is current. Hypermemory's temporal supersession engine tracks when facts are overwritten — marking old facts as historical and promoting new ones — enabling precise answers to 'when' and 'what changed' queries.

Open Domain (89% vs 58% baseline): General question answering across arbitrary topics stored in memory. Hypermemory handles this domain with semantic search over Qdrant vector embeddings, using query expansion to broaden narrow queries and filters to narrow over-broad ones.

Inferential (87% vs 54% baseline): This is the hardest domain — requiring the system to draw conclusions not explicitly stated in memory. If the agent knows 'Alice is a software engineer' and 'Alice works at Anthropic', it should infer that Alice is a software engineer at an AI company. Hypermemory's implicit extraction layer infers personality traits, professional relationships, and contextual facts from stored memories.
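The Alice example above can be expressed as a simple rule over stored facts. This is a deliberately minimal, rule-based sketch (the `AI_COMPANIES` lookup and `infer` function are hypothetical); a production extraction layer would generalize far beyond one hand-written rule:

```python
# Hypothetical lookup table for the sketch — not an exhaustive or official list.
AI_COMPANIES = {"Anthropic", "OpenAI", "DeepMind"}

def infer(facts: dict[str, str]) -> list[str]:
    """Derive conclusions not explicitly stated in the stored facts."""
    inferred = []
    role = facts.get("role")
    employer = facts.get("employer")
    # Combine two independent facts into a conclusion neither states alone.
    if role and employer and employer in AI_COMPANIES:
        inferred.append(f"{role} at an AI company")
    return inferred
```

Given `{"role": "software engineer", "employer": "Anthropic"}`, the rule yields "software engineer at an AI company" — a fact present in neither source memory.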

Single Hop (94% vs 67% baseline): Direct retrieval of a specific stored fact. Hypermemory's BM25 + semantic fusion ensures that whether the query uses exact keywords or paraphrased language, the right fact surfaces. The 94% score is the highest of any domain, reflecting the system's strong precision on targeted retrieval.
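One simple way to fuse a lexical signal with a semantic one is a linear blend, sketched below. The scoring functions and the `alpha` weight are illustrative assumptions, not Hypermemory's actual fusion formula — they just show why a query scores well whether it matches exact keywords or only the embedding:

```python
import math

def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear verbatim in the document (BM25 stand-in)."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def fused_score(query: str, doc: str,
                query_vec: list[float], doc_vec: list[float],
                alpha: float = 0.5) -> float:
    # Hypothetical linear blend: either signal alone can carry the match.
    return alpha * keyword_score(query, doc) + (1 - alpha) * cosine(query_vec, doc_vec)
```

An exact-keyword query scores through the lexical term; a paraphrase with no shared words still scores through the embedding term, so the target fact surfaces either way.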

Multi Hop (88% vs 52% baseline): The most complex retrieval pattern — finding facts that require following a chain of relationships. 'Who introduced Alice to the team?' requires linking the introduction event → the person who made the introduction → their relationship to Alice. Hypermemory's graph traversal engine follows these chains explicitly, rather than hoping semantic similarity will bridge the gap.
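Following a chain of relationships explicitly is a graph search. The sketch below uses breadth-first traversal over (source, relation, target) triples; the edge schema and `find_path` helper are hypothetical stand-ins for Hypermemory's traversal engine:

```python
from collections import deque

def find_path(edges: list[tuple[str, str, str]],
              start: str, goal: str, max_hops: int = 3):
    """BFS over a relationship graph; returns the chain of (relation, entity) hops."""
    graph: dict[str, list[tuple[str, str]]] = {}
    for src, rel, dst in edges:
        graph.setdefault(src, []).append((rel, dst))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # bound the chain length
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(rel, nxt)]))
    return None  # no chain links the two entities within the hop budget
```

For "Who introduced Alice to the team?", a two-hop chain from Alice through her introducer to the team is returned explicitly, rather than hoping the two endpoints happen to be semantically similar.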

The overall lesson from LoCoMo is that no single retrieval strategy dominates. The baseline systems, which rely on a single vector search, score in the 52–67% range. Hypermemory's five-strategy fusion with Reciprocal Rank Fusion (RRF) scoring consistently outperforms them by 27–36 percentage points across every domain.
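Reciprocal Rank Fusion itself is compact enough to sketch in full. Each strategy contributes `1/(k + rank)` for every result it returns, so a document ranked highly by several strategies beats one ranked first by a single strategy. The constant `k = 60` is the value commonly used in the RRF literature; the exact parameters of Hypermemory's fusion are not specified here:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists from multiple strategies via Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each strategy contributes 1/(k + rank); consensus across
            # strategies accumulates, so broadly-agreed results rise.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

With three strategies ranking `["a", "b", "c"]`, `["b", "a", "c"]`, and `["b", "c", "a"]`, result `b` wins the fused ranking: two first-place votes plus one second outweigh `a`'s single first place.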

Sofia

Hypermemory · Support