Benchmarks

We evaluate MemHQ on LoCoMo, an academic long-conversation memory benchmark. The current published number is 61.0% overall on the full 1,540-question set, evaluated with the standard gpt-4o-mini judge under the published methodology.

Headline

Metric	Value
Overall accuracy	61.0%
Dataset	LoCoMo10 (10 conversations, 1,540 non-cat-5 questions)
Judge	`gpt-4o-mini`, temperature 0
Answer prompt	Terse (≤5–6 words)

Per-category breakdown

Single-hop    ████████████████████████████████████████  75.2%
Temporal      ████████████████████████████████          56.1%
Multi-hop     ███████████████████████████               47.9%
Open-domain   ██████████████████████████████████        59.7%

Category	Score	Questions
1 — Single-hop	75.2%	282
2 — Temporal	56.1%	321
3 — Multi-hop	47.9%	96
4 — Open-domain	59.7%	841

What moved the number

Three changes that landed in the current published stack:

Synthesizer model — switched the answer-synthesis call to gpt-4o-mini, which speaks the same dialect as the judge and lifts all four categories (particularly temporal).
Synthesizer prompt v2 — added a rule that forces the synth to compute concrete dates from cited memories rather than answering with relative phrases.
Historical-mode retrieval — expanded the regex that auto-enables "include superseded memories" mode for past-tense and temporal-arc questions.

A separate write-path experiment (extractor-side date grounding) was run, did not move the number, and was reverted. The reconciler-side and read-side temporal handling were sufficient.

Methodology

We use the standard LoCoMo evaluation protocol:

Setting	Value
Judge model	`gpt-4o-mini`, temperature 0.0
Judge prompt	The standard "be generous, same topic = CORRECT"
Answer format	Terse, ≤5–6 words
Excluded category	Cat 5 (adversarial), per the original protocol
Concurrency	8

No bench-only hacks. Every change in the published stack is a product improvement that ships with the API — there is no benchmark-mode configuration.

Reproducing

Members of the MemHQ team can reproduce the run in ~5 minutes against a pre-ingested snapshot, or ~5 hours cold from scratch. The full methodology — every prompt, every flag, the per-conversation results — lives in PERF-REPORT.md in the internal repo.

What we're working on

The largest remaining lever is open-domain (cat 4), which is 55% of the dataset's volume. Improvements there will move the headline number the most. Multi-hop (cat 3) is also a target — the underlying graph already supports the joins; the retriever just hasn't been tuned for chain-walking yet.

Benchmarks

Benchmarks

Headline

Per-category breakdown

What moved the number

Methodology

Reproducing

What we're working on

On this page