Benchmarks
MemHQ on LoCoMo — 61.0% overall, with per-category breakdown.
We evaluate MemHQ on LoCoMo, an academic long-conversation memory
benchmark. The current published number is 61.0% overall on the
full 1,540-question set, evaluated with the standard gpt-4o-mini
judge under the published methodology.
Headline
| Metric | Value |
|---|---|
| Overall accuracy | 61.0% |
| Dataset | LoCoMo10 (10 conversations, 1,540 non-cat-5 questions) |
| Judge | gpt-4o-mini, temperature 0 |
| Answer prompt | Terse (≤5–6 words) |
Per-category breakdown
Single-hop ████████████████████████████████████████ 75.2%
Temporal ████████████████████████████████ 56.1%
Multi-hop ███████████████████████████ 47.9%
Open-domain ██████████████████████████████████ 59.7%
| Category | Score | Questions |
|---|---|---|
| 1 — Single-hop | 75.2% | 282 |
| 2 — Temporal | 56.1% | 321 |
| 3 — Multi-hop | 47.9% | 96 |
| 4 — Open-domain | 59.7% | 841 |
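As a sanity check, the headline figure can be recomputed from the table as a question-weighted mean of the per-category scores. The sketch below uses only the numbers in the table; the dictionary keys and the function name are illustrative and not part of MemHQ.

```python
# Per-category score (percent) and question count, copied from the table.
categories = {
    "single_hop": (75.2, 282),
    "temporal": (56.1, 321),
    "multi_hop": (47.9, 96),
    "open_domain": (59.7, 841),
}


def weighted_overall(cats):
    """Question-weighted mean of per-category accuracies, in percent."""
    total = sum(n for _, n in cats.values())
    return sum(score * n for score, n in cats.values()) / total


total_questions = sum(n for _, n in categories.values())
overall = weighted_overall(categories)
print(total_questions, round(overall, 2))
```

The counts sum to 1,540 and the weighted mean comes out at roughly 61.05%. The small gap to the published 61.0% is what one would expect from the per-category scores themselves being rounded to one decimal.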
What moved the number
Three changes landed in the current published stack:
- Synthesizer model: switched the answer-synthesis call to gpt-4o-mini, which speaks the same dialect as the judge and lifts all four categories (temporal in particular).
- Synthesizer prompt v2: added a rule that forces the synthesizer to compute concrete dates from cited memories rather than answering with relative phrases.
- Historical-mode retrieval: expanded the regex that auto-enables "include superseded memories" mode for past-tense and temporal-arc questions.
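To illustrate the historical-mode trigger, here is a minimal sketch of a regex gate that decides whether a question should also search superseded memories. The pattern, the cue words, and the function name `include_superseded` are assumptions made for illustration; the actual MemHQ regex is not reproduced here.

```python
import re

# Hypothetical cue list: past-tense auxiliaries and "temporal arc" phrases.
# The real MemHQ pattern is not published in this document.
HISTORICAL_HINTS = re.compile(
    r"\b(used to|previously|formerly|originally|before|at first|back then|"
    r"no longer|did|was|were|had)\b",
    re.IGNORECASE,
)


def include_superseded(question: str) -> bool:
    """Return True when the question looks past-tense or temporal-arc."""
    return HISTORICAL_HINTS.search(question) is not None


for q in ("Where did Alice live before the move?", "Where does Alice live now?"):
    print(q, "->", include_superseded(q))
```

A keyword gate like this is cheap and easy to audit, but it will misfire on some phrasings, which is presumably why the production pattern needed expanding.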
A separate write-path experiment (extractor-side date grounding) was run, did not move the number, and was reverted. The reconciler-side and read-side temporal handling were sufficient.
Methodology
We use the standard LoCoMo evaluation protocol:
| Setting | Value |
|---|---|
| Judge model | gpt-4o-mini, temperature 0.0 |
| Judge prompt | The standard "be generous, same topic = CORRECT" prompt |
| Answer format | Terse, ≤5–6 words |
| Excluded category | Cat 5 (adversarial), per the original protocol |
| Concurrency | 8 |
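The settings above can be sketched as a small evaluation loop: drop cat 5, then judge the remaining questions with at most 8 calls in flight. This is not the MemHQ harness. The `judge` function is a hypothetical stand-in that uses exact matching so the example runs offline, which is stricter than the generous gpt-4o-mini judge. The record fields (`category`, `gold`, `predicted`) are likewise assumed.

```python
import asyncio

CONCURRENCY = 8        # from the settings table
EXCLUDED_CATEGORY = 5  # adversarial questions are skipped


async def judge(question: dict) -> bool:
    """Hypothetical stand-in for the gpt-4o-mini judge call (temperature 0)."""
    await asyncio.sleep(0)  # placeholder for the network round trip
    return question["predicted"].strip().lower() == question["gold"].strip().lower()


async def evaluate(questions: list[dict]) -> float:
    """Return percent correct over non-excluded questions."""
    kept = [q for q in questions if q["category"] != EXCLUDED_CATEGORY]
    if not kept:
        return 0.0
    limiter = asyncio.Semaphore(CONCURRENCY)

    async def judged(q: dict) -> bool:
        # The semaphore caps the number of concurrent judge calls.
        async with limiter:
            return await judge(q)

    verdicts = await asyncio.gather(*(judged(q) for q in kept))
    return 100.0 * sum(verdicts) / len(kept)


sample = [
    {"category": 1, "gold": "Paris", "predicted": "paris"},
    {"category": 2, "gold": "May 2023", "predicted": "June 2023"},
    {"category": 5, "gold": "n/a", "predicted": "n/a"},
]
score = asyncio.run(evaluate(sample))
print(score)
```

In a real run, the body of `judge` would be replaced by the model call with the judge prompt. The filtering and concurrency limit would stay the same.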
No bench-only hacks. Every change in the published stack is a product improvement that ships with the API — there is no benchmark-mode configuration.
Reproducing
Members of the MemHQ team can reproduce the run in ~5 minutes against a
pre-ingested snapshot, or in ~5 hours from a cold start. The full
methodology (every prompt, every flag, and the per-conversation results)
lives in PERF-REPORT.md in the internal repo.
What we're working on
The largest remaining lever is open-domain (cat 4), which accounts for about 55% of the questions (841 of 1,540), so improvements there will move the headline number the most. Multi-hop (cat 3) is also a target: the underlying graph already supports the joins, but the retriever has not yet been tuned for chain-walking.
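The leverage argument is simple arithmetic: a gain in one category moves the headline in proportion to that category's share of the questions. The sketch below uses the question counts from the per-category table. The 5-point gain is a made-up input for illustration, not a projection.

```python
# Question counts per category, from the per-category breakdown.
counts = {"single_hop": 282, "temporal": 321, "multi_hop": 96, "open_domain": 841}
total = sum(counts.values())


def headline_gain(category: str, category_gain_pts: float) -> float:
    """Headline change (in points) from a gain in a single category."""
    return category_gain_pts * counts[category] / total


open_share = counts["open_domain"] / total
print(f"open-domain share: {open_share:.1%}")
for name in ("open_domain", "multi_hop"):
    print(name, round(headline_gain(name, 5.0), 2))
```

Under these counts, a hypothetical 5-point gain in open-domain would add about 2.7 points to the headline, while the same gain in multi-hop would add about 0.3.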