Benchmarks

MemHQ on LoCoMo — 61.0% overall, with per-category breakdown.

We evaluate MemHQ on LoCoMo, an academic long-conversation memory benchmark. The current published number is 61.0% overall on the 1,540 questions that remain after excluding category 5 (adversarial), evaluated with the standard gpt-4o-mini judge under the published methodology.

Headline

Metric            Value
Overall accuracy  61.0%
Dataset           LoCoMo10 (10 conversations, 1,540 non-cat-5 questions)
Judge             gpt-4o-mini, temperature 0
Answer prompt     Terse (≤5–6 words)

Per-category breakdown

Single-hop    ████████████████████████████████████████  75.2%
Temporal      ████████████████████████████████          56.1%
Multi-hop     ███████████████████████████               47.9%
Open-domain   ██████████████████████████████████        59.7%

Category         Score  Questions
1 — Single-hop   75.2%  282
2 — Temporal     56.1%  321
3 — Multi-hop    47.9%  96
4 — Open-domain  59.7%  841
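
As a consistency check, the headline figure can be recovered from the per-category table as a question-weighted average. The sketch below uses only the scores and question counts listed above. The per-category correct counts are not published here, so they are inferred by rounding score × questions, and the variable names are illustrative.

```python
# Per-category (score in %, question count), copied from the table above.
categories = {
    "single-hop": (75.2, 282),
    "temporal": (56.1, 321),
    "multi-hop": (47.9, 96),
    "open-domain": (59.7, 841),
}

total_questions = sum(n for _, n in categories.values())

# Assumption: correct counts are inferred by rounding, not taken from the run.
total_correct = sum(round(score / 100 * n) for score, n in categories.values())

overall = 100 * total_correct / total_questions

print(total_questions, total_correct)  # 1540 940
print(f"{overall:.1f}%")               # 61.0%
```

The inferred counts (212, 180, 46, and 502 correct) sum to 940 of 1,540, which matches the published 61.0%.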

What moved the number

Three changes landed in the current published stack:

  1. Synthesizer model: switched the answer-synthesis call to gpt-4o-mini, the same model as the judge, so answers are phrased the way the judge expects. This lifts all four categories, temporal in particular.
  2. Synthesizer prompt v2: added a rule that forces the synthesizer to compute concrete dates from cited memories rather than answering with relative phrases.
  3. Historical-mode retrieval: expanded the regex that auto-enables "include superseded memories" mode for past-tense and temporal-arc questions.
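
To make the third change concrete, here is a minimal sketch of how a regex can switch retrieval into "include superseded memories" mode. The production regex is not reproduced on this page; the pattern, the `HISTORICAL_PATTERN` name, and the `include_superseded` helper below are all hypothetical and only illustrate the idea.

```python
import re

# Hypothetical trigger pattern (NOT the production regex): past-tense
# auxiliaries, explicit "earlier state" phrases, and "how has X changed"
# temporal-arc questions.
HISTORICAL_PATTERN = re.compile(
    r"\b(?:used to|previously|formerly|originally|back then|over time)\b"
    r"|\b(?:was|were|did|had)\b"
    r"|\bhow (?:has|have)\b.*\bchanged\b",
    re.IGNORECASE,
)


def include_superseded(question: str) -> bool:
    """Return True when the question looks historical (illustrative heuristic)."""
    return HISTORICAL_PATTERN.search(question) is not None


print(include_superseded("What was Alice's job before she moved?"))  # True
print(include_superseded("How has Bob's diet changed over time?"))   # True
print(include_superseded("Where does Alice work now?"))              # False
```

A heuristic like this trades precision for recall: a false positive only widens the candidate set, while a false negative hides the superseded memory that a past-tense question needs.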

A separate write-path experiment (extractor-side date grounding) was run, did not move the number, and was reverted. The reconciler-side and read-side temporal handling were sufficient.
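
For readers unfamiliar with the read-side date handling, the snippet below illustrates what computing a concrete date from a cited memory means. In MemHQ this is done by the synthesizer model following a prompt rule, not by code like this; the phrase table and the `resolve_relative` helper are assumptions made purely for illustration.

```python
from datetime import date, timedelta

# Hypothetical day offsets for a few relative phrases (illustrative only).
RELATIVE_OFFSETS = {
    "today": 0,
    "yesterday": -1,
    "last week": -7,
}


def resolve_relative(phrase: str, memory_date: date) -> date:
    """Anchor a relative phrase on the date of the memory that contains it."""
    return memory_date + timedelta(days=RELATIVE_OFFSETS[phrase.lower()])


# A memory recorded on 8 May 2023 that says "yesterday" refers to 7 May 2023.
print(resolve_relative("yesterday", date(2023, 5, 8)).isoformat())  # 2023-05-07
```

An answer of "7 May 2023" can be checked directly against the gold answer, whereas "yesterday" cannot be checked without knowing when the memory was recorded.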

Methodology

We use the standard LoCoMo evaluation protocol:

Setting            Value
Judge model        gpt-4o-mini, temperature 0.0
Judge prompt       The standard "be generous, same topic = CORRECT"
Answer format      Terse, ≤5–6 words
Excluded category  Cat 5 (adversarial), per the original protocol
Concurrency        8
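
The settings above map onto an evaluation loop roughly like the following sketch. It is not the MemHQ harness: the judge is replaced by a stub that does case-insensitive exact matching (much stricter than the real "be generous" LLM judge) so the example runs offline, and the names `stub_judge`, `evaluate`, and the question fields are hypothetical.

```python
import asyncio

# Values taken from the settings table; the structure itself is an assumption.
JUDGE_SETTINGS = {"model": "gpt-4o-mini", "temperature": 0.0}
EXCLUDED_CATEGORIES = {5}  # adversarial questions are not scored
CONCURRENCY = 8


async def stub_judge(question: dict) -> bool:
    """Stand-in for the LLM judge call; exact match only, no network access."""
    await asyncio.sleep(0)
    return question["answer"].lower() == question["gold"].lower()


async def evaluate(questions: list, judge=stub_judge) -> float:
    """Score all non-excluded questions with at most CONCURRENCY judge calls in flight."""
    limit = asyncio.Semaphore(CONCURRENCY)
    scored = [q for q in questions if q["category"] not in EXCLUDED_CATEGORIES]

    async def judge_one(q: dict) -> bool:
        async with limit:
            return await judge(q)

    verdicts = await asyncio.gather(*(judge_one(q) for q in scored))
    return sum(verdicts) / len(scored)


sample = [
    {"category": 1, "answer": "Paris", "gold": "paris"},
    {"category": 2, "answer": "May 2023", "gold": "7 May 2023"},
    {"category": 5, "answer": "unknown", "gold": "n/a"},  # excluded from scoring
]
accuracy = asyncio.run(evaluate(sample))
print(accuracy)  # 0.5
```

The semaphore caps in-flight judge calls at 8, and the category-5 item is dropped before scoring, so the denominator counts only scored questions.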

No bench-only hacks. Every change in the published stack is a product improvement that ships with the API — there is no benchmark-mode configuration.

Reproducing

Members of the MemHQ team can reproduce the run in ~5 minutes against a pre-ingested snapshot, or in ~5 hours from a cold start. The full methodology, including every prompt, every flag, and the per-conversation results, lives in PERF-REPORT.md in the internal repo.

What we're working on

The largest remaining lever is open-domain (cat 4), which accounts for about 55% of the scored questions (841 of 1,540), so improvements there will move the headline number the most. Multi-hop (cat 3) is also a target: the underlying graph already supports the joins, but the retriever has not yet been tuned for chain-walking.
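
The leverage argument is simple arithmetic: because the headline is a question-weighted average, a one-point gain in a category moves the headline by that category's share of the questions. The sketch below computes those shares from the published question counts; the variable names are illustrative.

```python
# Question counts per category, from the per-category breakdown.
counts = {"single-hop": 282, "temporal": 321, "multi-hop": 96, "open-domain": 841}
total = sum(counts.values())

# Share of scored questions = headline points gained per +1 point in that category.
shares = {name: n / total for name, n in counts.items()}

for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:12s} {share:6.1%} of questions -> +{share:.2f} headline pts per +1 pt")
```

Open-domain carries a little over half of the weight. Multi-hop has the lowest score but only about 6% of the questions, so a gain there moves the headline far less than the same gain in open-domain.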