How we bench
The headline numbers and exactly how they were produced. Every input, every prompt, every output — checked into the repo so you can rerun them against your own workload.
Why we publish methodology
Memory benchmarks are easy to game — pick the right slice of LoCoMo, the right reranker, the right comparison baseline, and almost any number can be made to look good. We publish the full configuration so you can sanity-check whether the result applies to your workload.
LoCoMo
LoCoMo is a long-conversation memory benchmark: 10 multi-day dialogues with question-answer probes at every step. We run the full benchmark (N=1,540) using the standard published methodology with a gpt-4o-mini judge.
| Split | MemHQ score | Source |
|---|---|---|
| Overall | 61.0% | bench/output/locomo-full.json |
| Single-hop | 75.2% | bench/output/locomo-full.json |
| Temporal | 56.1% | bench/output/locomo-full.json |
| Multi-hop | 47.9% | bench/output/locomo-full.json |
| Open-domain | 59.7% | bench/output/locomo-full.json |
Reproducibility
The full run output is at bench/output/locomo-full.json in the repo, alongside the extractor, retriever, and synthesizer prompts that produced it. The complete methodology and configuration write-up is in PERF-REPORT.md.
LongMemEval-S (retrieval-only)
LongMemEval probes single-fact retrieval over long conversation history. We score retrieval only — no answer generation — because the answer step confounds retrieval quality with the synthesis model. R@K is the fraction of probes where the gold memory appears in the top K retrieved.
| Metric | MemHQ (hybrid) | Source |
|---|---|---|
| R@5 | 91.2% | bench/output/longmemeval-s.json |
| R@10 | 97.0% | bench/output/longmemeval-s.json |
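To make the R@K definition concrete, here is a minimal TypeScript sketch of a session-level recall computation. It is not the harness code: the `Probe` and `RetrievedMemory` shapes and their field names are hypothetical, and collapsing the ranked list to one entry per session before taking the top K is only one possible reading of "deduped per session". A probe counts as a hit when any retrieved memory in the top K comes from a gold session.

```typescript
// Hypothetical shapes for illustration; the real trace format may differ.
interface RetrievedMemory {
  memoryId: string;
  sessionId: string; // session the memory was extracted from
}

interface Probe {
  probeId: string;
  goldSessionIds: string[]; // session(s) that contain the answer
  ranked: RetrievedMemory[]; // retriever output, best match first
}

// Assumption: "deduped per session" means keeping only the best-ranked
// memory from each session, so one session cannot fill several top-K slots.
function dedupeBySession(ranked: RetrievedMemory[]): string[] {
  const seen = new Set<string>();
  const sessions: string[] = [];
  for (const memory of ranked) {
    if (!seen.has(memory.sessionId)) {
      seen.add(memory.sessionId);
      sessions.push(memory.sessionId);
    }
  }
  return sessions;
}

// Fraction of probes whose gold session appears among the top K sessions.
function recallAnyAtK(probes: Probe[], k: number): number {
  if (probes.length === 0) return 0;
  let hits = 0;
  for (const probe of probes) {
    const topK = dedupeBySession(probe.ranked).slice(0, k);
    if (probe.goldSessionIds.some((gold) => topK.includes(gold))) {
      hits += 1;
    }
  }
  return hits / probes.length;
}
```

Under this reading, a gold session ranked third by raw memory position can still count at K=2 if the two memories above it come from the same session. Check the actual run output before relying on that behavior.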
Method footnote
recall_any@K, n=500 probes, deduped per session. We treat the gold session as found if any memory extracted from it is in the top K. The full run output lives at bench/output/longmemeval-s.json.
What we do not bench (yet)
- End-to-end task accuracy across model classes. The synthesis model dominates the score; we hold it fixed at gemini-2.0-flash for a cost-comparable, apples-to-apples comparison.
- Multi-user contention. Single-user LoCoMo does not stress the RBAC + isolation paths that matter most for production deployments.
- Cold-start latency. All numbers are warm-cache. P95 on a freshly restarted API node is roughly 20% higher.
Reproducing the numbers
The bench harness lives at bench/ inside the MemHQ repo. Each subdirectory is one benchmark with its own README, prompts, and golden outputs. To rerun:
    cd bench
    pnpm install

    # LoCoMo, full benchmark
    pnpm run locomo -- --split=full --model=google/gemini-2.0-flash-001

    # LongMemEval-S, retrieval-only
    pnpm run longmemeval -- --variant=s --mode=retrieval-only
Outputs land under bench/output/<run-name>.json with a per-probe trace. The aggregator script bench/aggregate.ts turns those into the summary tables above.
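For readers who want to check a summary table against a per-probe trace by hand, the aggregation step can be sketched roughly as follows. This is not the contents of bench/aggregate.ts: the `ProbeResult` shape, its field names, and the choice to compute the overall score as a probe-weighted average are all assumptions made for illustration.

```typescript
// Hypothetical per-probe record; the real trace schema may differ.
interface ProbeResult {
  split: string; // e.g. "single-hop", "temporal"
  correct: boolean; // judge verdict for this probe
}

// Returns accuracy per split plus an "overall" entry.
// Assumption: "overall" is weighted by probe count, not a mean of split scores.
function summarize(results: ProbeResult[]): Record<string, number> {
  const totals: Record<string, { correct: number; n: number }> = {};

  const bump = (key: string, ok: boolean): void => {
    if (!totals[key]) totals[key] = { correct: 0, n: 0 };
    totals[key].n += 1;
    if (ok) totals[key].correct += 1;
  };

  for (const result of results) {
    bump(result.split, result.correct);
    bump("overall", result.correct);
  }

  const summary: Record<string, number> = {};
  for (const key of Object.keys(totals)) {
    summary[key] = totals[key].correct / totals[key].n;
  }
  return summary;
}
```

A probe-weighted overall generally differs from the unweighted mean of the split scores whenever the splits have different sizes, so compare against the trace rather than averaging the table rows.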