Benchmarks · methodology

How we bench

The headline numbers and exactly how they were produced. Every input, prompt, and output is checked into the repo so you can rerun the benchmarks against your own workload.

Why we publish methodology

Memory benchmarks are easy to game — pick the right slice of LoCoMo, the right reranker, the right comparison baseline, and almost any number can be made to look good. We publish the full configuration so you can sanity-check whether the result applies to your workload.

LoCoMo

LoCoMo is a long-conversation memory benchmark: 10 multi-day dialogues with question-answer probes at every step. We run the full benchmark (N=1,540) using the standard published methodology with a gpt-4o-mini judge.

Split         MemHQ score   Source
Overall       61.0%         bench/output/locomo-full.json
Single-hop    75.2%         bench/output/locomo-full.json
Temporal      56.1%         bench/output/locomo-full.json
Multi-hop     47.9%         bench/output/locomo-full.json
Open-domain   59.7%         bench/output/locomo-full.json

Reproducibility

MemHQ scores 61.0% on LoCoMo using the standard published methodology. The full per-conversation, per-probe trace lives at bench/output/locomo-full.json in the repo, alongside the extractor + retriever + synthesizer prompts that produced it. The complete methodology and configuration write-up is in PERF-REPORT.md.

LongMemEval-S (retrieval-only)

LongMemEval probes single-fact retrieval over long conversation history. We score retrieval only — no answer generation — because the answer step confounds retrieval quality with the synthesis model. R@K is the fraction of probes where the gold memory appears in the top K retrieved.

Metric   MemHQ (hybrid)   Source
R@5      91.2%            bench/output/longmemeval-s.json
R@10     97.0%            bench/output/longmemeval-s.json

Method footnote

Session-id granularity, recall_any@K, n=500 probes, deduped per session. We treat the gold session as found if any memory extracted from it is in the top K. The full run output lives at bench/output/longmemeval-s.json.
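The scoring rule in this footnote can be sketched in a few lines of TypeScript. This is an illustration, not the harness code: the `RetrievedMemory` and `Probe` shapes and both function names are hypothetical, and it assumes that "deduped per session" means collapsing the ranked memories to unique sessions before taking the top K.

```typescript
// Hypothetical shapes; the real trace schema in bench/output/ may differ.
interface RetrievedMemory {
  memoryId: string;
  sessionId: string; // session the memory was extracted from
}

interface Probe {
  goldSessionIds: string[]; // sessions that contain the answer
  ranked: RetrievedMemory[]; // retriever output, best first
}

// Collapse a ranked memory list to a ranked list of unique sessions,
// keeping each session at the rank of its best-ranked memory.
function dedupeBySession(ranked: RetrievedMemory[]): string[] {
  const seen = new Set<string>();
  const sessions: string[] = [];
  for (const m of ranked) {
    if (!seen.has(m.sessionId)) {
      seen.add(m.sessionId);
      sessions.push(m.sessionId);
    }
  }
  return sessions;
}

// recall_any@K: a probe counts as a hit if any gold session
// appears among the top K deduplicated sessions.
function recallAnyAtK(probes: Probe[], k: number): number {
  if (probes.length === 0) return 0;
  let hits = 0;
  for (const p of probes) {
    const topK = new Set(dedupeBySession(p.ranked).slice(0, k));
    if (p.goldSessionIds.some((id) => topK.has(id))) hits += 1;
  }
  return hits / probes.length;
}
```

Under this reading, two memories from the same session occupy one slot in the top K, so a session with many extracted memories cannot crowd out the others.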

What we do not bench (yet)

  • End-to-end task accuracy across model classes. The synthesis model dominates the score; we hold it fixed at gemini-2.0-flash for a cost-comparable, apples-to-apples comparison.
  • Multi-user contention. Single-user LoCoMo does not stress the RBAC + isolation paths that matter most for production deployments.
  • Cold-start latency. All numbers are warm-cache. P95 on a freshly restarted API node is roughly 20% higher.

Reproducing the numbers

The bench harness lives at bench/ inside the MemHQ repo. Each subdirectory is one benchmark with its own README, prompts, and golden outputs. To rerun:

cd bench
pnpm install

# LoCoMo, full benchmark
pnpm run locomo -- --split=full --model=google/gemini-2.0-flash-001

# LongMemEval-S, retrieval-only
pnpm run longmemeval -- --variant=s --mode=retrieval-only

Outputs land under bench/output/<run-name>.json with a per-probe trace. The aggregator script bench/aggregate.ts turns those into the summary tables above.
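For readers who want to check a summary table against a trace by hand, the aggregation step might look roughly like the sketch below. It is not the contents of bench/aggregate.ts: the `ProbeResult` fields, the `Summary` shape, and the `summarize` function are hypothetical, and the sketch assumes each trace entry carries a split label and a boolean judge verdict.

```typescript
// Hypothetical per-probe record; the real trace schema may differ.
interface ProbeResult {
  split: string; // e.g. "single-hop", "temporal"
  correct: boolean; // judge verdict for this probe
}

interface Summary {
  overall: number; // accuracy pooled over all probes
  bySplit: Map<string, number>; // accuracy per split
}

function summarize(results: ProbeResult[]): Summary {
  const tally = new Map<string, { correct: number; total: number }>();
  let correct = 0;
  for (const r of results) {
    const t = tally.get(r.split) ?? { correct: 0, total: 0 };
    t.total += 1;
    if (r.correct) {
      t.correct += 1;
      correct += 1;
    }
    tally.set(r.split, t);
  }
  const bySplit = new Map<string, number>();
  tally.forEach((t, split) => bySplit.set(split, t.correct / t.total));
  const overall = results.length === 0 ? 0 : correct / results.length;
  return { overall, bySplit };
}
```

In this sketch the overall score is pooled over every probe, so it is weighted by split size and will generally differ from the unweighted mean of the per-split scores.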