0) Your evaluation stack (don't skip this)
Harnesses: Prefer a standard runner so that decoding, few-shot prompting, and caching stay consistent:
- EleutherAI lm-evaluation-harness (CLI; wide task coverage; now with multimodal prototypes). (GitHub)
- OpenCompass (big task zoo, Chinese/English, good configs + leaderboards). (GitHub)
- HELM for multi-metric, multi-scenario, transparent reporting (accuracy, calibration, robustness, bias, efficiency…). (arXiv)
- Report the knobs: temperature/top-p, max tokens, stop sequences, few-shot k, with/without CoT, and random seed. For code, report pass@k and the sample budget (the HumanEval paper formalizes pass@k; see the sketch after this list). (arXiv)
- Bias control: If you rely on LLM-as-a-judge, guard against verbosity/position bias. MT-Bench/Arena papers discuss these issues and mitigations; AlpacaEval 2.0 adds length-controlled win-rates. (arXiv)
- Contamination & saturation: Prefer “hard” refreshes and closed or decontaminated splits (e.g., MMLU-Pro, MMLU-CF, BBEH). (arXiv)
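For reference, the pass@k number mentioned above is the unbiased estimator from the HumanEval paper: draw n samples per task, count the c that pass, and compute 1 - C(n-c, k)/C(n, k). A minimal sketch (the n/c/k values in the example are made up):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n-c, k) / C(n, k), with n samples per task and c of them passing."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples for one task, 37 pass the unit tests -> estimated pass@10
print(round(pass_at_k(n=200, c=37, k=10), 3))
```
Average the per-task estimates over the whole benchmark, and report n and k alongside the score.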
1) Breadth of knowledge & general reasoning
- MMLU — 57 subjects, multiple-choice; broad “book-smarts.” Use as a baseline, but it’s saturated for many families. (arXiv)
- MMLU-Pro — tougher, reasoning-heavier, 10 options per question; less prompt-sensitive than MMLU. Use this when MMLU tops out. (arXiv)
- BIG-bench (BB) — 200+ eclectic tasks; exploratory coverage, not a single score. (arXiv)
- BBH (BIG-bench Hard) — 23 tasks where older LMs underperformed humans; report with & without CoT. (GitHub)
- BBEH (Extra Hard) — 2025 refresh that replaces each BBH task with a harder analogue; great when BBH saturates. (arXiv)
Run tips: fix few-shot exemplars; avoid leaking rationales if you will also report “no-CoT” scores. Prefer accuracy (± CI) over single-run numbers.
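On "accuracy (± CI)": a minimal sketch of a normal-approximation (Wald) interval around a single accuracy number; the counts below are hypothetical, and for very small test sets a Wilson or bootstrap interval is the safer choice.

```python
import math

def accuracy_ci(correct: int, total: int, z: float = 1.96):
    """Accuracy with a 95% normal-approximation confidence half-width."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, half_width

acc, ci = accuracy_ci(correct=1312, total=1640)  # hypothetical run
print(f"accuracy = {acc:.3f} ± {ci:.3f}")
```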
2) Commonsense, science QA, and truthfulness
- HellaSwag — adversarial commonsense story/scene completion; good for catching shallow pattern-matching. (arXiv)
- ARC (Challenge set) — grade-school science that requires reasoning beyond lookup. Report ARC-Challenge accuracy. (arXiv)
- PIQA — physical commonsense (“which action works?”). (arXiv)
- TruthfulQA — measures tendency to mimic popular falsehoods; report “% truthful.” (arXiv)
Run tips: forbid external tools unless the benchmark allows it; for TruthfulQA, keep decoding conservative to reduce confabulation.
3) Math reasoning (from word problems to olympiad level)
- GSM8K — grade-school word problems; exact-match accuracy; sensitive to CoT prompting and sampling. (arXiv)
- MATH — 12.5K competition problems with solutions; much harder than GSM8K. (arXiv)
- Omni-MATH — olympiad-level, 4K+ problems across 30+ subdomains; includes rule-based eval hooks. Use when MATH saturates. (arXiv)
Run tips: normalize answers; strip spaces/LaTeX; for sampling-based methods, report n_samples and selection heuristic.
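A minimal sketch of the kind of answer normalization meant here; the exact rules (pulling the last \boxed{...}, dropping commas, and so on) are choices you should document alongside your scores.

```python
import re

def normalize_answer(ans: str) -> str:
    """Light-touch normalization before exact-match scoring (document your rules)."""
    ans = ans.strip()
    # Take the content of the final \boxed{...} if present (common in MATH-style outputs).
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", ans)
    if boxed:
        ans = boxed[-1]
    # Drop LaTeX math delimiters, spaces, and trailing punctuation.
    ans = ans.replace("$", "").replace(" ", "").rstrip(".")
    # Normalize simple formatting differences such as "7,000" vs "7000".
    ans = ans.replace(",", "")
    return ans

assert normalize_answer("$\\boxed{7,000}$.") == "7000"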
4) Code generation & real software work
- HumanEval — function synthesis from docstrings; pass@k with execution-based tests. Don’t report single-sample pass@1 only. (arXiv)
- MBPP — ~1K beginner-to-intermediate Python tasks; execution-based grading; a hand-verified subset exists. (arXiv)
- SWE-bench (+ variants) — apply patches to real GitHub repos/issues under Docker; success = tests pass. Use the full split or SWE-bench Verified for clean comparisons; read the recent analyses of leakage and weak tests. (arXiv)
Run tips: pin Python, OS, and dependency versions; cap tool-use/timeouts; log flaky tests.
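A minimal sketch of execution-based grading with a timeout; the helper name and return values are illustrative. It runs the candidate in a fresh interpreter but is not a sandbox; real harnesses (HumanEval's execution script, SWE-bench's Docker images) isolate untrusted code properly.

```python
import subprocess, sys, tempfile

def run_candidate(candidate_src: str, test_src: str, timeout_s: int = 10) -> str:
    """Run a candidate solution plus its unit tests in a fresh interpreter.
    Returns 'pass', 'fail', or 'timeout'. Illustrative only: use a container
    or sandbox for untrusted model output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_src + "\n" + test_src + "\n")
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return "pass" if proc.returncode == 0 else "fail"
    except subprocess.TimeoutExpired:
        return "timeout"

print(run_candidate("def add(a, b):\n    return a + b", "assert add(2, 2) == 4"))
```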
5) Chinese-language comprehensive suites
- C-Eval — 13,948 MCQs, 52 disciplines, four difficulty levels; includes a Hard subset. (arXiv)
- CMMLU — Chinese analogue of MMLU across 60+ subjects; many China-specific answers. (arXiv)
- AGIEval — standardized exams (Gaokao/SAT/LSAT/lawyer qualification); closer to human-task framing. (arXiv)
- Xiezhi — ever-updating, domain-knowledge breadth with specialty/interdiscipline subsets. (arXiv)
Run tips: lock Chinese tokenization & punctuation handling; watch for ambiguous region-specific conventions.
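A minimal sketch of locking down punctuation handling before scoring Chinese MCQ outputs: NFKC normalization plus an explicit map for marks that NFKC leaves alone. The mapping below is an illustrative choice, not a standard.

```python
import unicodedata

# NFKC already folds full-width ASCII variants (Ａ-Ｚ, ：, ，) to their half-width forms;
# this map covers common marks NFKC does not touch. Adjust per benchmark and document it.
PUNCT_MAP = str.maketrans({"。": ".", "、": ",", "「": '"', "」": '"'})

def normalize_zh(text: str) -> str:
    """NFKC-normalize and unify punctuation before comparing answers."""
    return unicodedata.normalize("NFKC", text).translate(PUNCT_MAP).strip()

print(normalize_zh("答案：Ｂ。"))  # -> 答案:B.
```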
6) Dialogue quality & human preference
- MT-Bench — curated multi-turn prompts; typically LLM-judged; paper details judge biases & mitigations. (arXiv)
- Chatbot Arena (Elo) — large-scale, pairwise human preference leaderboard; strongest external sanity check (see the Elo sketch after the run tips below). (arXiv)
- AlpacaEval 2.0 (length-controlled) — cheap, fast, and correlates highly with Arena once length bias is removed. (arXiv)
- Arena-Hard — hard prompts distilled from live Arena data; higher separation than MT-Bench. (arXiv)
Run tips: cap output length and/or use length-controlled metrics to avoid “longer = better.”
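The Arena leaderboard above is built from pairwise votes; a minimal Elo-update sketch shows the mechanic. The K-factor and starting ratings are conventional placeholders, and newer Arena reports fit Bradley-Terry models over the full vote set rather than running online Elo.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One pairwise update. score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if B wins.
    k=32 is a conventional choice, not Arena's exact setting."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two hypothetical models starting at 1000; A wins one battle.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # -> (1016.0, 984.0)
```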
7) Long-context understanding
- LongBench / LongBench v2 — bilingual, multi-task long-context suite; v2 pushes deeper reasoning and longer spans. (arXiv)
- L-Eval — standardized long-context datasets (3k–200k tokens) and guidance on better metrics (LLM-judge + length-instruction). (arXiv)
- RULER — synthetic but systematic “needle,” multi-hop tracing, and aggregation tasks to probe effective context length. (arXiv)
- Needle-in-a-Haystack (and multi-needle variants) — quick sanity checks for retrieval over long inputs. (LangChain Blog)
Run tips: don’t equate advertised window size with usable reasoning span; measure retrieval, multi-hop, and aggregation separately.
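A minimal needle-in-a-haystack sketch for the quick retrieval sanity check above; the needle, question, and filler text are placeholders.

```python
def build_haystack(needle: str, filler_sentence: str, total_sentences: int, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) inside filler text."""
    sentences = [filler_sentence] * total_sentences
    sentences.insert(int(depth * total_sentences), needle)
    return " ".join(sentences)

needle = "The secret code for the vault is 7421."      # placeholder fact
filler = "The weather report repeated itself again."   # placeholder filler
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_haystack(needle, filler, total_sentences=2000, depth=depth)
    # Send `prompt` + "What is the secret code for the vault?" to the model,
    # then check whether "7421" appears in the answer; sweep depth and context length.
    print(f"depth={depth}: {len(prompt):,} characters")
```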
8) Tool use, web navigation, and agents
- WebArena — self-hostable, fully functional websites; end-to-end task success is the KPI. Humans ~78%, strong LLM agents far lower. (arXiv)
- VisualWebArena — adds visually grounded web tasks (UIs/screenshots), exposing gaps for multimodal agents. (arXiv)
- AgentBench — 8 interactive environments testing decision-making as an agent. (arXiv)
- GAIA — “general assistant” tasks requiring browsing/tool use; big human–model gap highlights practical brittleness. (arXiv)
Run tips: log tool calls, retries, and partial progress; track both success rate and side-effects (wrong edits, unsafe actions).
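A minimal sketch of the kind of per-episode trace worth keeping; the field names and task id are illustrative, not any particular harness's schema.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ToolCall:
    tool: str
    args: dict
    ok: bool
    retries: int = 0

@dataclass
class EpisodeTrace:
    task_id: str
    calls: list = field(default_factory=list)
    success: bool = False
    side_effects: list = field(default_factory=list)  # e.g., "edited the wrong file"

    def log(self, call: ToolCall):
        self.calls.append(asdict(call))

trace = EpisodeTrace(task_id="webarena_0042")  # hypothetical task id
trace.log(ToolCall(tool="click", args={"element_id": "login"}, ok=True))
trace.log(ToolCall(tool="type", args={"text": "admin"}, ok=False, retries=2))
print(json.dumps(asdict(trace), indent=2))
```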
9) Retrieval-augmented generation (RAG) & factuality
- BEIR — 18 IR datasets for zero-shot retrieval generalization; use it to stress your retriever before RAG end-to-end. (arXiv)
- KILT — knowledge-intensive tasks grounded to a fixed Wikipedia snapshot; evaluates provenance and downstream performance together. (arXiv)
- RAGAS / ARES — automated RAG evaluation (faithfulness, context relevance, answer quality) without heavy human labels. (arXiv)
- FActScore — fine-grained factual precision for long-form generation via atomic fact checking. (arXiv)
Run tips: report retriever metrics (nDCG@k, Recall@k) and generator faithfulness; ablate chunking, k, reranking.
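A minimal sketch of the two retriever metrics named above, assuming binary relevance; the ranking and relevance labels are hypothetical.

```python
import math

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant documents found in the top-k retrieved list."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def ndcg_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal

# Hypothetical ranking for one query.
retrieved, relevant = ["d3", "d7", "d1", "d9"], {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=4), round(ndcg_at_k(retrieved, relevant, k=4), 3))
```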
10) Safety & jailbreak robustness
- RealToxicityPrompts — measures toxic degeneration under realistic prompts. (arXiv)
- SafetyBench — 11,435 MCQs across 7 safety categories; multiple-choice format makes comparisons easy. (arXiv)
- JailbreakBench — open, evolving suite of jailbreak artifacts + standardized evaluation and leaderboard. (arXiv)
Run tips: run both attack success and over-refusal checks; log defense configs (system prompts, filters) so results are reproducible.
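A minimal sketch that keeps the two headline numbers separate: attack success rate on adversarial prompts and over-refusal rate on benign ones. The boolean labels stand in for judge or human annotations.

```python
def safety_rates(attack_outcomes: list, benign_outcomes: list):
    """attack_outcomes: True where a jailbreak attempt elicited a harmful completion.
    benign_outcomes: True where a harmless request was refused (over-refusal)."""
    asr = sum(attack_outcomes) / len(attack_outcomes)
    over_refusal = sum(benign_outcomes) / len(benign_outcomes)
    return asr, over_refusal

# Hypothetical labels from a judge model or human review.
print(safety_rates([True, False, False, False], [False, False, True, False]))  # (0.25, 0.25)
```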
Practical recipes (copy/paste level)
A. “General assistant (EN/ZH), safe & sane”
- Breadth: MMLU-Pro, BBH/BBEH. (arXiv)
- Commonsense/truth: HellaSwag, ARC-Challenge, TruthfulQA. (arXiv)
- Chinese: C-Eval, CMMLU, AGIEval, Xiezhi. (NIPS conference paper)
- Long context: LongBench v2 + RULER. (arXiv)
- Safety: SafetyBench, JailbreakBench. (arXiv)
B. “Coding copilot that really fixes bugs”
- Synthesis: HumanEval, MBPP (execution-based). (arXiv)
- Real repos: SWE-bench (and variants); mind the leakage critiques when comparing papers. (arXiv)
C. “Math-heavy tutor / solver”
- Ladder: GSM8K → MATH → Omni-MATH; always show CoT vs direct. (arXiv)
Data hygiene & reporting checklist
- Decontamination: Prefer updated/closed splits like MMLU-CF; otherwise document dedup filters and pretraining sources. (arXiv)
- Judge methodology: If using LLM judges, cite model/version, rubric, and whether you used length-controlled scores. (arXiv)
- Multiple seeds: Especially for small test sets (e.g., HumanEval 164 items), report mean ± stderr. (arXiv)
- No cherry-picking: Pre-commit your config (task list, decoding, seeds), then run. HELM is a good pattern. (arXiv)
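A minimal sketch of pre-committing a config: freeze it as JSON, hash it, and record the hash before any runs. Every field below is illustrative.

```python
import hashlib, json

# Pre-registered eval config, frozen before the first run; fields are illustrative.
config = {
    "tasks": ["mmlu_pro", "bbh", "gsm8k", "humaneval"],
    "num_fewshot": {"mmlu_pro": 0, "bbh": 3, "gsm8k": 8, "humaneval": 0},
    "decoding": {"temperature": 0.0, "top_p": 1.0, "max_new_tokens": 1024},
    "seeds": [1234, 1235, 1236],
    "judge": None,  # or {"model": "...", "rubric": "...", "length_controlled": True}
}
blob = json.dumps(config, sort_keys=True).encode()
print("config sha256:", hashlib.sha256(blob).hexdigest()[:16])  # publish this with your report
```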
Minimal commands (illustrative; lm-eval-harness)
- MMLU-Pro (0-shot, greedy):
lm_eval --model <your_model> --tasks mmlu_pro --batch_size auto --num_fewshot 0 --gen_kwargs temperature=0
- HumanEval (pass@k via sampling; the per-task sample budget is set in the task config):
HF_ALLOW_CODE_EVAL=1 lm_eval --model <your_model> --tasks humaneval --num_fewshot 0 --gen_kwargs temperature=0.8,do_sample=True --confirm_run_unsafe_code
- LongBench (subset; task names depend on your harness version):
lm_eval --model <your_model> --tasks longbench_en,longbench_zh --batch_size 1 --num_fewshot 0