0) Your evaluation stack (don't skip this)
Harnesses: Prefer a standard runner so that decoding, few-shot prompting, and caching stay consistent:
- EleutherAI lm-evaluation-harness (CLI; wide task coverage; now with multimodal prototypes). (GitHub)
- OpenCompass (big task zoo, Chinese/English, good configs + leaderboards). (GitHub)
- HELM for multi-metric, multi-scenario, transparent reporting (accuracy, calibration, robustness, bias, efficiency…). (arXiv)
- Report the knobs: temperature/top-p, max tokens, stop sequences, few-shot k, with/without CoT, and random seed. For code, report pass@k and the sample budget (the HumanEval paper formalizes pass@k; see the sketch after this list). (arXiv)
- Bias control: If you rely on LLM-as-a-judge, guard against verbosity/position bias. MT-Bench/Arena papers discuss these issues and mitigations; AlpacaEval 2.0 adds length-controlled win-rates. (arXiv)
- Contamination & saturation: Prefer “hard” refreshes and closed or decontaminated splits (e.g., MMLU-Pro, MMLU-CF, BBEH). (arXiv)
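For reference, the pass@k number mentioned above is the unbiased estimator from the HumanEval paper: draw n samples per task, count the c that pass, and compute 1 - C(n-c, k)/C(n, k). A minimal sketch (the n/c/k values in the example are made up):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n-c, k) / C(n, k), with n samples per task and c of them passing."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples for one task, 37 pass the unit tests -> estimated pass@10
print(round(pass_at_k(n=200, c=37, k=10), 3))
```
Average the per-task estimates over the whole benchmark, and report n and k alongside the score.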
1) Breadth of knowledge & general reasoning
- MMLU — 57 subjects, multiple-choice; broad “book-smarts.” Use as a baseline, but it’s saturated for many families. (arXiv)
- MMLU-Pro — tougher, reasoning-heavier, 10 options per question; less prompt-sensitive than MMLU. Use this when MMLU tops out. (arXiv)
- BIG-bench (BB) — 200+ eclectic tasks; exploratory coverage, not a single score. (arXiv)
- BBH (BIG-bench Hard) — 23 tasks where older LMs underperformed humans; report with & without CoT. (GitHub)
- BBEH (Extra Hard) — 2025 refresh that replaces each BBH task with a harder analogue; great when BBH saturates. (arXiv)
Run tips: fix few-shot exemplars; avoid leaking rationales if you will also report “no-CoT” scores. Prefer accuracy (± CI) over single-run numbers.
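On "accuracy (± CI)": a minimal sketch of a normal-approximation (Wald) interval around a single accuracy number; the counts below are hypothetical, and for very small test sets a Wilson or bootstrap interval is the safer choice.

```python
import math

def accuracy_ci(correct: int, total: int, z: float = 1.96):
    """Accuracy with a 95% normal-approximation confidence half-width."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, half_width

acc, ci = accuracy_ci(correct=1312, total=1640)  # hypothetical run
print(f"accuracy = {acc:.3f} ± {ci:.3f}")
```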
2) Commonsense, science QA, and truthfulness
- HellaSwag — adversarial commonsense story/scene completion; good for catching shallow pattern-matching. (arXiv)
- ARC (Challenge set) — grade-school science that requires reasoning beyond lookup. Report ARC-Challenge accuracy. (arXiv)
- PIQA — physical commonsense (“which action works?”). (arXiv)
- TruthfulQA — measures tendency to mimic popular falsehoods; report “% truthful.” (arXiv)
Run tips: forbid external tools unless the benchmark allows it; for TruthfulQA, keep decoding conservative to reduce confabulation.
3) Math reasoning (from word problems to olympiad level)
- GSM8K — grade-school word problems; exact-match accuracy; sensitive to CoT prompting and sampling. (arXiv)
- MATH — 12.5K competition problems with solutions; much harder than GSM8K. (arXiv)
- Omni-MATH — olympiad-level, 4K+ problems across 30+ subdomains; includes rule-based eval hooks. Use when MATH saturates. (arXiv)
Run tips: normalize answers; strip spaces/LaTeX; for sampling-based methods, report n_samples and selection heuristic.
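A minimal sketch of the kind of answer normalization meant here; the exact rules (pulling the last \boxed{...}, dropping commas, and so on) are choices you should document alongside your scores.

```python
import re

def normalize_answer(ans: str) -> str:
    """Light-touch normalization before exact-match scoring (document your rules)."""
    ans = ans.strip()
    # Take the content of the final \boxed{...} if present (common in MATH-style outputs).
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", ans)
    if boxed:
        ans = boxed[-1]
    # Drop LaTeX math delimiters, spaces, and trailing punctuation.
    ans = ans.replace("$", "").replace(" ", "").rstrip(".")
    # Normalize simple formatting differences such as "7,000" vs "7000".
    ans = ans.replace(",", "")
    return ans

assert normalize_answer("$\\boxed{7,000}$.") == "7000"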
4) Code generation & real software work
- HumanEval — function synthesis from docstrings; pass@k with execution-based tests. Don’t report single-sample pass@1 only. (arXiv)
- MBPP — ~1K beginner-to-intermediate Python tasks; execution-based grading; a hand-verified subset exists. (arXiv)
- SWE-bench (+ variants) — apply patches to real GitHub repos/issues under Docker; success = tests pass. Use the full split or SWE-bench Verified for clean comparisons; read the recent analyses of leakage and weak tests. (arXiv)
Run tips: pin Python, OS, and dependency versions; cap tool-use/timeouts; log flaky tests.
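A minimal sketch of execution-based grading with a timeout; the helper name and return values are illustrative. It runs the candidate in a fresh interpreter but is not a sandbox; real harnesses (HumanEval's execution script, SWE-bench's Docker images) isolate untrusted code properly.

```python
import subprocess, sys, tempfile

def run_candidate(candidate_src: str, test_src: str, timeout_s: int = 10) -> str:
    """Run a candidate solution plus its unit tests in a fresh interpreter.
    Returns 'pass', 'fail', or 'timeout'. Illustrative only: use a container
    or sandbox for untrusted model output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_src + "\n" + test_src + "\n")
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return "pass" if proc.returncode == 0 else "fail"
    except subprocess.TimeoutExpired:
        return "timeout"

print(run_candidate("def add(a, b):\n    return a + b", "assert add(2, 2) == 4"))
```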
5) Chinese-language comprehensive suites
- C-Eval — 13,948 MCQs, 52 disciplines, four difficulty levels; includes a Hard subset. (arXiv)
- CMMLU — Chinese analogue of MMLU across 60+ subjects; many China-specific answers. (arXiv)
- AGIEval — standardized exams (Gaokao/SAT/LSAT/lawyer qualification); closer to human-task framing. (arXiv)
- Xiezhi — ever-updating, domain-knowledge breadth with specialty/interdiscipline subsets. (arXiv)
Run tips: lock Chinese tokenization & punctuation handling; watch for ambiguous region-specific conventions.
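A minimal sketch of locking down punctuation handling before scoring Chinese MCQ outputs: NFKC normalization plus an explicit map for marks that NFKC leaves alone. The mapping below is an illustrative choice, not a standard.

```python
import unicodedata

# NFKC already folds full-width ASCII variants (Ａ-Ｚ, ：, ，) to their half-width forms;
# this map covers common marks NFKC does not touch. Adjust per benchmark and document it.
PUNCT_MAP = str.maketrans({"。": ".", "、": ",", "「": '"', "」": '"'})

def normalize_zh(text: str) -> str:
    """NFKC-normalize and unify punctuation before comparing answers."""
    return unicodedata.normalize("NFKC", text).translate(PUNCT_MAP).strip()

print(normalize_zh("答案：Ｂ。"))  # -> 答案:B.
```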
6) Dialogue quality & human preference
- MT-Bench — curated multi-turn prompts; typically LLM-judged; paper details judge biases & mitigations. (arXiv)
- Chatbot Arena (Elo) — large-scale, pairwise human preference leaderboard; strongest external sanity check (see the Elo sketch after the run tips below). (arXiv)
- AlpacaEval 2.0 (length-controlled) — cheap, fast, and correlates highly with Arena once length bias is removed. (arXiv)
- Arena-Hard — hard prompts distilled from live Arena data; higher separation than MT-Bench. (arXiv)
Run tips: cap output length and/or use length-controlled metrics to avoid “longer = better.”
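The Arena leaderboard above is built from pairwise votes; a minimal Elo-update sketch shows the mechanic. The K-factor and starting ratings are conventional placeholders, and newer Arena reports fit Bradley-Terry models over the full vote set rather than running online Elo.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One pairwise update. score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if B wins.
    k=32 is a conventional choice, not Arena's exact setting."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two hypothetical models starting at 1000; A wins one battle.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # -> (1016.0, 984.0)
```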
7) Long-context understanding
- LongBench / LongBench v2 — bilingual, multi-task long-context suite; v2 pushes deeper reasoning and longer spans. (arXiv)
- L-Eval — standardized long-context datasets (3k–200k tokens) and guidance on better metrics (LLM-judge + length-instruction). (arXiv)
- RULER — synthetic but systematic “needle,” multi-hop tracing, and aggregation tasks to probe effective context length. (arXiv)
- Needle-in-a-Haystack (and multi-needle variants) — quick sanity checks for retrieval over long inputs. (LangChain Blog)
Run tips: don’t equate advertised window size with usable reasoning span; measure retrieval, multi-hop, and aggregation separately.
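A minimal needle-in-a-haystack sketch for the quick retrieval sanity check above; the needle, question, and filler text are placeholders.

```python
def build_haystack(needle: str, filler_sentence: str, total_sentences: int, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) inside filler text."""
    sentences = [filler_sentence] * total_sentences
    sentences.insert(int(depth * total_sentences), needle)
    return " ".join(sentences)

needle = "The secret code for the vault is 7421."      # placeholder fact
filler = "The weather report repeated itself again."   # placeholder filler
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_haystack(needle, filler, total_sentences=2000, depth=depth)
    # Send `prompt` + "What is the secret code for the vault?" to the model,
    # then check whether "7421" appears in the answer; sweep depth and context length.
    print(f"depth={depth}: {len(prompt):,} characters")
```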
8) Tool use, web navigation, and agents
- WebArena — self-hostable, fully functional websites; end-to-end task success is the KPI. Humans ~78%, strong LLM agents far lower. (arXiv)
- VisualWebArena — adds visually grounded web tasks (UIs/screenshots), exposing gaps for multimodal agents. (arXiv)
- AgentBench — 8 interactive environments testing decision-making as an agent. (arXiv)
- GAIA — “general assistant” tasks requiring browsing/tool use; big human–model gap highlights practical brittleness. (arXiv)
Run tips: log tool calls, retries, and partial progress; track both success rate and side-effects (wrong edits, unsafe actions).
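A minimal sketch of the kind of per-episode trace worth keeping; the field names and task id are illustrative, not any particular harness's schema.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ToolCall:
    tool: str
    args: dict
    ok: bool
    retries: int = 0

@dataclass
class EpisodeTrace:
    task_id: str
    calls: list = field(default_factory=list)
    success: bool = False
    side_effects: list = field(default_factory=list)  # e.g., "edited the wrong file"

    def log(self, call: ToolCall):
        self.calls.append(asdict(call))

trace = EpisodeTrace(task_id="webarena_0042")  # hypothetical task id
trace.log(ToolCall(tool="click", args={"element_id": "login"}, ok=True))
trace.log(ToolCall(tool="type", args={"text": "admin"}, ok=False, retries=2))
print(json.dumps(asdict(trace), indent=2))
```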
9) Retrieval-augmented generation (RAG) & factuality
- BEIR — 18 IR datasets for zero-shot retrieval generalization; use it to stress your retriever before RAG end-to-end. (arXiv)
- KILT — knowledge-intensive tasks grounded to a fixed Wikipedia snapshot; evaluates provenance and downstream performance together. (arXiv)
- RAGAS / ARES — automated RAG evaluation (faithfulness, context relevance, answer quality) without heavy human labels. (arXiv)
- FActScore — fine-grained factual precision for long-form generation via atomic fact checking. (arXiv)
Run tips: report retriever metrics (nDCG@k, Recall@k) and generator faithfulness; ablate chunking, k, reranking.
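A minimal sketch of the two retriever metrics named above, assuming binary relevance; the ranking and relevance labels are hypothetical.

```python
import math

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant documents found in the top-k retrieved list."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def ndcg_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal

# Hypothetical ranking for one query.
retrieved, relevant = ["d3", "d7", "d1", "d9"], {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=4), round(ndcg_at_k(retrieved, relevant, k=4), 3))
```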
10) Safety & jailbreak robustness
- RealToxicityPrompts — measures toxic degeneration under realistic prompts. (arXiv)
- SafetyBench — 11,435 MCQs across 7 safety categories; multiple-choice format makes comparisons easy. (arXiv)
- JailbreakBench — open, evolving suite of jailbreak artifacts + standardized evaluation and leaderboard. (arXiv)
Run tips: run both attack success and over-refusal checks; log defense configs (system prompts, filters) so results are reproducible.
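A minimal sketch that keeps the two headline numbers separate: attack success rate on adversarial prompts and over-refusal rate on benign ones. The boolean labels stand in for judge or human annotations.

```python
def safety_rates(attack_outcomes: list, benign_outcomes: list):
    """attack_outcomes: True where a jailbreak attempt elicited a harmful completion.
    benign_outcomes: True where a harmless request was refused (over-refusal)."""
    asr = sum(attack_outcomes) / len(attack_outcomes)
    over_refusal = sum(benign_outcomes) / len(benign_outcomes)
    return asr, over_refusal

# Hypothetical labels from a judge model or human review.
print(safety_rates([True, False, False, False], [False, False, True, False]))  # (0.25, 0.25)
```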
Practical recipes (copy/paste level)
A. “General assistant (EN/ZH), safe & sane”
- Breadth: MMLU-Pro, BBH/BBEH. (arXiv)
- Commonsense/truth: HellaSwag, ARC-Challenge, TruthfulQA. (arXiv)
- Chinese: C-Eval, CMMLU, AGIEval, Xiezhi. (NIPS conference paper)
- Long context: LongBench v2 + RULER. (arXiv)
- Safety: SafetyBench, JailbreakBench. (arXiv)
B. “Coding copilot that really fixes bugs”
- Synthesis: HumanEval, MBPP (execution-based). (arXiv)
- Real repos: SWE-bench (and variants); mind the leakage critiques when comparing papers. (arXiv)
C. “Math-heavy tutor / solver”
- Ladder: GSM8K → MATH → Omni-MATH; always show CoT vs direct. (arXiv)
Data hygiene & reporting checklist
- Decontamination: Prefer updated/closed splits like MMLU-CF; otherwise document dedup filters and pretraining sources. (arXiv)
- Judge methodology: If using LLM judges, cite model/version, rubric, and whether you used length-controlled scores. (arXiv)
- Multiple seeds: Especially for small test sets (e.g., HumanEval 164 items), report mean ± stderr. (arXiv)
- No cherry-picking: Pre-commit your config (task list, decoding, seeds), then run. HELM is a good pattern. (arXiv)
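A minimal sketch of pre-committing a config: freeze it as JSON, hash it, and record the hash before any runs. Every field below is illustrative.

```python
import hashlib, json

# Pre-registered eval config, frozen before the first run; fields are illustrative.
config = {
    "tasks": ["mmlu_pro", "bbh", "gsm8k", "humaneval"],
    "num_fewshot": {"mmlu_pro": 0, "bbh": 3, "gsm8k": 8, "humaneval": 0},
    "decoding": {"temperature": 0.0, "top_p": 1.0, "max_new_tokens": 1024},
    "seeds": [1234, 1235, 1236],
    "judge": None,  # or {"model": "...", "rubric": "...", "length_controlled": True}
}
blob = json.dumps(config, sort_keys=True).encode()
print("config sha256:", hashlib.sha256(blob).hexdigest()[:16])  # publish this with your report
```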
Minimal commands (illustrative; lm-eval-harness)
- MMLU-Pro (0-shot, greedy):
lm_eval --model <your_model> --tasks mmlu_pro --batch_size auto --num_fewshot 0 --gen_kwargs temperature=0
- HumanEval (pass@k via sampling; the per-task sample budget is set in the task config):
HF_ALLOW_CODE_EVAL=1 lm_eval --model <your_model> --tasks humaneval --num_fewshot 0 --gen_kwargs temperature=0.8,do_sample=True --confirm_run_unsafe_code
- LongBench (subset; task names depend on your harness version):
lm_eval --model <your_model> --tasks longbench_en,longbench_zh --batch_size 1 --num_fewshot 0