LLM Evaluation Benchmarks

0) Your evaluation stack (don’t skip this)

  • Harnesses: Prefer a standard runner so decoding, few-shot prompting, and caching are consistent:

    • EleutherAI lm-evaluation-harness (CLI; wide task coverage; now with multimodal prototypes). (GitHub)
    • OpenCompass (big task zoo, Chinese/English, good configs + leaderboards). (GitHub)
    • HELM for multi-metric, multi-scenario, transparent reporting (accuracy, calibration, robustness, bias, efficiency…). (arXiv)
  • Report the knobs: temperature/top-p, max tokens, stop sequences, few-shot k, with/without CoT, and random seed. For code, report pass@k and the sample budget (the HumanEval paper formalizes pass@k; see the estimator sketch after this list). (arXiv)
  • Bias control: If you rely on LLM-as-a-judge, guard against verbosity/position bias. MT-Bench/Arena papers discuss these issues and mitigations; AlpacaEval 2.0 adds length-controlled win-rates. (arXiv)
  • Contamination & saturation: Prefer “hard” refreshes and closed or decontaminated splits (e.g., MMLU-Pro, MMLU-CF, BBEH). (arXiv)
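
For reference, a minimal sketch of the unbiased pass@k estimator from the HumanEval paper, assuming you drew n samples per task and c of them passed the unit tests; the counts in the example are made up.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
        computed as a stable product to avoid huge binomials."""
        if n - c < k:
            return 1.0
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Hypothetical task: 200 samples drawn, 37 passed the tests.
    print(pass_at_k(n=200, c=37, k=1))    # ~0.185
    print(pass_at_k(n=200, c=37, k=10))   # much higher under a 10-sample budget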

1) Breadth of knowledge & general reasoning

  • MMLU — 57 subjects, multiple-choice; broad “book-smarts.” Use as a baseline, but it’s saturated for many families. (arXiv)
  • MMLU-Pro — tougher, reasoning-heavier, 10 options per question; less prompt-sensitive than MMLU. Use this when MMLU tops out. (arXiv)
  • BIG-bench (BB) — 200+ eclectic tasks; exploratory coverage, not a single score. (arXiv)
  • BBH (BIG-bench Hard) — 23 tasks where older LMs underperformed humans; report with & without CoT. (GitHub)
  • BBEH (Extra Hard) — 2025 refresh that replaces each BBH task with a harder analogue; great when BBH saturates. (arXiv)

Run tips: fix few-shot exemplars; avoid leaking rationales if you will also report “no-CoT” scores. Prefer accuracy (± CI) over single-run numbers.
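
As a concrete form of “accuracy (± CI)”, here is a minimal Wilson-interval sketch; the 187/250 counts are hypothetical.

    import math

    def wilson_ci(n_correct: int, n_total: int, z: float = 1.96):
        """95% Wilson score interval for a single accuracy estimate."""
        p = n_correct / n_total
        denom = 1 + z**2 / n_total
        center = (p + z**2 / (2 * n_total)) / denom
        half = z * math.sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
        return center - half, center + half

    # Hypothetical: 187/250 correct on one BBH subtask -> roughly (0.69, 0.80)
    print(wilson_ci(187, 250))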


2) Commonsense, science QA, and truthfulness

  • HellaSwag — adversarial commonsense story/scene completion; good for catching shallow pattern-matching. (arXiv)
  • ARC (Challenge set) — grade-school science that requires reasoning beyond lookup. Report ARC-Challenge accuracy. (arXiv)
  • PIQA — physical commonsense (“which action works?”). (arXiv)
  • TruthfulQA — measures tendency to mimic popular falsehoods; report “% truthful.” (arXiv)

Run tips: forbid external tools unless the benchmark allows it; for TruthfulQA, keep decoding conservative to reduce confabulation.


3) Math reasoning (from word problems to olympiad level)

  • GSM8K — grade-school word problems; exact-match accuracy; sensitive to CoT prompting and sampling. (arXiv)
  • MATH — 12.5K competition problems with solutions; much harder than GSM8K. (arXiv)
  • Omni-MATH — olympiad-level, 4K+ problems across 30+ subdomains; includes rule-based eval hooks. Use when MATH saturates. (arXiv)

Run tips: normalize answers; strip spaces/LaTeX; for sampling-based methods, report n_samples and selection heuristic.
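
A minimal answer-normalization sketch before exact match; the regexes are illustrative, not the official GSM8K/MATH graders.

    import re

    def normalize_answer(ans: str) -> str:
        """Light cleanup before exact match: unwrap boxed answers, drop LaTeX
        text wrappers, dollar signs, spaces, thousands separators, and a
        trailing period."""
        ans = ans.strip()
        ans = re.sub(r"\\boxed\{(.*)\}", r"\1", ans)           # \boxed{42} -> 42
        ans = re.sub(r"\\(?:text|mathrm)\{(.*?)\}", r"\1", ans)
        ans = ans.replace("$", "").replace(" ", "")
        ans = re.sub(r"(?<=\d),(?=\d{3})", "", ans)            # 1,234 -> 1234
        return ans.rstrip(".")

    assert normalize_answer(r"$\boxed{1,234}$.") == "1234"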


4) Code generation & real software work

  • HumanEval — function synthesis from docstrings; pass@k with execution-based tests. Don’t report single-sample pass@1 only. (arXiv)
  • MBPP — ~1K beginner-to-intermediate Python tasks; execution-based grading; a hand-verified subset exists. (arXiv)
  • SWE-bench (+ variants) — apply patches to real GitHub repos/issues under Docker; success = tests pass. Use SWE-bench or SWE-bench Verified for clean comparisons; read the recent analyses of leakage and weak tests. (arXiv)

Run tips: pin Python, OS, and dependency versions; cap tool-use/timeouts; log flaky tests.
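
A minimal sketch of execution-based grading (completion plus its unit tests run in a fresh interpreter with a timeout); real harnesses add Docker/resource isolation, and the task, timeout, and helper names here are illustrative.

    import subprocess, sys, tempfile

    def run_candidate(completion: str, test_code: str, timeout_s: int = 10) -> bool:
        """Pass == the combined file exits 0. No sandboxing here: only use
        this pattern inside an isolated environment (container/VM)."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(completion + "\n\n" + test_code + "\n")
            path = f.name
        try:
            proc = subprocess.run([sys.executable, path],
                                  capture_output=True, timeout=timeout_s)
            return proc.returncode == 0
        except subprocess.TimeoutExpired:
            return False   # hung candidates count as failures; log them as timeouts

    # Hypothetical HumanEval-style check:
    print(run_candidate("def add(a, b):\n    return a + b",
                        "assert add(2, 3) == 5"))   # True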


5) Chinese-language comprehensive suites

  • C-Eval — 13,948 MCQs, 52 disciplines, four difficulty levels; includes a Hard subset. (arXiv)
  • CMMLU — Chinese analogue of MMLU across 60+ subjects; many China-specific answers. (arXiv)
  • AGIEval — standardized exams (Gaokao/SAT/LSAT/lawyer qualification); closer to human-task framing. (arXiv)
  • Xiezhi — ever-updating, domain-knowledge breadth with specialty/interdiscipline subsets. (arXiv)

Run tips: lock Chinese tokenization & punctuation handling; watch for ambiguous region-specific conventions.
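
A minimal full-width/half-width normalization sketch for scoring Chinese MCQ outputs; the mapping is illustrative and should mirror whatever convention your gold labels use.

    # Map common full-width characters to ASCII before comparing predictions
    # with gold labels (apply the same normalization to both sides).
    FULLWIDTH = {"，": ",", "。": ".", "：": ":", "；": ";", "（": "(", "）": ")",
                 "Ａ": "A", "Ｂ": "B", "Ｃ": "C", "Ｄ": "D"}

    def normalize_zh(text: str) -> str:
        return "".join(FULLWIDTH.get(ch, ch) for ch in text).strip().upper()

    print(normalize_zh("答案：Ｂ"))   # "答案:B" -> then extract the option letter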


6) Dialogue quality & human preference

  • MT-Bench — curated multi-turn prompts; typically LLM-judged; the paper details judge biases & mitigations. (arXiv)
  • Chatbot Arena (Elo) — large-scale, pairwise human preference leaderboard; strongest external sanity check (a toy rating sketch follows after the run tips below). (arXiv)
  • AlpacaEval 2.0 (length-controlled) — cheap, fast, and correlates highly with Arena once length bias is removed. (arXiv)
  • Arena-Hard — hard prompts distilled from live Arena data; higher separation than MT-Bench. (arXiv)

Run tips: cap output length and/or use length-controlled metrics to avoid “longer = better.”
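
The toy rating sketch promised above: sequential Elo updates over pairwise preference votes. The real Chatbot Arena leaderboard fits a Bradley-Terry-style model over all votes, so treat this only as an illustration of the mechanics; the model names and K-factor are made up.

    from collections import defaultdict

    K = 4  # small, arbitrary update step for the toy example

    def expected(r_a: float, r_b: float) -> float:
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update(ratings, model_a, model_b, winner):
        """winner: 'a', 'b', or 'tie'."""
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        e_a = expected(ratings[model_a], ratings[model_b])
        ratings[model_a] += K * (score_a - e_a)
        ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))

    ratings = defaultdict(lambda: 1000.0)
    for a, b, w in [("model_x", "model_y", "a"), ("model_y", "model_z", "tie")]:
        update(ratings, a, b, w)
    print(dict(ratings))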


7) Long-context understanding

  • LongBench / LongBench v2 — bilingual, multi-task long-context suite; v2 pushes deeper reasoning and longer spans. (arXiv)
  • L-Eval — standardized long-context datasets (3k–200k tokens) and guidance on better metrics (LLM-judge + length-instruction). (arXiv)
  • RULER — synthetic but systematic “needle,” multi-hop tracing, and aggregation tasks to probe effective context length. (arXiv)
  • Needle-in-a-Haystack (and multi-needle variants) — quick sanity checks for retrieval over long inputs; a construction sketch follows after the run tips below. (LangChain Blog)

Run tips: don’t equate advertised window size with usable reasoning span; measure retrieval, multi-hop, and aggregation separately.
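
The construction sketch referenced above builds one single-needle probe; the filler sentence, needle, depth, and target length are all arbitrary.

    def build_haystack(needle: str, filler: str, depth: float, target_chars: int) -> str:
        """Place `needle` at a relative depth (0.0 = start, 1.0 = end)
        inside filler text padded to roughly target_chars characters."""
        pad = (filler + " ") * (target_chars // (len(filler) + 1) + 1)
        pad = pad[:target_chars]
        cut = int(len(pad) * depth)
        return pad[:cut] + "\n" + needle + "\n" + pad[cut:]

    needle = "The secret launch code is 7413."
    context = build_haystack(needle, "The quick brown fox jumps over the lazy dog.",
                             depth=0.35, target_chars=200_000)
    question = "What is the secret launch code? Answer with the number only."
    # Score exact-match on "7413", sweeping depth and target_chars to map the usable span.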


8) Tool use, web navigation, and agents

  • WebArena — self-hostable, fully functional websites; end-to-end task success is the KPI. Humans ~78%, strong LLM agents far lower. (arXiv)
  • VisualWebArena — adds visually grounded web tasks (UIs/screenshots), exposing gaps for multimodal agents. (arXiv)
  • AgentBench — 8 interactive environments testing decision-making as an agent. (arXiv)
  • GAIA — “general assistant” tasks requiring browsing/tool use; big human–model gap highlights practical brittleness. (arXiv)

Run tips: log tool calls, retries, and partial progress; track both success rate and side-effects (wrong edits, unsafe actions).
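
A minimal logging structure that keeps success rate and side-effects separable, per the tip above; all field names are illustrative.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ToolCall:
        name: str
        ok: bool
        retries: int = 0

    @dataclass
    class AgentRun:
        task_id: str
        success: bool                                   # end-to-end task success (the headline KPI)
        tool_calls: List[ToolCall] = field(default_factory=list)
        side_effects: List[str] = field(default_factory=list)  # e.g., unintended edits or actions

    def summarize(runs: List[AgentRun]) -> dict:
        n = max(len(runs), 1)
        return {
            "success_rate": sum(r.success for r in runs) / n,
            "side_effect_rate": sum(bool(r.side_effects) for r in runs) / n,
            "avg_tool_calls": sum(len(r.tool_calls) for r in runs) / n,
        }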


9) Retrieval-augmented generation (RAG) & factuality

  • BEIR — 18 IR datasets for zero-shot retrieval generalization; use it to stress your retriever before RAG end-to-end. (arXiv)
  • KILT — knowledge-intensive tasks grounded to a fixed Wikipedia snapshot; evaluates provenance and downstream performance together. (arXiv)
  • RAGAS / ARES — automated RAG evaluation (faithfulness, context relevance, answer quality) without heavy human labels. (arXiv)
  • FActScore — fine-grained factual precision for long-form generation via atomic fact checking. (arXiv)

Run tips: report retriever metrics (nDCG@k, Recall@k) and generator faithfulness; ablate chunking, k, reranking.
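
Minimal implementations of the two retriever metrics named above, assuming binary relevance labels; the passage IDs in the example are made up.

    import math
    from typing import List, Set

    def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
        hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
        return hits / max(len(relevant), 1)

    def ndcg_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
        """Binary-relevance nDCG@k: DCG of the ranked list over the ideal DCG."""
        dcg = sum(1.0 / math.log2(i + 2)
                  for i, doc_id in enumerate(retrieved[:k]) if doc_id in relevant)
        ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
        return dcg / ideal if ideal > 0 else 0.0

    ranked, gold = ["p3", "p1", "p9", "p7", "p2"], {"p1", "p7"}
    print(recall_at_k(ranked, gold, k=5))   # 1.0
    print(ndcg_at_k(ranked, gold, k=5))     # ~0.65 (gold passages ranked 2nd and 4th)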


10) Safety & jailbreak robustness

  • RealToxicityPrompts — measures toxic degeneration under realistic prompts. (arXiv)
  • SafetyBench — 11,435 MCQs across 7 safety categories; multiple-choice format makes comparisons easy. (arXiv)
  • JailbreakBench — open, evolving suite of jailbreak artifacts + standardized evaluation and leaderboard. (arXiv)

Run tips: run both attack success and over-refusal checks; log defense configs (system prompts, filters) so results are reproducible.
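
A minimal sketch for reporting attack success and over-refusal together, per the tip above; the record schema ('kind', 'complied') is an assumption, not any benchmark’s official format.

    from typing import Dict, List

    def safety_rates(results: List[Dict]) -> Dict[str, float]:
        """Each record: {'kind': 'attack' | 'benign', 'complied': bool}.
        For attacks, complied means the jailbreak succeeded; for benign
        prompts, a non-compliant answer counts as an over-refusal."""
        attacks = [r for r in results if r["kind"] == "attack"]
        benign = [r for r in results if r["kind"] == "benign"]
        return {
            "attack_success_rate": sum(r["complied"] for r in attacks) / max(len(attacks), 1),
            "over_refusal_rate": sum(not r["complied"] for r in benign) / max(len(benign), 1),
        }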


Practical recipes (copy/paste level)

A. “General assistant (EN/ZH), safe & sane”

  • Breadth: MMLU-Pro, BBH/BBEH. (arXiv)
  • Commonsense/truth: HellaSwag, ARC-Challenge, TruthfulQA. (arXiv)
  • Chinese: C-Eval, CMMLU, AGIEval, Xiezhi. (NeurIPS paper)
  • Long context: LongBench v2 + RULER. (arXiv)
  • Safety: SafetyBench, JailbreakBench. (arXiv)

B. “Coding copilot that really fixes bugs”

  • Synthesis: HumanEval, MBPP (execution-based). (arXiv)
  • Real repos: SWE-bench (or SWE-bench Verified); mind the leakage critiques when comparing papers. (arXiv)

C. “Math-heavy tutor / solver”

  • Ladder: GSM8K → MATH → Omni-MATH; always show CoT vs direct. (arXiv)

Data hygiene & reporting checklist

  • Decontamination: Prefer updated/closed splits like MMLU-CF; otherwise document dedup filters and pretraining sources (a minimal overlap-check sketch follows after this checklist). (arXiv)
  • Judge methodology: If using LLM judges, cite model/version, rubric, and whether you used length-controlled scores. (arXiv)
  • Multiple seeds: Especially for small test sets (e.g., HumanEval 164 items), report mean ± stderr. (arXiv)
  • No cherry-picking: Pre-commit your config (task list, decoding, seeds), then run. HELM is a good pattern. (arXiv)
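
The overlap-check sketch referenced above: a verbatim word n-gram intersection, the simplest form of dedup filter. The 13-gram window is an illustrative choice, and production pipelines run this against an indexed corpus (suffix arrays, Bloom filters) rather than pairwise.

    def ngram_set(text: str, n: int = 13):
        """Set of word n-grams; 13-grams are a common dedup granularity."""
        toks = text.lower().split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def contaminated(test_item: str, training_doc: str, n: int = 13) -> bool:
        """Flag a test item if any of its n-grams appears verbatim in a training doc."""
        return bool(ngram_set(test_item, n) & ngram_set(training_doc, n))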

Minimal commands (illustrative; lm-eval-harness)

  • MMLU-Pro (0-shot, greedy):
    lm_eval --model hf --model_args pretrained=<your_model> --tasks mmlu_pro --num_fewshot 0 --batch_size auto
  • HumanEval (pass@k via sampling; code execution must be explicitly enabled):
    HF_ALLOW_CODE_EVAL=1 lm_eval --model hf --model_args pretrained=<your_model> --tasks humaneval --num_fewshot 0 --gen_kwargs temperature=0.8,top_p=0.95 --confirm_run_unsafe_code
  • LongBench (subset):
    lm_eval --model hf --model_args pretrained=<your_model> --tasks <longbench_subtasks> --num_fewshot 0 --batch_size 1

Flags and task names change across harness versions; check `lm_eval --tasks list` before copying, and note that the number of samples per task for pass@k typically lives in the task config rather than on the CLI.


When the benchmarks disagree (they will)

  • A model that shines on MMLU-Pro may stumble on TruthfulQA (truthfulness ≠ knowledge). (arXiv)
  • Strong HumanEval doesn’t guarantee SWE-bench success (function synthesis ≠ repo surgery). (arXiv)
  • “1M token window” claims need RULER/LongBench v2 to verify effective reasoning length. (arXiv)
