Best way to track AI evals and benchmarks

Last reviewed: 2026-06-26 · Policy: Editorial standards · Methodology

Decision in 20 seconds

The best way to track AI evals in 2026 combines three layers: (1) primary benchmark leaderboards (LMSYS Chatbot Arena at chat.lmsys.org, Open LLM Leaderboard on HuggingFace) for normalized cross-model comparisons; (2) self-reported model cards for specific claims with source links; and (3) third-party verification via Papers With Code. Trusting a model's claimed MMLU or MATH-500 score without checking the underlying eval setup is risky: Berkeley researchers reported in early 2026 that flaws in mainstream benchmarks are systemic and that scores can be gamed. For current reference, Qwen3-235B (April 2026) reports MMLU 87.1 (see github.com/QwenLM/Qwen3), and DeepSeek-R1-0528 (May 2026) reports MATH-500 97.3% and AIME 2024 pass@1 72.6% on its HuggingFace model card (huggingface.co/deepseek-ai/DeepSeek-R1-0528). The practical workflow for builders: while you are actively evaluating models, check LMSYS Arena Elo ratings and the model cards for models in your stack weekly; check Papers With Code immediately whenever a new benchmark claim needs verification.

Use this page when

  • You need to decide between two models for a production integration and want unbiased performance comparison — use LMSYS Arena (category-filtered) + Open LLM Leaderboard together.
  • A model's benchmark claim seems unusually high and you want to verify if there's reproducible code behind it — use Papers With Code to find the paper and linked evaluation scripts.
  • You're tracking math/reasoning model releases and want to contextualize a new MATH-500 or AIME score (like DeepSeek-R1-0528's 97.3% on MATH-500) — check Papers With Code MATH leaderboard for historical context.
  • You want to understand capability trends across model generations, not just point-in-time scores — use Scale AI or Epoch AI quarterly reports.
  • You're evaluating coding model quality and distrust HumanEval scores — use EvalPlus for a harder, contamination-resistant assessment.

This page is not for

  • Real-time release alerts — leaderboards update asynchronously; use GitHub/HuggingFace feeds for release timing.
  • Evaluating closed-source models (GPT-4o, Claude 3.5) on Open LLM Leaderboard — it covers open models only; use LMSYS Arena for cross-provider comparisons.
  • Business/deployment decisions without domain testing — no public benchmark perfectly predicts performance on your specific use case; run evals on your own data as the final step.

Key points

  • LMSYS Chatbot Arena (chat.lmsys.org) ranks models by pairwise, blind human preference votes (Elo rating) across 1M+ battles. It is among the most manipulation-resistant evals because fine-tuning for a benchmark's format gives no advantage: users ask real questions and compare real outputs.
  • Open LLM Leaderboard on HuggingFace (huggingface.co/spaces/open-llm-leaderboard) evaluates open models on ARC, HellaSwag, MMLU, TruthfulQA, and an Extended suite including LiveCodeBench and IFEval added in 2026 to reduce benchmark gaming.
  • Papers With Code (paperswithcode.com) maintains state-of-the-art leaderboards for specific tasks (MATH, HumanEval, GPQA Diamond, AIME) and links each result to a paper and often to the evaluation code — the key resource for verifying 'can this score be reproduced?'
  • Berkeley researchers published findings in early 2026 showing systemic flaws in mainstream benchmarks — models can appear to score higher by being trained on benchmark-adjacent data or by exploiting benchmark format patterns. This makes independent verification via LMSYS Arena or EvalPlus essential.
  • EvalPlus (github.com/evalplus/evalplus) provides contamination-resistant coding evaluations (HumanEval+ and MBPP+) with larger test sets than the originals; DeepSeek-R1-0528's HumanEval-equivalent claim should be cross-checked against EvalPlus results.
  • Model cards and official repositories now routinely include eval configurations (the exact prompts, sampling parameters, and metric computation used). Qwen3's repository at github.com/QwenLM/Qwen3, for example, includes the full evaluation setup for MMLU 87.1 and other reported scores.
  • Scale AI and Epoch AI publish independent AI capability evaluation reports roughly quarterly; these are distinct from benchmark leaderboards and more useful for understanding cross-domain capability trends rather than point-in-time model comparisons.

What changed recently

  • May 2026: DeepSeek-R1-0528 published on HuggingFace with AIME 2024 pass@1 72.6% (up from 70.0% for the prior R1), MATH-500 97.3%, and GPQA Diamond 81.0%; eval results are verifiable on the model card at huggingface.co/deepseek-ai/DeepSeek-R1-0528.
  • April 2026: Qwen3 series (github.com/QwenLM/Qwen3) reports MMLU 87.1 for the 235B flagship; Qwen3-30B-A3B achieves MMLU 79.4, MATH-500 94.0, and HumanEval 92.1 using only 3B active parameters (MoE architecture).
  • April 2026: Open LLM Leaderboard extended with LiveCodeBench and IFEval tasks, making it harder for models to inflate scores through benchmark-specific fine-tuning.
  • Early 2026: Berkeley researchers published analysis showing that top-ranked models on several mainstream benchmarks showed signs of benchmark contamination or format exploitation — cited in RadarAI AI Briefing Issue #200 (April 13, 2026).
  • Ongoing: LMSYS Chatbot Arena Elo rankings update continuously as new models are entered; Gemini 2.0 Flash, GPT-4o-mini, and Qwen3 instruct variants have all been added in 2026.

Explanation

The fundamental tension in AI eval tracking is between convenience and rigor. Press releases and model cards give you a number quickly, but they reflect the best possible setup the publishing lab found — specific prompts, specific sampling temperature, potentially after iterating on the eval setup. Independent evaluations (LMSYS Arena, Open LLM Leaderboard) use standardized, lab-agnostic setups, which is why their numbers often differ from self-reported ones. The responsible approach is to use both: self-reported scores for initial filtering, independent evaluations for decision-making.
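As a rough illustration of using both kinds of score together, the sketch below triages a claim by comparing the self-reported number with an independently measured one. The `ScoreClaim` type, the `triage` function, and the 2-point tolerance are assumptions made for this example, not part of any leaderboard's tooling; choose a tolerance that suits the benchmark's noise.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScoreClaim:
    """One benchmark claim for one model (hypothetical structure)."""
    model: str
    benchmark: str
    self_reported: float            # score taken from the lab's model card
    independent: Optional[float]    # score from a standardized leaderboard, if one exists


def triage(claim: ScoreClaim, tolerance: float = 2.0) -> str:
    """Classify a claim by how far the self-reported score exceeds the independent one."""
    if claim.independent is None:
        return "unverified"         # nothing to compare against yet
    gap = claim.self_reported - claim.independent
    return "needs-review" if gap > tolerance else "consistent"


# Placeholder numbers for illustration only; they are not real leaderboard values.
print(triage(ScoreClaim("example-model", "MMLU", self_reported=87.1, independent=None)))
print(triage(ScoreClaim("example-model", "MMLU", self_reported=87.1, independent=80.0)))
```

A claim with no independent counterpart comes back as `unverified`, which is the cue to look for linked evaluation code before relying on it.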

LMSYS Chatbot Arena (chat.lmsys.org) is uniquely valuable because its evaluation mechanism is adversarial to gaming. Models are compared head-to-head on real user questions, and the user doesn't know which model is which until after voting. A model trained to perform well on specific benchmark formats gets no advantage here. As of 2026, the Arena has conducted over 1 million battles and shows clear Elo separation between model families. Its limitation: it measures human preference rather than verified correctness, so even the category-filtered ratings (coding, math) complement task-specific benchmarks rather than replace them.
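For readers unfamiliar with Elo, the textbook update rule can be sketched in a few lines. This is the classic formulation with an assumed K-factor of 32; the Arena's actual rating procedure may differ in its details, so treat this as an aid to intuition only.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one battle.

    outcome_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k (assumed 32 here) controls how far a single vote moves the ratings.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated models: the winner gains exactly what the loser gives up.
print(update(1200.0, 1200.0, outcome_a=1.0))
```

An upset moves ratings more than an expected result does, which is why a large number of blind votes is needed before the ranking between closely matched models settles.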

For technical benchmarks, Papers With Code is the most reliable resource because it anchors scores to papers and code. When DeepSeek-R1-0528 claims AIME 2024 pass@1 72.6%, you can verify this against the Papers With Code AIME leaderboard (paperswithcode.com/sota/math-word-problem-solving-on-aime-24) and check whether the evaluation code is published. If a score appears only in a press release with no linked code, it warrants skepticism.

Benchmark gaming (training on benchmark-adjacent data or tuning for benchmark format) has become a documented problem in 2026. Berkeley's research showed specific patterns: models that scored anomalously high on multiple-choice benchmarks while underperforming on open-ended generation. This is why LMSYS Arena and task-specific benchmarks like EvalPlus (whose HumanEval+ uses 80x more test cases than the original HumanEval) have become more important than aggregate scores like a single MMLU number.
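The principle behind larger test suites can be shown with a toy case. The sketch below is not EvalPlus's API; the function, the test lists, and `pass_rate` are all made up for illustration. A deliberately flawed solution passes a small base suite and is only exposed once more cases are added.

```python
def is_palindrome_overfit(s: str) -> bool:
    """Deliberately flawed: only compares the first and last characters."""
    return len(s) < 2 or s[0] == s[-1]


# A small base suite that the flawed solution happens to satisfy.
BASE_TESTS = [("aba", True), ("ab", False), ("", True)]

# Extra cases probe the interior of the string, which the flawed solution ignores.
EXTENDED_TESTS = BASE_TESTS + [("abca", False), ("abcba", True), ("xyzx", False)]


def pass_rate(fn, tests) -> float:
    """Fraction of (input, expected) pairs for which fn returns the expected value."""
    passed = sum(fn(arg) == expected for arg, expected in tests)
    return passed / len(tests)


print(pass_rate(is_palindrome_overfit, BASE_TESTS))      # perfect on the base suite
print(pass_rate(is_palindrome_overfit, EXTENDED_TESTS))  # drops once interior cases are added
```

The same effect at benchmark scale is what separates a model that has learned the task from one that has learned the original test cases.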

For practical builder decision-making, the relevant eval questions are usually domain-specific: 'How does Model X perform on code generation tasks similar to what I'm building?' rather than 'What is Model X's MMLU score?' Domain-specific benchmarks (SWE-Bench for software engineering, GPQA Diamond for expert-level reasoning, LiveCodeBench for realistic coding) map more directly to real deployment performance than general benchmarks. LMSYS Arena lets you filter by category (coding, math, etc.) for domain-specific Elo ratings.
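A domain-specific check does not need heavy infrastructure. The sketch below scores any model on your own prompt/answer pairs by exact match. `model_fn` is a hypothetical stand-in for whatever client call you use, and exact match is an assumed metric that suits short, unambiguous answers only.

```python
from typing import Callable, Iterable, Tuple


def exact_match_accuracy(
    model_fn: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
) -> float:
    """Share of (prompt, expected) pairs where the model's output matches exactly.

    Leading and trailing whitespace is ignored on both sides.
    """
    pairs = list(examples)
    if not pairs:
        raise ValueError("need at least one example")
    correct = sum(model_fn(prompt).strip() == expected.strip() for prompt, expected in pairs)
    return correct / len(pairs)


# Stub model for demonstration; replace with a real API or local inference call.
def stub_model(prompt: str) -> str:
    return "4" if prompt == "2+2" else "unknown"


print(exact_match_accuracy(stub_model, [("2+2", "4"), ("3+3", "6")]))
```

Running the same examples against each candidate model gives a like-for-like number on your own data, which public leaderboards cannot provide.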

AI Eval Source Selection: What to Use for What Decision

Different evaluation sources answer different questions. Match the source to the decision you're making.
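The matching described in the use cases earlier on this page can be written down as a simple lookup. The decision keys and the `sources_for` helper below are names invented for this sketch; the source recommendations themselves are the ones given on this page.

```python
# Decision type -> recommended sources, as listed under "Use this page when".
SOURCE_FOR_DECISION = {
    "compare_two_models_for_production": [
        "LMSYS Chatbot Arena (category-filtered)",
        "Open LLM Leaderboard",
    ],
    "verify_suspicious_benchmark_claim": ["Papers With Code"],
    "contextualize_math_reasoning_score": ["Papers With Code MATH leaderboard"],
    "understand_capability_trends": ["Scale AI reports", "Epoch AI reports"],
    "assess_coding_quality": ["EvalPlus"],
}


def sources_for(decision: str) -> list:
    """Return the recommended sources for a decision type, or raise if it is unknown."""
    try:
        return SOURCE_FOR_DECISION[decision]
    except KeyError:
        known = ", ".join(sorted(SOURCE_FOR_DECISION))
        raise ValueError(f"unknown decision {decision!r}; expected one of: {known}") from None


print(sources_for("assess_coding_quality"))
```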

How to verify the answer

These are the canonical sources for AI evaluation tracking.

Tools / Examples

  • LMSYS Chatbot Arena (chat.lmsys.org): human preference ranking via pairwise blind comparison. 1M+ battles (as of 2026), updated continuously as new models are added. Category-filtered Elo (coding, math, instruction following) is available. Best source for gaming-resistant quality assessment. Qwen3, GPT-4o, Claude 3.5, and Gemini 2.0 Flash are all tracked.
  • Open LLM Leaderboard (huggingface.co/spaces/open-llm-leaderboard): HuggingFace community leaderboard for open models. Evaluates on ARC, HellaSwag, MMLU, and TruthfulQA, plus an Extended suite (LiveCodeBench and IFEval, added 2026). Scores are computed with standardized prompts, not self-reported. Covers only models with openly released weights.
  • Papers With Code, MATH leaderboard (paperswithcode.com/sota/math-word-problem-solving-on-math): tracks MATH benchmark SOTA with code links. DeepSeek-R1-0528's MATH-500 97.3% can be contextualized here against historical scores (GPT-4o: ~76.6%, Claude 3.5 Sonnet: ~78.3% on MATH). The leaderboard updates when researchers submit results.
  • Papers With Code, AIME leaderboard (paperswithcode.com/sota/math-word-problem-solving-on-aime-24): tracks AIME 2024 pass@1 results. DeepSeek-R1-0528: 72.6%. Useful context: AIME is an invitational mathematics competition for high school students; 72.6% pass@1 is significantly above the human average and among the highest published scores as of May 2026.
  • EvalPlus (evalplus.github.io): contamination-resistant coding evaluation (HumanEval+ and MBPP+). HumanEval+ uses 80x more test cases than the original HumanEval to surface models that overfit to the standard test cases. DeepSeek-R1-0528 and Qwen3 instruct variants both have entries. Essential for verifying coding claims.
  • GPQA Diamond (paperswithcode.com/dataset/gpqa): expert-level multiple-choice questions in biology, chemistry, and physics that PhD-level researchers find challenging (human expert accuracy ~69%). DeepSeek-R1-0528 reports 81.0%. A strong proxy for deep expert reasoning, and harder to inflate than MMLU.
  • QwenLM Qwen3 Model Card (github.com/QwenLM/Qwen3): Alibaba's primary publication point for Qwen3 evals. Reports MMLU 87.1 (235B), MATH-500 94.0 (30B-A3B), and HumanEval 92.1 (30B-A3B). Includes the full evaluation configuration (which prompts, which sampling settings), allowing reproduction.
  • DeepSeek-R1-0528 Model Card (huggingface.co/deepseek-ai/DeepSeek-R1-0528): primary source for DeepSeek-R1-0528 benchmark results. AIME 2024 pass@1: 72.6%; MATH-500: 97.3%; GPQA Diamond: 81.0%. The model card includes usage instructions and full benchmark context. Published May 2026.
  • Scale AI Eval Reports (scale.com/research): Scale AI publishes independent capability evaluations, including domain-specific evals, roughly quarterly. Coverage includes coding, math, long-context, and safety benchmarks. Independent of any specific lab, making it useful for cross-lab capability comparisons.
  • Epoch AI (epochai.org): research organization tracking AI progress via compute trends, training data, and benchmark performance over time. Their 'AI and Compute' and 'Milestones' databases provide longitudinal context for understanding whether a new benchmark score represents genuine progress or incremental improvement.
  • RadarAI AI Briefings, eval context (radarai.top/en/updates): daily AI briefings include benchmark context when major model releases occur. Issue #200 (April 13, 2026) covered Berkeley's benchmark gaming findings; release briefings for Qwen3 and DeepSeek-R1-0528 included MMLU/MATH-500/AIME context with source links.
  • LiveCodeBench (livecodebench.github.io): real-world coding benchmark using problems from competitive programming contests (LeetCode, Codeforces, AtCoder) collected after popular model training cutoffs. Specifically designed to resist contamination. Added to the Open LLM Leaderboard Extended suite in 2026.

Evidence timeline

  • LMSYS Chatbot Arena: human preference Elo ranking, 1M+ battles. Updated continuously. Qwen3, GPT-4o, Claude 3.5, and Gemini 2.0 Flash tracked. Category-filtered Elo available for coding, math, and instruction following.
  • Open LLM Leaderboard: Extended benchmark suite added April 2026 (LiveCodeBench, IFEval). Standardized eval setup for open models: ARC, HellaSwag, MMLU, and TruthfulQA plus Extended tasks.
  • Papers With Code (AIME leaderboard): AIME 2024 pass@1 leaderboard. DeepSeek-R1-0528: 72.6% (May 2026), among the highest published scores. Scores linked to papers and eval code.
  • DeepSeek-R1-0528 Model Card: primary source for DeepSeek-R1-0528 evals: AIME 2024 72.6%, MATH-500 97.3%, GPQA Diamond 81.0%. Published May 2026 with full eval configuration.
  • QwenLM Qwen3 Model Card: primary source for Qwen3 benchmark results: Qwen3-235B MMLU 87.1, Qwen3-30B-A3B MMLU 79.4 / MATH-500 94.0 / HumanEval 92.1. Apache 2.0 license. Full eval config included.
  • EvalPlus: contamination-resistant coding eval. HumanEval+: 80x more test cases than standard HumanEval. MBPP+: enhanced MBPP. Specifically designed to surface models that overfit to standard coding benchmarks.
  • Papers With Code: task-specific SOTA leaderboards with code links. Tracks MATH, MMLU, HumanEval, GPQA Diamond, LiveCodeBench, and SWE-Bench. Essential for verifying whether benchmark claims have reproducible evaluation code.
  • Scale AI Eval Reports: independent AI capability evaluation reports. Quarterly cadence. Covers coding, math, long-context, and safety benchmarks. Used for cross-lab trend analysis independent of individual labs' model cards.
  • Epoch AI: AI progress tracking via compute trends and benchmark data. Longitudinal context for whether new benchmark scores represent genuine progress.
  • RadarAI AI Briefing Issue #200 (April 13, 2026): Berkeley research on benchmark gaming covered: 'Berkeley research exposes systemic flaws in mainstream AI benchmarks—models can game scores.' Source: RadarAI daily briefing.
  • LiveCodeBench: real-world coding benchmark using competition problems collected after model training cutoffs. Added to the Open LLM Leaderboard Extended suite in 2026 to resist contamination.
  • GPQA Diamond: expert-level science Q&A benchmark. Human expert accuracy: ~69%. DeepSeek-R1-0528: 81.0%. Strong proxy for genuine expert-level reasoning capability.

FAQ

Can I trust self-reported MMLU or MATH-500 scores from model cards?

As a starting point, yes. As a final answer, no. Self-reported scores reflect the lab's chosen evaluation setup (prompts, sampling temperature, shot count) which may not match standardized setups on leaderboards. Qwen3-235B's MMLU 87.1 and DeepSeek-R1-0528's MATH-500 97.3% are both plausible based on independent evaluations, but always cross-check against Papers With Code leaderboard scores or Open LLM Leaderboard when making technology decisions.

What's the difference between MATH and MATH-500?

MATH is the full 12,500-problem benchmark (Hendrycks et al. 2021) covering competition math. MATH-500 is a curated 500-problem subset designed to be representative and faster to evaluate. Most recent model cards (Qwen3, DeepSeek-R1) report MATH-500 scores because running the full MATH evaluation is costly. The scores are directionally comparable but not directly interchangeable. Papers With Code tracks both.
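One reason the two numbers are not interchangeable is sampling noise: a 500-problem subset carries a wider error bar than a larger test set. The sketch below estimates that error bar with a simple binomial model, assuming problems are independent draws (a simplification). The 5,000 figure used for comparison is the size of MATH's held-out test split.

```python
import math


def binomial_standard_error(accuracy: float, n_problems: int) -> float:
    """Standard error of an accuracy estimate, treating each problem as an independent trial."""
    if not 0.0 <= accuracy <= 1.0 or n_problems <= 0:
        raise ValueError("accuracy must be in [0, 1] and n_problems positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


def confidence_interval_95(accuracy: float, n_problems: int):
    """Normal-approximation 95% interval, clipped to [0, 1]."""
    half_width = 1.96 * binomial_standard_error(accuracy, n_problems)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


# Same headline accuracy, different test-set sizes.
for n in (500, 5000):
    se = binomial_standard_error(0.973, n)
    print(n, round(se * 100, 2), confidence_interval_95(0.973, n))
```

Under these assumptions, a 97.3% score on 500 problems has a standard error of roughly 0.7 percentage points, so differences of a point or less between models on MATH-500 are within noise.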

What does AIME 2024 pass@1 mean, and is 72.6% impressive?

AIME (American Invitational Mathematics Examination) 2024 pass@1 measures the probability of correctly solving each problem in the 2024 exam on the first attempt. Average human performance is around 10–20%; AMC/AIME top-tier students score 40–60%. DeepSeek-R1-0528's 72.6% is among the highest published scores as of May 2026 and represents genuine mathematical reasoning capability, not just pattern matching. Verify at paperswithcode.com/sota/math-word-problem-solving-on-aime-24.
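To make the metric concrete, the sketch below computes pass@k with the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that were correct. For k = 1 this reduces to c / n. Whether a given model card uses exactly this estimator and how many samples it draws are details to confirm in that card's eval configuration; the per-problem counts below are made up.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k slots, so at least one is correct
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(per_problem, k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    per_problem = list(per_problem)
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)


# Hypothetical results: 3 problems, 10 samples each, with 10, 5, and 0 correct.
print(benchmark_pass_at_k([(10, 10), (10, 5), (10, 0)], k=1))
```

Because pass@1 is an average over samples, it is stricter than "solved at least once in many tries", which is why pass@1 and pass@k figures should never be compared directly.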

Is LMSYS Arena reliable given the volume of battles?

Yes, with caveats. The large sample size (1M+ battles) makes Elo ratings statistically robust for top models. The main caveat: users skew toward English-speaking, tech-literate people, so results may not represent business or non-English use cases. For general instruction-following quality, it is the hardest of the public evals to game. For coding or math specifically, filter to the relevant category or use domain-specific benchmarks.

How do I know if a model's benchmark score is due to benchmark contamination?

Three checks: (1) Does the model score significantly higher on benchmarks than on human preference evaluations (LMSYS Arena)? Contamination inflates benchmark scores more than preference. (2) Check EvalPlus for coding claims — it uses harder test sets specifically to resist contamination. (3) Look for Berkeley or independent analysis; Papers With Code sometimes flags contested results. No method is foolproof, but these three together catch most gaming patterns.
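The first check can be roughed out as a rank comparison: a model that ranks much higher on a static benchmark than on human preference deserves a closer look. The function name, the `min_gap` threshold, and all numbers below are assumptions for illustration; a flag is a prompt for investigation, not proof of contamination.

```python
def rank_gap_flags(benchmark_scores: dict, arena_ratings: dict, min_gap: int = 2) -> list:
    """Return models whose benchmark rank beats their preference rank by at least min_gap places."""
    shared = benchmark_scores.keys() & arena_ratings.keys()

    def ranks(scores: dict) -> dict:
        # Rank 1 is best; only models present in both sources are ranked.
        ordered = sorted(shared, key=lambda m: scores[m], reverse=True)
        return {model: position + 1 for position, model in enumerate(ordered)}

    bench_rank = ranks(benchmark_scores)
    arena_rank = ranks(arena_ratings)
    return sorted(m for m in shared if arena_rank[m] - bench_rank[m] >= min_gap)


# Placeholder data: model "a" tops the benchmark but is last on preference.
benchmark = {"a": 90.0, "b": 85.0, "c": 80.0}
arena = {"a": 1100.0, "b": 1300.0, "c": 1200.0}
print(rank_gap_flags(benchmark, arena))
```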

How often should I check AI evaluation leaderboards?

Monthly check of LMSYS Arena top rankings is sufficient to know whether a new model has displaced your current stack. Weekly check if you're actively evaluating model swap decisions. Open LLM Leaderboard updates with new model submissions; check when specific models you're evaluating are newly listed. Papers With Code SOTA pages update when new papers submit — subscribe to specific task RSS feeds for automated alerting.

What benchmarks matter most for code generation models?

For code generation: HumanEval+ (EvalPlus, harder test cases), MBPP+ (EvalPlus), LiveCodeBench (competition problems post-training-cutoff), and SWE-Bench (real GitHub issues). SWE-Bench is the most realistic for software engineering tasks. Qwen3-30B-A3B reports HumanEval 92.1%; DeepSeek-R1-0528 has strong coding benchmark results as well. Cross-check all coding claims against EvalPlus to account for standard HumanEval contamination.

Last updated: 2026-06-26 · Policy: Editorial standards · Methodology