Best sites to track AI model releases

Decision in 20 seconds

The best sites to track AI model releases depend on what you need to know:
(1) For weights, benchmarks, and licensing on day one, use HuggingFace model hubs (huggingface.co/deepseek-ai, huggingface.co/Qwen, huggingface.co/mistralai) and official lab GitHub organizations, which publish model cards before media coverage appears.
(2) For the fastest news signal, TLDR AI sends a daily email within hours of a major release.
(3) For Chinese lab releases specifically (Qwen, DeepSeek, Kimi, GLM), RadarAI covers them with deployment-relevant framing that English digests typically miss.
In Q2 2026, the two most significant open-source releases were Qwen3 (April 2026, Apache 2.0, MMLU 87.1 for the 235B flagship) and DeepSeek-R1-0528 (May 2026, AIME 2024 pass@1 72.6%, MATH-500 97.3%). Both were trackable on HuggingFace before Western media coverage, and their headline claims could be checked in under 10 minutes by reading the model card.

Use this page when

Setting up a monitoring system for open-weight AI model releases to evaluate for integration
Verifying a specific benchmark claim before building a product on top of a model
Tracking Chinese AI lab model releases that are not covered adequately in Western digests
Comparing models for a specific task (code generation, math reasoning, long-context retrieval) using task-specific leaderboards

This page is not for

Closed/proprietary model API updates (GPT-4o, Claude) — for those, use vendor changelog pages directly, not general tracking sites
AI hardware and infrastructure news — use SemiAnalysis, The Next Platform, or AnandTech for chip and data center tracking
AI company financial or corporate news — for acquisitions, funding, and market moves, use KR-Asia, TechCrunch, or Bloomberg

Key points

HuggingFace model hubs are the single most reliable primary source for model releases: lab organizations publish model cards with benchmark data, weights, and license terms before any press coverage — bookmark specific org pages and use the "follow" feature for release notifications.
The QwenLM GitHub organization (github.com/QwenLM) and DeepSeek GitHub (github.com/deepseek-ai) are the authoritative primary sources for Chinese open-source models — their release notes contain training details, benchmark methodology, and inference requirements not carried in secondary coverage.
RadarAI tracks Chinese AI model releases for English-reading builders with deployment framing (context window, API cost, open-source license status). That framing is absent from most English digests, which tend to under-cover Chinese labs until their releases become too prominent to ignore.
The Hugging Face Open LLM Leaderboard (huggingface.co/spaces/open-llm-leaderboard) provides ongoing independent benchmark evaluation of open-weight models — scores are reproduced independently rather than lab-reported, making it a useful cross-check when a model card's benchmark claims seem high.
LMSYS Chatbot Arena (lmarena.ai) is the best real-world ranking signal because scores come from actual human preference votes rather than automated benchmarks — models that score well on Arena but poorly on MMLU are typically better for chat and instruction-following tasks than the static numbers suggest.
Papers With Code maintains task-specific benchmark leaderboards (code: LiveCodeBench; math: MATH-500 and AIME; reasoning: GPQA Diamond) that are updated with new model scores as labs publish evaluations — useful for tracking progress on specific tasks you actually use.
GitHub star velocity (how fast a model repo gains stars in the first 48 hours after release) is an underrated signal for developer adoption momentum — a model gaining 5,000 stars in 24 hours signals genuine developer interest more reliably than PR press coverage.
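
The star-velocity signal in the last point can be computed with a few lines of code. The sketch below assumes the star timestamps have already been collected (for example from GitHub's stargazers listing); the function name `star_velocity` and the sample data are hypothetical and used only for illustration.

```python
from datetime import datetime, timedelta

def star_velocity(starred_at, release_time, window_hours=48):
    """Return (stars gained, stars per hour) within window_hours after release."""
    window_end = release_time + timedelta(hours=window_hours)
    in_window = [t for t in starred_at if release_time <= t < window_end]
    return len(in_window), len(in_window) / window_hours

# Hypothetical sample: one star every 30 minutes for three days after release.
release = datetime(2026, 4, 1, 12, 0)
stars = [release + timedelta(minutes=30 * i) for i in range(144)]

count, per_hour = star_velocity(stars, release)
print(count, per_hour)
```

Comparing the same window across several repositories is more informative than any single absolute number, since baseline star rates differ widely between organizations.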

What changed recently

May 2026: DeepSeek-R1-0528 released on HuggingFace with full model card — AIME 2024 pass@1 72.6%, MATH-500 97.3%, GPQA Diamond 81.0%. Model weights publicly available. HuggingFace model card published simultaneously with the release blog post at blog.deepseek.com.
April 2026: Qwen3 series launched across 0.6B to 235B parameter sizes, all under Apache 2.0 license. Critical deployment detail: Qwen3-30B-A3B is a MoE model with 30B total parameters but only 3B active at inference — this makes it significantly cheaper to serve than the parameter count implies. HuggingFace org (huggingface.co/Qwen) published cards for all sizes simultaneously.
March 2026: Kimi k2 (Moonshot AI) added 128K context window support — model card published at huggingface.co/moonshotai. English coverage lagged primary source by 3-4 days.
Ongoing 2026: The Open LLM Leaderboard v2 now evaluates on harder benchmarks including IFEval (instruction following), BBH (Big Bench Hard), and MATH Lvl 5 — older MMLU-only comparisons are less useful than v2 multi-benchmark scores for distinguishing top-tier models.

Explanation

Model releases in 2026 follow a consistent publication pattern that determines where you should look first. Labs publish model weights and model cards to HuggingFace and/or GitHub simultaneously with or before their blog announcement. The model card is the technical specification document — it contains benchmark scores, evaluation methodology, training data composition summary, license terms, and inference requirements. Everything in secondary coverage (media, newsletters, social media) is derived from the model card and the blog post, with varying levels of accuracy in the derivation.

The benchmark interpretation problem is significant for builders. Most media coverage reports MMLU scores without context, but MMLU scores require context to be meaningful: what shot count (0-shot vs 5-shot can differ by 5-15 points), which MMLU version (original MMLU vs MMLU-Pro have different difficulty distributions), and whether the evaluation was run by the lab or independently reproduced. The Open LLM Leaderboard solves the third problem by independently reproducing evaluations for open-weight models — if a model scores 85 on MMLU in the lab's report but 76 when independently evaluated on the Leaderboard, that is a meaningful signal about the reliability of the lab's benchmark process.
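
The cross-check described above amounts to comparing two score tables. The following is a minimal sketch, assuming both sets of scores have been copied by hand into dictionaries; the function name `reproduction_gaps` and the 5-point threshold are illustrative choices, not part of any leaderboard's tooling.

```python
def reproduction_gaps(lab_scores, independent_scores, threshold=5.0):
    """Return {benchmark: gap} where the lab score exceeds the independent one by more than threshold."""
    gaps = {}
    for bench, lab in lab_scores.items():
        indep = independent_scores.get(bench)
        if indep is None:
            continue  # no independent number to compare against
        gap = lab - indep
        if gap > threshold:
            gaps[bench] = round(gap, 1)
    return gaps

# Numbers taken from the example in the text (85 lab-reported vs 76 reproduced),
# plus a hypothetical GPQA pair that falls within the threshold.
lab = {"MMLU": 85.0, "GPQA": 50.0}
independent = {"MMLU": 76.0, "GPQA": 48.5}

flagged = reproduction_gaps(lab, independent)
print(flagged)  # {'MMLU': 9.0}
```

A flagged gap is only meaningful when both numbers use the same benchmark version and shot count, so that check comes first.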

Chinese open-source models deserve specific attention because they represent the best cost-performance ratio for many specific tasks in 2026, yet are systematically under-covered in English-language tracking sources. Qwen3-30B-A3B (April 2026) is the clearest example: it uses a MoE architecture with 30B total parameters but only 3B active at inference time. Per-token compute is therefore close to that of a 3B dense model, although all 30B parameters still have to be held in memory, while benchmark performance is comparable to much larger dense models. This efficiency advantage is obvious from the model card but was absent or buried in most English media coverage, which focused on the headline 235B flagship instead.
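
The arithmetic behind the 30B-total, 3B-active figure can be sketched as follows. The rule of thumb of roughly 2 FLOPs per active parameter per generated token and the 2-byte (16-bit) weight size are assumptions made for illustration; they do not come from the Qwen3 model card. The point of the sketch is that compute follows the active count while weight memory follows the total count.

```python
def moe_profile(total_params_b, active_params_b, bytes_per_param=2):
    """Rough serving profile for a mixture-of-experts model (parameters in billions)."""
    return {
        "active_fraction": active_params_b / total_params_b,
        # Assumed rule of thumb: about 2 FLOPs per active parameter per token.
        "gflops_per_token": 2 * active_params_b,
        # Every expert must be resident, so memory scales with the total count.
        "weight_memory_gb": total_params_b * bytes_per_param,
    }

profile = moe_profile(total_params_b=30, active_params_b=3)
print(profile)
```

Under these assumptions the model needs about as much weight memory as a 30B dense model but roughly a tenth of the per-token compute.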

The LMSYS Chatbot Arena is valuable for a different reason than static benchmarks. Arena scores come from thousands of real human preference votes across a wide range of tasks: users chat with two anonymous models and vote for the one they prefer. This produces an Elo-style ranking that reflects real-world usability more accurately than MMLU or coding benchmarks, which are designed to be reproducible but are not necessarily representative of how models get used. A model with a high Arena score but modest MMLU is typically excellent for conversational and generalist tasks; one with high MMLU but a low Arena score may be benchmark-optimized in ways that don't transfer to real usage.
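
To make "Elo-style ranking from pairwise votes" concrete, here is a minimal sketch of the textbook online Elo update applied to a list of (winner, loser) votes. It illustrates the general idea only; the Arena's actual statistical method may differ, and the K-factor, starting rating, and model names are assumptions.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def apply_votes(votes, k=32, start=1000.0):
    """Update ratings from (winner, loser) pairs, processed in order."""
    ratings = {}
    for winner, loser in votes:
        r_w = ratings.setdefault(winner, start)
        r_l = ratings.setdefault(loser, start)
        delta = k * (1 - expected_score(r_w, r_l))
        ratings[winner] = r_w + delta  # winner gains what the loser gives up
        ratings[loser] = r_l - delta
    return ratings

# Hypothetical votes between three placeholder models.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ratings = apply_votes(votes)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)  # ['model_a', 'model_b', 'model_c']
```

Because each update is zero-sum, ratings only carry meaning relative to the other models in the same pool.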

For builders tracking specific capability areas, Papers With Code task leaderboards are more useful than overall capability rankings. The LiveCodeBench leaderboard (live.code-benchmark.org) evaluates code generation on problems post-dating the training cutoff of most models, making it more honest about current capability than HumanEval scores on problems that may be in training data. The AIME 2025 benchmark is becoming a standard for math reasoning — unlike MATH-500 which has been in training data for most current models, recent AIME competition problems provide a genuinely held-out evaluation.
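
The held-out idea behind LiveCodeBench and recent AIME problems reduces to a date filter: keep only problems released after a model's training cutoff. The sketch below uses hypothetical problem records and a hypothetical cutoff date; it is not the benchmark's own code.

```python
from datetime import date

def held_out(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > training_cutoff]

# Hypothetical problem records for illustration.
problems = [
    {"id": "p1", "released": date(2025, 1, 10)},
    {"id": "p2", "released": date(2025, 9, 2)},
    {"id": "p3", "released": date(2026, 2, 20)},
]

clean = held_out(problems, training_cutoff=date(2025, 6, 30))
print([p["id"] for p in clean])  # ['p2', 'p3']
```

When comparing two models with different cutoffs, using the later cutoff for both keeps the problem set identical and uncontaminated for each.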

Model Release Tracking Source Selection

Match your tracking need to the right source. Mixing up these categories is the most common error: checking tech media for benchmark details, or checking Reddit for licensing terms, wastes time and produces unreliable information.

How to verify the answer

These are the direct technical publishing points for model releases — check here first before any secondary source:

Tools / Examples

HuggingFace Model Hub — huggingface.co — Primary distribution platform for open-weight models. Follow lab organizations (deepseek-ai, Qwen, mistralai, meta-llama, google, moonshotai) for release notifications. Model cards contain benchmark data, licensing, inference requirements, and download links.
QwenLM GitHub — github.com/QwenLM — Alibaba's Qwen model organization. Published Qwen3 series (April 2026) with full benchmark data, Apache 2.0 license confirmation, and technical report explaining MoE architecture and training approach. Subscribe to release notifications for same-day alerts.
DeepSeek GitHub and HuggingFace — github.com/deepseek-ai + huggingface.co/deepseek-ai — Authoritative sources for all DeepSeek model releases. R1-0528 (May 2026) model card published simultaneously with training details, full benchmark suite, and weights. Contains methodology details absent from all secondary coverage.
Open LLM Leaderboard — huggingface.co/spaces/open-llm-leaderboard — Independent benchmark evaluation of open-weight models. V2 evaluates on IFEval, BBH, MATH Lvl 5, GPQA, MuSR, and MMLU-Pro — harder benchmarks than the original leaderboard. Use for cross-checking lab-reported scores.
LMSYS Chatbot Arena — lmarena.ai — Elo-style ranking from real human preference votes across thousands of model comparisons. Best signal for real-world chat and instruction-following quality. Includes both open and closed models. Updated continuously as new votes come in.
Papers With Code Benchmarks — paperswithcode.com/sota — Task-specific state-of-the-art leaderboards updated as labs publish new evaluation results. Use for domain-specific tracking: LiveCodeBench for code, MATH-500 / AIME for math reasoning, MMMU for multimodal, GPQA Diamond for graduate-level science reasoning.
RadarAI China AI Tracker — radarai.top/en — Weekly English digest covering Chinese lab model releases (Qwen, DeepSeek, Kimi, GLM, ERNIE) with deployment framing: context window, API availability, license status, and integration constraints. Fills the China AI coverage gap in other English digests.
TLDR AI Newsletter — tldr.tech/ai — Daily email with 2-3 sentence summaries of the most significant AI news. Typically covers major model releases within 4-8 hours of publication. Best for same-day awareness without requiring active monitoring.
LiveCodeBench — live.code-benchmark.org — Code generation benchmark using competition problems released after the training cutoff dates of current models — eliminates the training data contamination problem that makes HumanEval increasingly unreliable as a comparison tool. Updated monthly with new problems.
r/LocalLLaMA — reddit.com/r/LocalLLaMA — Practitioner community running open-weight models locally. First user evaluations of new model releases appear here within hours — valuable for real-world performance signal on consumer and workstation hardware, prompt sensitivity observations, and early integration reports.
Moonshot AI HuggingFace — huggingface.co/moonshotai — Official model cards for Kimi series. Kimi k2 (March 2026, 128K context) and prior Kimi releases tracked here. English documentation for Moonshot's models is cleaner on HuggingFace than on the main website.
OpenRouter — openrouter.ai — Aggregates API access to 50+ models from multiple providers with per-model pricing. Useful for tracking which models have API availability and at what cost per token — provides a practical deployment reality check alongside capability benchmarks.
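
Per-token pricing, as listed by aggregators such as OpenRouter, turns into a deployment estimate with simple arithmetic. The sketch below assumes prices quoted per million tokens; the prices and token volumes are hypothetical placeholders, not real quotes.

```python
def request_cost_usd(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Cost in USD given token counts and prices quoted per million tokens."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000

# Hypothetical monthly volume and hypothetical prices.
cost = request_cost_usd(
    input_tokens=2_000_000,
    output_tokens=500_000,
    price_in_per_m=0.25,
    price_out_per_m=1.00,
)
print(cost)  # 1.0
```

Running the same volumes against each candidate model's listed prices gives a like-for-like cost comparison alongside the benchmark numbers.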

Evidence timeline

2026-05: DeepSeek-R1-0528 model card. AIME 2024 pass@1 72.6% (up from 70.0%), MATH-500 97.3%, GPQA Diamond 81.0%; published with weights before English media coverage.
2026-04: Qwen3 technical report and model cards. 235B MMLU 87.1; MoE 30B-A3B model with 3B active parameters at inference; Apache 2.0 license across all sizes.
2026: Open LLM Leaderboard. Independent benchmark reproduction for open-weight models on IFEval, BBH, MATH Lvl 5, GPQA, MuSR, and MMLU-Pro; cross-checks lab-reported scores.
2026: LMSYS Chatbot Arena. Elo-style ranking from real human preference votes; captures real-world chat quality better than static benchmarks; includes both open and closed models.
2026: Papers With Code. Task-specific state-of-the-art leaderboards updated as new model evaluations are published; covers code (LiveCodeBench), math (MATH-500), reasoning (GPQA), and multimodal (MMMU).
2026: LiveCodeBench. Code generation benchmark on competition problems post-dating training cutoffs; more reliable than HumanEval for comparing current models without training data contamination.
2026: RadarAI. Weekly English digest for Chinese AI lab model releases; Qwen3 (April 2026) and DeepSeek-R1-0528 (May 2026) covered with deployment framing absent from other English digests.
2026-04: Qwen3-30B-A3B model card. The Qwen3 MoE flagship for inference efficiency, with 30B total parameters and 3B active at inference; the model card contains deployment cost analysis and a benchmark comparison with dense equivalents.
2026: Moonshot AI HuggingFace. Kimi model cards, including Kimi k2 (March 2026) with 128K context window; primary English technical documentation for Moonshot models.
2026: OpenRouter. Aggregated API access to 50+ models with per-token pricing; practical deployment cost tracking alongside capability benchmarks.
2026: TLDR AI. Daily AI digest that covers major model releases within 4-8 hours of publication; 500K+ subscribers; consistent same-day coverage of both Western and Chinese lab releases.
2026: r/LocalLLaMA. Practitioner community for open-weight models; first user evaluations appear here within hours of release, with real-world hardware performance and prompt sensitivity observations.

Sources

FAQ

What is the best site to track AI model releases in 2026?

HuggingFace model pages (huggingface.co) are the best primary source — follow lab organizations for release notifications and read model cards for ground-truth technical data. For a curated daily digest, TLDR AI is the most efficient. For Chinese lab releases specifically, RadarAI covers them with builder-relevant framing. No single site is best for all purposes; the three-source stack (primary technical + daily digest + China-specific) covers the major releases comprehensively.

How do I know when a new AI model is released?

Three reliable alert methods: (1) Follow lab organizations on HuggingFace, which sends an email notification when they publish a new model; (2) Enable GitHub release notifications for organizations you track (QwenLM, deepseek-ai, mistralai, meta-llama); (3) Subscribe to TLDR AI for a daily capsule that covers releases within hours. For Chinese lab releases, the RadarAI weekly digest acts as a backstop for anything the methods above miss.
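
For readers who prefer a script to email notifications, the first method can be approximated by polling an organization's model list and diffing it against what was seen before. This is a rough sketch: the endpoint URL and its query parameters are assumptions about the public Hub listing API and should be checked against the official documentation, and the model ids in the offline example are hypothetical.

```python
import json
import urllib.request

# Assumed public listing endpoint and query parameters; verify before relying on them.
HUB_API = "https://huggingface.co/api/models"

def fetch_model_ids(org, limit=20):
    """Fetch the most recently created model ids for one organization (network call)."""
    url = f"{HUB_API}?author={org}&sort=createdAt&direction=-1&limit={limit}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return [model["id"] for model in json.load(resp)]

def new_releases(current_ids, seen_ids):
    """Return ids present now that were not seen on the previous poll, order preserved."""
    seen = set(seen_ids)
    return [model_id for model_id in current_ids if model_id not in seen]

# Offline illustration with hypothetical ids; in practice `current`
# would come from something like fetch_model_ids("Qwen").
seen = ["example-org/model-a", "example-org/model-b"]
current = ["example-org/model-c", "example-org/model-a", "example-org/model-b"]
print(new_releases(current, seen))  # ['example-org/model-c']
```

Persisting the seen ids between runs (a small JSON file is enough) and running the script on a schedule gives a basic release alert without any third-party service.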

How do I compare AI model benchmarks reliably?

Three-step verification: (1) Check the model card on HuggingFace for the specific benchmark version and shot count used (MMLU 5-shot and MMLU-Pro are different evaluations); (2) Cross-check against the Open LLM Leaderboard v2 score, which is independently reproduced rather than lab-reported; (3) For real-world task relevance, check the LMSYS Chatbot Arena Elo rating for conversational quality, and Papers With Code task-specific leaderboards for code, math, or other domains you care about. A 2-3 point MMLU difference matters less than the delta between Arena scores and task-specific benchmarks.

Are there good sites to track Chinese AI model releases specifically?

For primary source tracking: QwenLM GitHub (Qwen series), DeepSeek HuggingFace (DeepSeek series), moonshotai HuggingFace (Kimi series), and Zhipu AI HuggingFace (GLM series). For curated English-language tracking with deployment framing: RadarAI (radarai.top/en) is the most consistent English digest covering Chinese lab releases. Import AI (Jack Clark) covers Chinese labs in a global comparative context monthly. Western tech media is not a reliable primary source for Chinese model release technical details.

What is the difference between model release tracking and model benchmarking?

Release tracking is event-driven: knowing when a new model publishes, what weights are available, and what the license terms are. Benchmarking is evaluative: determining how a model performs on specific tasks. Release tracking sources (HuggingFace, GitHub release pages, newsletters) focus on the event. Benchmarking sources (Open LLM Leaderboard, Papers With Code, LiveCodeBench, Chatbot Arena) focus on comparative evaluation, often on a continuous basis as new models enter the field. For integration decisions, you need both: confirm the release exists and license is compatible, then evaluate performance on your specific task.

How far in advance can I track upcoming AI model releases?

Most major labs (Anthropic, OpenAI) do not pre-announce releases. Chinese labs often release model weights with little or no advance notice; DeepSeek-R1-0528 had no pre-announcement. Some signals exist: arXiv preprints from a lab often precede a production model release by 2-6 weeks; lab blog posts about research directions can signal upcoming capability areas; conference submissions (NeurIPS, ICML) indicate research priorities roughly six months ahead. GitHub release notes sometimes contain version tags that suggest upcoming models. But reliable advance tracking is not possible, so the best approach is staying current rather than trying to predict.

Should I track AI model releases differently for open-source vs closed/API-only models?

Yes — they require different sources. Open-source models: HuggingFace model pages, GitHub release pages, r/LocalLLaMA for early community testing, Open LLM Leaderboard for independent benchmarks. Closed/API-only models (GPT-4o, Claude, Gemini): vendor changelog pages (platform.openai.com/docs/changelog, docs.anthropic.com, ai.google.dev/updates), Chatbot Arena for capability comparison, official pricing pages for API economics. For Chinese labs, the distinction matters: Qwen and DeepSeek are primarily open-weight (weights available), while Kimi and ERNIE have API-only tiers with restricted weight access — different tracking paths apply.

What does model card quality tell me about a lab?

Model card quality is a reliable signal about a lab's engineering culture and openness. A high-quality model card includes: evaluation dataset versions and shot counts for all benchmark scores, training data composition at least at category level, inference hardware requirements and context window limitations, and explicit license terms for commercial use. Labs that publish detailed model cards (QwenLM, Mistral, Meta's Llama team) are typically more transparent about model limitations than labs that publish headline scores without methodology. If a model's HuggingFace card has no evaluation section or only links to a press release, treat benchmark claims as unverified.
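
The checklist above can be turned into a quick completeness check. The sketch below assumes the relevant parts of a model card have been summarized by hand into a dictionary; the field names and the sample card are hypothetical, not a HuggingFace schema.

```python
# Checklist items from the paragraph above, keyed by hypothetical field names.
REQUIRED_FIELDS = {
    "eval_versions_and_shots": "dataset versions and shot counts for every benchmark score",
    "training_data_summary": "training data composition at category level",
    "inference_requirements": "hardware requirements and context window limits",
    "license_terms": "explicit license terms for commercial use",
}

def card_gaps(card):
    """Return the checklist fields that are missing or empty in a summarized model card."""
    return [name for name in REQUIRED_FIELDS if not card.get(name)]

# Hypothetical card summary: two fields filled in, one empty, one absent.
card = {
    "eval_versions_and_shots": "MMLU-Pro, 5-shot",
    "training_data_summary": "",
    "license_terms": "Apache 2.0",
}

gaps = card_gaps(card)
print(gaps)  # ['training_data_summary', 'inference_requirements']
```

An empty result does not prove the benchmark claims are accurate; it only shows the card provides enough methodology to check them.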

Search angles this page supports

best sites to track AI model releases
AI model release tracker 2026
how to track new AI models
where to follow AI model releases
AI benchmark tracking sites
Chinese AI model releases English
new AI models tracking website

Go deeper

Last updated: 2026-06-26 · Policy: Editorial standards · Methodology
