LIVE BENCHMARKS

AI model benchmarks — plus the one number no one else publishes.

Intelligence Index, GPQA Diamond, MATH-500, LiveCodeBench, MMLU-Pro, AIME 2025 — refreshed as labs publish. Then build a council and see how its ceiling compares to the best individual model.

Full benchmark table

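| Model | Lab | Intelligence Index | GPQA Diamond | MATH-500 | LiveCodeBench | MMLU-Pro | AIME 2025 | Context | Updated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Opus 4.7 | Anthropic | 76.0 | 89.0 | 96.0 | 84.0 | 87.5 | 91.0 | 200k | 2026-06-03 |
| GPT-5.1 | OpenAI | 73.0 | 87.0 | 96.5 | 81.0 | 85.0 | 92.0 | 256k | 2026-06-03 |
| Claude Sonnet 4.5 | Anthropic | 70.0 | 84.0 | 94.0 | 79.0 | 84.0 | 88.0 | 200k | 2026-06-03 |
| Qwen 3.6 Plus | Alibaba | 66.0 | 79.0 | 91.0 | 72.0 | 78.0 | 82.0 | 1000k | 2026-06-03 |
| Kimi K2.5 | Moonshot | 64.0 | 76.0 | 90.0 | 70.0 | 76.0 | 80.0 | 200k | 2026-06-03 |
| GLM 5.1 | Zhipu | 60.0 | 72.0 | 85.0 | 65.0 | 73.0 |  | 128k | 2026-06-03 |
| Command A | Cohere |  |  |  |  |  |  | 256k | 2026-06-04 |
| Gemini 3.1 Pro | Google |  |  |  |  |  |  | 1049k | 2026-06-04 |
| Grok 4.3 | xAI |  |  |  |  |  |  | 1000k | 2026-06-04 |
| DeepSeek V4 Pro | DeepSeek |  |  |  |  |  |  | 1049k | 2026-06-04 |
| Llama 3.3 70B | Meta |  |  |  |  |  |  | 131k | 2026-06-04 |
| Mistral Large 3 | Mistral |  |  |  |  |  |  | 262k | 2026-06-04 |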

What each benchmark measures

The Council Ceiling — the number no one else publishes

For each benchmark suite, we compute the maximum score across the council members you select. That number is the upper bound on what a synthesis chairman could produce from the council's responses. A council of 4 frontier models routinely beats the best individual model on 4 or 5 of the 6 suites simultaneously — that gap is the value of multi-model deliberation, made empirical.
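The per-suite maximum described above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: the `council_ceiling` function, the `SCORES` layout, and the choice to represent a blank cell as `None` are all assumptions made for the example. The scores are copied from the benchmark table on this page.

```python
from typing import Optional

SUITES = ["Intelligence Index", "GPQA Diamond", "MATH-500",
          "LiveCodeBench", "MMLU-Pro", "AIME 2025"]

# Scores copied from the benchmark table; None marks a blank cell.
SCORES: dict[str, list[Optional[float]]] = {
    "Claude Opus 4.7": [76.0, 89.0, 96.0, 84.0, 87.5, 91.0],
    "GPT-5.1":         [73.0, 87.0, 96.5, 81.0, 85.0, 92.0],
    "Qwen 3.6 Plus":   [66.0, 79.0, 91.0, 72.0, 78.0, 82.0],
    "Command A":       [None, None, None, None, None, None],
}


def council_ceiling(members: list[str]) -> dict[str, Optional[float]]:
    """Return the highest known score per suite across the chosen members."""
    ceiling: dict[str, Optional[float]] = {}
    for i, suite in enumerate(SUITES):
        # Blank cells are skipped rather than treated as zero.
        known = [SCORES[m][i] for m in members if SCORES[m][i] is not None]
        ceiling[suite] = max(known) if known else None
    return ceiling


council = ["Claude Opus 4.7", "GPT-5.1", "Qwen 3.6 Plus", "Command A"]
for suite, score in council_ceiling(council).items():
    print(f"{suite}: {score}")
```

For this council, the ceiling takes MATH-500 (96.5) and AIME 2025 (92.0) from GPT-5.1 and the other four suites from Claude Opus 4.7. A member with no published scores, such as Command A here, leaves the ceiling unchanged.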

How to read the live numbers

A model that wins Intelligence Index but loses LiveCodeBench is strong at narrative reasoning but weaker at producing working code under time pressure. A model that wins MATH-500 but loses MMLU-Pro is strong on math-shaped problems but weaker on cross-domain knowledge. No model wins everything.

A model dominating only one suite is suspicious. A model in the top quartile of three suites is real.
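One way to apply the top-quartile rule of thumb is to count, for each model, how many suites place it in the top quarter of scored models. The sketch below is illustrative only: the function name `top_quartile_counts`, the rounding rule (top quartile means the best `ceil(n / 4)` scores, ties included), and the restriction to the five fully scored rows of the table are assumptions, not a published methodology.

```python
import math
from typing import Optional

# Five fully scored rows from the benchmark table, in suite order:
# Intelligence Index, GPQA Diamond, MATH-500, LiveCodeBench, MMLU-Pro, AIME 2025
SCORES: dict[str, list[Optional[float]]] = {
    "Claude Opus 4.7":   [76.0, 89.0, 96.0, 84.0, 87.5, 91.0],
    "GPT-5.1":           [73.0, 87.0, 96.5, 81.0, 85.0, 92.0],
    "Claude Sonnet 4.5": [70.0, 84.0, 94.0, 79.0, 84.0, 88.0],
    "Qwen 3.6 Plus":     [66.0, 79.0, 91.0, 72.0, 78.0, 82.0],
    "Kimi K2.5":         [64.0, 76.0, 90.0, 70.0, 76.0, 80.0],
}


def top_quartile_counts(scores: dict[str, list[Optional[float]]]) -> dict[str, int]:
    """Count the suites in which each model lands in the top quartile."""
    n_suites = len(next(iter(scores.values())))
    counts = {model: 0 for model in scores}
    for i in range(n_suites):
        # Rank only the models that have a score for this suite.
        ranked = sorted(
            (row[i] for row in scores.values() if row[i] is not None),
            reverse=True,
        )
        if not ranked:
            continue
        # Assumed rule: the top quartile is the best ceil(n / 4) scores.
        threshold = ranked[math.ceil(len(ranked) / 4) - 1]
        for model, row in scores.items():
            if row[i] is not None and row[i] >= threshold:
                counts[model] += 1
    return counts


for model, n in top_quartile_counts(SCORES).items():
    verdict = "consistent" if n >= 3 else "check further"
    print(f"{model}: top quartile in {n} suites ({verdict})")
```

With only five scored models the top quartile is two models per suite, so the cutoff is coarse. The check becomes more informative as more rows of the table are filled in.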

Why not trust vendor benchmarks

Every lab publishes the suites where it wins and trains extensively to win specific benchmarks. To spot this, look for live benchmarks whose problems were released after the model's training cutoff (LiveCodeBench, AIME 2025) and cross-reference multiple suites.

Frequently asked questions

How often are benchmarks updated?

Weekly. We refresh scores from Artificial Analysis, LiveCodeBench, and lab-published reports. Each row shows a last-updated date.

Where do the scores come from?

Third-party sources where they exist (Artificial Analysis, LiveCodeBench leaderboards), and lab-published model cards otherwise. We never average vendor charts.

Why is a cell blank?

If a cell is blank, the lab has not released a score for that suite and no third-party benchmark has run the model yet. We do not fabricate numbers.

How is the Council Ceiling calculated?

For each suite, we take the maximum score across the models you selected. It represents the upper bound on what a synthesis chairman could weave from the council's responses.

Can I run my own benchmark on these models?

Yes. Sign up free, build a custom council, and run your own prompts side by side.

Why not trust vendor benchmarks?

Every lab publishes the suites where it wins. The defense is cross-referencing multiple suites and weighting newer, contamination-resistant ones higher.
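As a rough illustration of weighting newer suites higher, the sketch below computes a weighted mean per model. The weights are hypothetical and chosen only for the example (2.0 for LiveCodeBench and AIME 2025, 1.0 for the rest); they are not values this page publishes. The function name `weighted_score` is likewise an assumption.

```python
from typing import Optional

SUITES = ["Intelligence Index", "GPQA Diamond", "MATH-500",
          "LiveCodeBench", "MMLU-Pro", "AIME 2025"]

# Hypothetical weights: newer, contamination-resistant suites count double.
WEIGHTS = {"Intelligence Index": 1.0, "GPQA Diamond": 1.0, "MATH-500": 1.0,
           "LiveCodeBench": 2.0, "MMLU-Pro": 1.0, "AIME 2025": 2.0}


def weighted_score(row: list[Optional[float]]) -> Optional[float]:
    """Weighted mean over the suites that have a score; None if all are blank."""
    total = 0.0
    weight_sum = 0.0
    for suite, score in zip(SUITES, row):
        if score is None:
            continue  # a blank cell contributes neither score nor weight
        total += WEIGHTS[suite] * score
        weight_sum += WEIGHTS[suite]
    return total / weight_sum if weight_sum else None


# Two rows copied from the benchmark table.
opus = [76.0, 89.0, 96.0, 84.0, 87.5, 91.0]
gpt = [73.0, 87.0, 96.5, 81.0, 85.0, 92.0]
print(f"Claude Opus 4.7: {weighted_score(opus):.2f}")
print(f"GPT-5.1: {weighted_score(gpt):.2f}")
```

Dividing by the sum of the weights actually used keeps models with blank cells comparable, although a score built from fewer suites should be read with more caution.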