AI Model Leaderboards

Ranked performance across the benchmarks that define frontier AI capability. Sourced from official lab evaluations and independent harnesses.

Live Data • Updated May 4, 2026
5 models tracked
Weekly refresh

GPQA Diamond
198 expert-validated questions in biology, chemistry, and physics. Designed to be unsolvable by non-experts even with internet access.

Rank | Model     | Developer | Accuracy (%) | Type
01   | NEW MODEL | TEST      | 99.9         | Reasoning

SWE-bench Verified
Evaluates resolution of real GitHub issues, the practical test of AI engineering capability.

MMLU-Pro
A more rigorous version of the standard MMLU benchmark, requiring deeper reasoning across academic disciplines.

Benchmark scores are sourced from official lab technical reports, peer-reviewed papers, and independent evaluation harnesses (EleutherAI LM Eval Harness, BIG-Bench, HELM). Where labs use different evaluation protocols, we note the specific setup. Scores reflect the best reported result at the time of model release or latest update. This page is maintained by the ORIBOS editorial team and updated as new evaluations are published.
