AI Model Leaderboards

Ranked performance across the benchmarks that define frontier AI capability. Sourced from official lab evaluations and independent harnesses.

Live Data • Updated May 4, 2026
5 models tracked
Weekly refresh

GPQA Diamond
198 expert-validated questions in biology, chemistry, and physics. Designed to be unsolvable by non-experts even with internet access.

Rank | Model     | Developer | Accuracy (%) | Type
01   | NEW MODEL | TEST      | 99.9         | Reasoning

SWE-bench Verified
Evaluates resolution of real GitHub issues, the practical test of AI engineering capability.

MMLU-Pro
A more rigorous version of the standard MMLU benchmark, requiring deeper reasoning across academic disciplines.

Benchmark scores are sourced from official lab technical reports, peer-reviewed papers, and independent evaluation harnesses (EleutherAI LM Eval Harness, BIG-Bench, HELM). Where labs use different evaluation protocols, we note the specific setup. Scores reflect the best reported result at the time of model release or latest update. This page is maintained by the ORIBOS editorial team and updated as new evaluations are published.
