Model Benchmarks

This page compares frontier model capability against cost per million tokens. Cost is plotted on a logarithmic scale because prices span nearly two orders of magnitude, from $0.20 for DeepSeek V3 to $15.00 for o1-preview.
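
To illustrate why a logarithmic axis helps here, the sketch below places each model's cost on a normalized axis from 0.0 to 1.0 under both a linear and a log10 mapping. The costs are copied from the table on this page; the function name axis_position and the normalization to the cheapest and most expensive model are assumptions made for the example, not a description of how the chart is actually rendered.

```python
import math

# Costs in $ per 1M tokens, copied from the table on this page.
COSTS = {
    "o1-preview": 15.00,
    "GPT-4o": 2.50,
    "Claude 3.5 Sonnet": 3.00,
    "Gemini 1.5 Pro": 3.50,
    "Llama 3.1 405B": 5.00,
    "Llama 3.1 70B": 0.90,
    "Mistral Large 2": 4.00,
    "DeepSeek V3": 0.20,
}


def axis_position(cost, lo, hi, log_scale=True):
    """Return where `cost` falls on an axis running from lo (0.0) to hi (1.0).

    Hypothetical helper for this example only.
    """
    if log_scale:
        span = math.log10(hi) - math.log10(lo)
        return (math.log10(cost) - math.log10(lo)) / span
    return (cost - lo) / (hi - lo)


lo, hi = min(COSTS.values()), max(COSTS.values())
for name, cost in sorted(COSTS.items(), key=lambda kv: kv[1]):
    linear = axis_position(cost, lo, hi, log_scale=False)
    logged = axis_position(cost, lo, hi)
    print(f"{name:<18} ${cost:>5.2f}  linear={linear:.2f}  log={logged:.2f}")
```

On the linear mapping every model except o1-preview lands in the first third of the axis, while the log mapping spreads the same models across roughly three quarters of it.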

Intelligence vs. Cost

[Chart: performance (MMLU %, axis from 80% to 95%) plotted against cost ($/1M tokens, logarithmic scale) for o1-preview, GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.1 405B, Llama 3.1 70B, Mistral Large 2 and DeepSeek V3.]

MODEL               MMLU    CODING   COST/1M TOKENS   DEVELOPER
o1-preview          92.3%   90%      $15.00           OpenAI
GPT-4o              88.7%   90.2%    $2.50            OpenAI
DeepSeek V3         88.5%   91%      $0.20            DeepSeek
Claude 3.5 Sonnet   88.3%   92%      $3.00            Anthropic
Gemini 1.5 Pro      85.9%   84.1%    $3.50            Google
Llama 3.1 405B      85.2%   81.7%    $5.00            Meta
Llama 3.1 70B       82%     74%      $0.90            Meta
Mistral Large 2     81.2%   45.1%    $4.00            Mistral

The Performance Leaders

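GPT-4o (88.7% MMLU at $2.50 per 1M tokens) and Claude 3.5 Sonnet (88.3% at $3.00) pair near-top scores with mid-range pricing, and Claude 3.5 Sonnet has the highest coding score in the table (92%). DeepSeek V3 has set a new floor for cost-efficiency: it matches both on MMLU (88.5%) at $0.20 per 1M tokens, the lowest price listed.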
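
One way to make "balance of performance and value" concrete is a Pareto check: a model stays on the frontier if no other model is at least as good on both MMLU and cost and strictly better on one. The sketch below applies that check to the table's MMLU and cost columns only. The function name pareto_frontier is an assumption for this example, and using just these two columns is a simplification, not the page's own ranking method.

```python
# (name, MMLU %, cost in $ per 1M tokens), copied from the table on this page.
MODELS = [
    ("o1-preview", 92.3, 15.00),
    ("GPT-4o", 88.7, 2.50),
    ("DeepSeek V3", 88.5, 0.20),
    ("Claude 3.5 Sonnet", 88.3, 3.00),
    ("Gemini 1.5 Pro", 85.9, 3.50),
    ("Llama 3.1 405B", 85.2, 5.00),
    ("Llama 3.1 70B", 82.0, 0.90),
    ("Mistral Large 2", 81.2, 4.00),
]


def pareto_frontier(models):
    """Keep models that no other model matches or beats on both MMLU and cost
    while being strictly better on at least one of them.

    Hypothetical helper for this example only.
    """
    frontier = []
    for name, mmlu, cost in models:
        dominated = any(
            other_mmlu >= mmlu
            and other_cost <= cost
            and (other_mmlu > mmlu or other_cost < cost)
            for _, other_mmlu, other_cost in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


frontier = pareto_frontier(MODELS)
print(frontier)
```

On these two columns the frontier is o1-preview, GPT-4o and DeepSeek V3. Claude 3.5 Sonnet drops out only because the check ignores the coding column, where it leads, so adding that column as a third criterion would change the result.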

Reasoning & Logic

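The o1 series is a premium tier aimed at advanced reasoning tasks such as complex math and coding. The premium is steep: o1-preview costs $15.00 per 1M tokens, six times the price of GPT-4o, and posts the highest MMLU score in the table (92.3%). On the coding column shown here its 90% is level with GPT-4o (90.2%), so the advantage does not show up on that particular benchmark.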
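
The size of that premium can be worked out directly from the table. The sketch below compares o1-preview with GPT-4o on price multiple, MMLU points gained, and extra dollars per 1M tokens for each point gained. The function name premium and the choice of GPT-4o as the baseline are assumptions for this example.

```python
# MMLU % and cost ($ per 1M tokens) for the two OpenAI rows of the table.
O1_PREVIEW = {"mmlu": 92.3, "cost": 15.00}
GPT_4O = {"mmlu": 88.7, "cost": 2.50}


def premium(upgrade, baseline):
    """Return (price multiple, MMLU points gained, extra $ per point gained).

    Hypothetical helper for this example only; it assumes the upgrade
    scores strictly higher than the baseline, otherwise the division fails.
    """
    multiple = upgrade["cost"] / baseline["cost"]
    gain = upgrade["mmlu"] - baseline["mmlu"]
    extra_per_point = (upgrade["cost"] - baseline["cost"]) / gain
    return multiple, gain, extra_per_point


multiple, gain, extra_per_point = premium(O1_PREVIEW, GPT_4O)
print(
    f"{multiple:.1f}x the price for {gain:.1f} more MMLU points "
    f"(${extra_per_point:.2f} per 1M tokens per point)"
)
```

With the table's figures this comes to 6.0x the price for 3.6 more MMLU points, or about $3.47 extra per 1M tokens for each point.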