Quantitative Intelligence

Model Benchmarks

Frontier capability compared against price per million tokens. Cost is plotted on a logarithmic scale because prices span nearly two orders of magnitude, from $0.20 per 1M tokens at the commodity tier to $15.00 at the reasoning tier.
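To illustrate why the cost axis is logarithmic, here is a minimal sketch that places three prices from the table on a normalized axis, once with log10 spacing and once with linear spacing. The helper names `log_position` and `linear_position` are hypothetical and only serve the illustration; this is not the page's actual charting code.

```python
import math

# Prices in $ per 1M tokens, copied from the table on this page.
COSTS = {
    "DeepSeek V3": 0.20,
    "GPT-4o": 2.50,
    "o1-preview": 15.00,
}

def log_position(cost, lo, hi):
    """Map a cost to [0, 1] along a log10 axis spanning lo..hi."""
    return (math.log10(cost) - math.log10(lo)) / (math.log10(hi) - math.log10(lo))

def linear_position(cost, lo, hi):
    """Map a cost to [0, 1] along a linear axis spanning lo..hi."""
    return (cost - lo) / (hi - lo)

lo, hi = min(COSTS.values()), max(COSTS.values())
log_pos = {name: log_position(c, lo, hi) for name, c in COSTS.items()}
lin_pos = {name: linear_position(c, lo, hi) for name, c in COSTS.items()}

for name in COSTS:
    print(f"{name:12s} log={log_pos[name]:.2f} linear={lin_pos[name]:.2f}")
```

On a linear axis GPT-4o lands in the leftmost sixth of the chart, crowded against the cheapest models. On the log axis it sits a little past the midpoint, so the mid-priced tier stays readable.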

Intelligence vs. Cost

[Figure: scatter plot. Y-axis: performance (MMLU %), ticks at 80%, 85%, 90%, 95%. X-axis: cost ($/1M tokens), logarithmic scale. Models plotted: o1-preview, GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.1 405B, Llama 3.1 70B, Mistral Large 2, DeepSeek V3.]

MODEL	MMLU	CODING	COST ($/1M TOKENS)	DEVELOPER
o1-preview	92.3%	90%	$15.00	OpenAI
GPT-4o	88.7%	90.2%	$2.50	OpenAI
DeepSeek V3	88.5%	91%	$0.20	DeepSeek
Claude 3.5 Sonnet	88.3%	92%	$3.00	Anthropic
Gemini 1.5 Pro	85.9%	84.1%	$3.50	Google
Llama 3.1 405B	85.2%	81.7%	$5.00	Meta
Llama 3.1 70B	82%	74%	$0.90	Meta
Mistral Large 2	81.2%	45.1%	$4.00	Mistral
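One way to read the table is to ask which models are not beaten on both price and score at once. The sketch below computes that cost/score Pareto frontier from the table's figures. The `Model` tuple and `pareto_frontier` function are hypothetical names introduced for illustration, and the result depends on which score column is chosen.

```python
from typing import NamedTuple

class Model(NamedTuple):
    name: str
    mmlu: float    # MMLU, percent
    coding: float  # coding score, percent
    cost: float    # $ per 1M tokens

# Rows copied from the table above.
MODELS = [
    Model("o1-preview", 92.3, 90.0, 15.00),
    Model("GPT-4o", 88.7, 90.2, 2.50),
    Model("DeepSeek V3", 88.5, 91.0, 0.20),
    Model("Claude 3.5 Sonnet", 88.3, 92.0, 3.00),
    Model("Gemini 1.5 Pro", 85.9, 84.1, 3.50),
    Model("Llama 3.1 405B", 85.2, 81.7, 5.00),
    Model("Llama 3.1 70B", 82.0, 74.0, 0.90),
    Model("Mistral Large 2", 81.2, 45.1, 4.00),
]

def pareto_frontier(models, metric):
    """Return names of models that no other model beats on both cost and
    the chosen metric, ordered from cheapest to most expensive."""
    frontier = []
    for m in models:
        dominated = any(
            o.cost <= m.cost
            and getattr(o, metric) >= getattr(m, metric)
            and (o.cost < m.cost or getattr(o, metric) > getattr(m, metric))
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return [m.name for m in sorted(frontier, key=lambda m: m.cost)]

mmlu_frontier = pareto_frontier(MODELS, "mmlu")
coding_frontier = pareto_frontier(MODELS, "coding")
print("MMLU frontier:  ", mmlu_frontier)
print("Coding frontier:", coding_frontier)
```

The frontier is computed per metric, so a model that is dominated on MMLU can still sit on the coding frontier. Only the two score columns and the price are considered; factors outside the table, such as latency or context length, are not.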

The Performance Leaders

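Claude 3.5 Sonnet (88.3% MMLU, 92% coding, $3.00 per 1M tokens) and GPT-4o (88.7% MMLU, 90.2% coding, $2.50) pair near-top scores with mid-range pricing. DeepSeek V3 posts comparable scores (88.5% MMLU, 91% coding) at $0.20 per 1M tokens, the lowest price in the table.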

Reasoning & Logic

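The o1 series is a premium tier for advanced reasoning tasks. o1-preview costs $15.00 per 1M tokens, five to six times the price of Claude 3.5 Sonnet or GPT-4o, and posts the highest MMLU score in the table (92.3%). The premium is aimed at complex math and multi-step reasoning; on the coding column, o1-preview (90%) is level with GPT-4o (90.2%) rather than ahead of it.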
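The size of the reasoning-tier premium can be worked out directly from the table. The following is a small sketch comparing o1-preview with GPT-4o on MMLU; the `premium` function is a hypothetical helper, and MMLU is only one of several benchmarks that could be used as the yardstick.

```python
# Figures copied from the table on this page: (MMLU %, $ per 1M tokens).
O1_PREVIEW = (92.3, 15.00)
GPT_4O = (88.7, 2.50)

def premium(top, base):
    """Return (price ratio, MMLU gain in points, extra $ per MMLU point)."""
    top_score, top_cost = top
    base_score, base_cost = base
    gain = top_score - base_score
    return top_cost / base_cost, gain, (top_cost - base_cost) / gain

ratio, gain, dollars_per_point = premium(O1_PREVIEW, GPT_4O)
print(f"{ratio:.1f}x the price for +{gain:.1f} MMLU points "
      f"(${dollars_per_point:.2f} extra per point, per 1M tokens)")
```

Whether that trade is worthwhile depends on the workload: the calculation says nothing about tasks where the gap between the two models is wider or narrower than it is on MMLU.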