LiveBench.ai Data Analysis

Top Models Performance Overview

Rank	Model Name	NormScore - Livebench	Coding	Data Analysis	IF	Language	Mathematics	Reasoning
1	o3 High	80.707	80.759	78.445	81.813	79.864	79.021	83.234
2	o3 Medium	78.933	82.091	79.655	80.048	76.571	74.950	81.201
3	o4-Mini High	78.116	84.271	81.060	80.653	67.464	78.834	78.599
4	Gemini 2.5 Pro Preview	77.443	74.948	76.161	76.509	74.841	83.062	78.191
5	Claude 3.7 Sonnet Thinking	74.983	77.127	82.618	77.146	73.864	73.645	68.034
6	o4-Mini Medium	73.752	78.136	80.213	77.659	63.538	75.241	70.038
7	Qwen 3 235B A22B	73.573	68.813	81.021	83.281	62.604	73.282	70.099
8	DeepSeek R1	72.047	79.024	81.385	76.409	57.950	72.506	68.991
9	Qwen 3 32B	71.429	67.643	79.847	80.862	58.699	70.425	69.496
10	Grok 3 Mini Beta (High)	70.778	57.270	74.695	74.716	63.603	71.374	78.496

Category Performance Comparison

Model	Coding	Data Analysis	IF	Language	Mathematics	Reasoning
o3 High	76.715	67.020	86.175	75.996	85.004	93.333
o3 Medium	77.863	68.193	84.321	73.481	80.657	91.000
o4-Mini High	79.976	68.328	84.958	66.055	84.895	88.111
Gemini 2.5 Pro Preview	71.081	62.475	80.592	69.314	89.157	87.528
Claude 3.7 Sonnet Thinking	73.194	69.107	81.254	68.269	78.999	76.167
o4-Mini Medium	74.219	68.472	81.825	62.409	81.020	78.472
Qwen 3 235B A22B	65.325	68.308	87.729	60.609	78.778	78.611
DeepSeek R1	74.985	69.625	80.508	54.771	77.910	77.167
Qwen 3 32B	64.238	68.289	85.171	55.153	75.583	77.750
Grok 3 Mini Beta (High)	54.516	64.578	78.704	59.087	77.005	87.611
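A minimal sketch of how the category table above can be read programmatically to find the top-scoring model in each category. The scores are copied from the table; the names `CATEGORIES`, `SCORES`, and `category_leaders` are illustrative assumptions and not part of LiveBench.

```python
# Column order matches the header row of the category table.
CATEGORIES = ["Coding", "Data Analysis", "IF", "Language", "Mathematics", "Reasoning"]

# Scores copied from the category table, one row per model.
SCORES = {
    "o3 High": [76.715, 67.020, 86.175, 75.996, 85.004, 93.333],
    "o3 Medium": [77.863, 68.193, 84.321, 73.481, 80.657, 91.000],
    "o4-Mini High": [79.976, 68.328, 84.958, 66.055, 84.895, 88.111],
    "Gemini 2.5 Pro Preview": [71.081, 62.475, 80.592, 69.314, 89.157, 87.528],
    "Claude 3.7 Sonnet Thinking": [73.194, 69.107, 81.254, 68.269, 78.999, 76.167],
    "o4-Mini Medium": [74.219, 68.472, 81.825, 62.409, 81.020, 78.472],
    "Qwen 3 235B A22B": [65.325, 68.308, 87.729, 60.609, 78.778, 78.611],
    "DeepSeek R1": [74.985, 69.625, 80.508, 54.771, 77.910, 77.167],
    "Qwen 3 32B": [64.238, 68.289, 85.171, 55.153, 75.583, 77.750],
    "Grok 3 Mini Beta (High)": [54.516, 64.578, 78.704, 59.087, 77.005, 87.611],
}


def category_leaders(scores, categories):
    """Return {category: (model, score)} for the highest score in each column."""
    leaders = {}
    for col, category in enumerate(categories):
        model = max(scores, key=lambda name: scores[name][col])
        leaders[category] = (model, scores[model][col])
    return leaders


if __name__ == "__main__":
    for category, (model, score) in category_leaders(SCORES, CATEGORIES).items():
        print(f"{category}: {model} ({score:.3f})")
```

Among these ten models, the per-category leader is often not the overall rank-1 model: o3 High leads only Language and Reasoning in this table.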

Category	Benchmarks
Coding	code_completion, code_generation
Data Analysis	tablejoin, tablereformat
IF	paraphrase, simplify, story_generation, summarize
Language	connections, plot_unscrambling, typos
Mathematics	AMPS_Hard, math_comp, olympiad
Reasoning	spatial, web_of_lies_v3, zebra_puzzle
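The category-to-benchmark mapping above can be kept as a simple lookup table. The sketch below copies the mapping from the table; `CATEGORY_BENCHMARKS` and `category_of` are hypothetical names chosen for illustration.

```python
# Category -> benchmark names, copied from the table above.
CATEGORY_BENCHMARKS = {
    "Coding": ["code_completion", "code_generation"],
    "Data Analysis": ["tablejoin", "tablereformat"],
    "IF": ["paraphrase", "simplify", "story_generation", "summarize"],
    "Language": ["connections", "plot_unscrambling", "typos"],
    "Mathematics": ["AMPS_Hard", "math_comp", "olympiad"],
    "Reasoning": ["spatial", "web_of_lies_v3", "zebra_puzzle"],
}


def category_of(benchmark):
    """Return the category that lists the given benchmark name."""
    for category, benchmarks in CATEGORY_BENCHMARKS.items():
        if benchmark in benchmarks:
            return category
    raise KeyError(f"unknown benchmark: {benchmark}")


if __name__ == "__main__":
    print(category_of("zebra_puzzle"))
    print(sum(len(b) for b in CATEGORY_BENCHMARKS.values()), "benchmarks in total")
```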

Select Model for Detailed Analysis:

ChatGPT-4o
Claude 3.5 Haiku
Claude 3.5 Sonnet
Claude 3.7 Sonnet
Claude 3.7 Sonnet Thinking
Command R
Command R Plus
DeepSeek R1
DeepSeek R1 Distill Llama 70B
DeepSeek R1 Distill Qwen 32B
DeepSeek V3.1
Dracarys2 72B Instruct
Dracarys2 Llama 3.1 70B Instruct
Gemini 2.0 Flash
Gemini 2.0 Flash Lite
Gemini 2.5 Flash Preview
Gemini 2.5 Pro Preview
Gemma 3 27B
GPT-4.1
GPT-4.1 Mini
GPT-4.1 Nano
GPT-4.5 Preview
GPT-4o
GPT-4o Mini
Grok 3 Beta
Grok 3 Mini Beta (High)
Hunyuan Turbos
LearnLM 1.5 Pro Experimental
LearnLM 2.0 Flash Experimental
Llama 3.3 70B Instruct Turbo
Llama 4 Maverick 17B 128E Instruct
Mistral Large
Mistral Small
Nova Lite
Nova Micro
Nova Pro
o3 High
o3 Medium
o4-Mini High
o4-Mini Medium
Qwen 3 235B A22B
Qwen 3 30B A3B
Qwen 3 32B
Qwen2.5 72B Instruct Turbo
Qwen2.5 7B Instruct Turbo
Qwen2.5 Max
QwQ 32B
Step 2 16K

NormScore - Livebench: Calculation Method and Advantages

License Information