Reasoning Benchmarks

General reasoning ability, instruction following, and language understanding

Last updated: 2026-05-31
Source: LMSYS MT-bench
Scale: 1–10 (higher = better)
Models: 67
#   Model                       Provider       MT-bench score
1   GPT-4-1106-preview          OpenAI         9.32
2   GPT-4-0613                  OpenAI         9.18
3   Qwen2-72B-Instruct          Alibaba        9.12
4   GPT-4-0314                  OpenAI         8.96
5   Qwen1.5-110B-Chat           Alibaba        8.88
6   Mistral Medium              Mistral        8.61
7   Qwen1.5-72B-Chat            Alibaba        8.61
8   GPT-3.5-Turbo-0613          OpenAI         8.39
9   GPT-3.5-Turbo-1106          OpenAI         8.32
10  Mixtral-8x7B-Instruct-v0.1  Mistral        8.30
11  Qwen1.5-32B-Chat            Alibaba        8.30
12  Claude-2.1                  Anthropic      8.18
13  Starling-LM-7B-beta         Nexusflow      8.12
14  Starling-LM-7B-alpha        UC Berkeley    8.09
15  Claude-2.0                  Anthropic      8.06
16  GPT-3.5-Turbo-0314          OpenAI         7.94
17  Qwen1.5-14B-Chat            Alibaba        7.91
18  Claude-1                    Anthropic      7.90
19  Tulu-2-DPO-70B              AllenAI/UW     7.89
20  Claude-Instant-1            Anthropic      7.85
21  OpenChat-3.5                OpenChat       7.81
22  OpenChat-3.5-0106           OpenChat       7.80
23  WizardLM-70B-v1.0           Microsoft      7.71
24  Qwen1.5-7B-Chat             Alibaba        7.60
25  Mistral-7B-Instruct-v0.2    Mistral        7.60
26  SOLAR-10.7B-Instruct-v1.0   Upstage AI     7.58
27  NV-Llama2-70B-SteerLM-Chat  Nvidia         7.54
28  Zephyr-7B-beta              HuggingFace    7.34
29  WizardLM-13b-v1.2           Microsoft      7.20
30  Vicuna-33B                  LMSYS          7.12
31  WizardLM-30B                Microsoft      7.01
32  Qwen-14B-Chat               Alibaba        6.96
33  Vicuna-13B-16k              LMSYS          6.92
34  Zephyr-7B-alpha             HuggingFace    6.88
35  Llama-2-70B-chat            Meta           6.86
36  Mistral-7B-Instruct-v0.1    Mistral        6.84
37  WizardLM-13B-v1.1           Microsoft      6.76
38  Llama-2-13b-chat            Meta           6.65
39  Vicuna-13B                  LMSYS          6.57
40  Guanaco-33B                 UW             6.53
41  Tulu-30B                    AllenAI/UW     6.43
42  Guanaco-65B                 UW             6.41
43  OpenAssistant-LLaMA-30B     OpenAssistant  6.41
44  PaLM-Chat-Bison-001         Google         6.40
45  MPT-30B-chat                MosaicML       6.39
46  WizardLM-13B-v1.0           Microsoft      6.35
47  Llama-2-7B-chat             Meta           6.27
48  Vicuna-7B-16k               LMSYS          6.22
49  Vicuna-7B                   LMSYS          6.17
50  Baize-v2-13B                UCSD           5.75
51  XGen-7B-8K-Inst             Salesforce     5.55
52  Nous-Hermes-13B             NousResearch   5.51
53  MPT-7B-Chat                 MosaicML       5.42
54  GPT4All-13B-Snoozy          Nomic AI       5.41
55  Koala-13B                   UC Berkeley    5.35
56  MPT-30B-Instruct            MosaicML       5.22
57  Falcon-40B-Instruct         TII            5.17
58  ChatGLM2-6B                 Tsinghua       4.96
59  H2O-Oasst-OpenLLaMA-13B     h2oai          4.63
60  Alpaca-13B                  Stanford       4.53
61  ChatGLM-6B                  Tsinghua       4.50
62  OpenAssistant-Pythia-12B    OpenAssistant  4.32
63  RWKV-4-Raven-14B            RWKV           3.98
64  Dolly-V2-12B                Databricks     3.28
65  FastChat-T5-3B              LMSYS          3.04
66  StableLM-Tuned-Alpha-7B     Stability AI   2.75
67  LLaMA-13B                   Meta           2.61