Dynamic Evaluation Benchmark
DyVal is a new dynamic evaluation protocol for LLMs. More information can be found at DyVal: Graph-informed Dynamic Evaluation of Large Language Models.
Please contact us if you would like your model's results included on this leaderboard.
All results

Model | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path |
---|---|---|---|---|---|---|---|
Vicuna-13B v1.3 | 0.79 | - | 50.76 | 37.04 | 21.10 | 21.58 | - |
LLaMA2-13B Chat | 8.33 | - | 16.15 | 35.72 | 7.73 | 28.05 | - |
ChatGPT | 84.50 | 26.63 | 97.34 | 66.56 | 52.49 | 56.09 | 13.63 |
GPT4 | 89.88 | 45.03 | 99.33 | 93.92 | 66.33 | 79.02 | 23.36 |
View by Complexity

Complexity 1

Model | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path |
---|---|---|---|---|---|---|---|
Vicuna-13B v1.3 | 1.89 | - | 81.33 | 25.73 | 44.51 | 21.60 | - |
LLaMA2-13B Chat | 25.07 | - | 19.20 | 50.27 | 1.82 | 27.62 | - |
ChatGPT | 95.27 | 36.22 | 99.09 | 81.96 | 41.78 | 62.27 | 28.14 |
GPT4 | 99.00 | 57.05 | 100.00 | 94.45 | 89.29 | 87.22 | 31.56 |
Complexity 2

Model | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path |
---|---|---|---|---|---|---|---|
Vicuna-13B v1.3 | 0.73 | - | 55.11 | 43.87 | 22.42 | 21.84 | - |
LLaMA2-13B Chat | 4.44 | - | 14.51 | 40.38 | 16.56 | 29.25 | - |
ChatGPT | 91.60 | 29.39 | 98.33 | 64.75 | 56.62 | 54.84 | 12.95 |
GPT4 | 95.11 | 42.61 | 99.78 | 96.06 | 63.61 | 86.33 | 30.45 |
Complexity 3

Model | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path |
---|---|---|---|---|---|---|---|
Vicuna-13B v1.3 | 0.47 | - | 37.02 | 42.77 | 17.47 | 21.69 | - |
LLaMA2-13B Chat | 2.20 | - | 17.18 | 28.78 | 9.42 | 27.38 | - |
ChatGPT | 77.62 | 24.31 | 96.84 | 62.80 | 58.27 | 53.64 | 7.47 |
GPT4 | 85.95 | 43.78 | 99.00 | 94.78 | 57.67 | 71.17 | 18.33 |
Complexity 4

Model | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path |
---|---|---|---|---|---|---|---|
Vicuna-13B v1.3 | 0.09 | - | 29.58 | 36.29 | 0.00 | 21.18 | - |
LLaMA2-13B Chat | 1.60 | - | 13.71 | 23.47 | 3.13 | 27.96 | - |
ChatGPT | 71.51 | 16.60 | 95.11 | 56.73 | 53.29 | 53.62 | 5.98 |
GPT4 | 79.44 | 36.67 | 98.56 | 90.39 | 54.78 | 71.33 | 13.11 |
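
The page does not say how the "All results" figures are derived. Judging from the numbers alone, most of them look consistent with an unweighted mean of the four per-complexity scores. The sketch below checks that assumption for the Arithmetic column. The helper name `unweighted_mean` is illustrative and not part of DyVal. The match is not exact everywhere: the four ChatGPT Arithmetic scores average to about 84.00, while the listed overall value is 84.50.

```python
from statistics import mean

# Arithmetic scores per model, copied from the Complexity 1 to 4 tables.
per_complexity = {
    "Vicuna-13B v1.3": [1.89, 0.73, 0.47, 0.09],
    "LLaMA2-13B Chat": [25.07, 4.44, 2.20, 1.60],
    "ChatGPT": [95.27, 91.60, 77.62, 71.51],
    "GPT4": [99.00, 95.11, 85.95, 79.44],
}

# Arithmetic scores copied from the "All results" table.
overall = {
    "Vicuna-13B v1.3": 0.79,
    "LLaMA2-13B Chat": 8.33,
    "ChatGPT": 84.50,
    "GPT4": 89.88,
}


def unweighted_mean(scores):
    """Assumed aggregation: a plain average over complexity levels."""
    return mean(scores)


for model, scores in per_complexity.items():
    avg = unweighted_mean(scores)
    gap = abs(avg - overall[model])
    print(f"{model}: mean={avg:.3f} listed={overall[model]:.2f} gap={gap:.3f}")
```

A gap of roughly 0.005 or less is what rounding to two decimals would produce; anything larger, such as the ChatGPT row, suggests either a different aggregation or a transcription difference in the tables.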