Dynamic Evaluation Benchmark

DyVal is a new dynamic evaluation protocol for large language models (LLMs). More information can be found in the paper "DyVal: Graph-informed Dynamic Evaluation of Large Language Models".

Please contact us if you want your models' results shown on this leaderboard.

All results

| Model | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path |
|---|---|---|---|---|---|---|---|
| Vicuna-13B v1.3 | 0.79 | - | 50.76 | 37.04 | 21.10 | 21.58 | - |
| LLaMA2-13B Chat | 8.33 | - | 16.15 | 35.72 | 7.73 | 28.05 | - |
| ChatGPT | 84.50 | 26.63 | 97.34 | 66.56 | 52.49 | 56.09 | 13.63 |
| GPT4 | 89.88 | 45.03 | 99.33 | 93.92 | 66.33 | 79.02 | 23.36 |

View by Complexity

Complexity 1

| Model | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path |
|---|---|---|---|---|---|---|---|
| Vicuna-13B v1.3 | 1.89 | - | 81.33 | 25.73 | 44.51 | 21.60 | - |
| LLaMA2-13B Chat | 25.07 | - | 19.20 | 50.27 | 1.82 | 27.62 | - |
| ChatGPT | 95.27 | 36.22 | 99.09 | 81.96 | 41.78 | 62.27 | 28.14 |
| GPT4 | 99.00 | 57.05 | 100.00 | 94.45 | 89.29 | 87.22 | 31.56 |

Complexity 2

| Model | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path |
|---|---|---|---|---|---|---|---|
| Vicuna-13B v1.3 | 0.73 | - | 55.11 | 43.87 | 22.42 | 21.84 | - |
| LLaMA2-13B Chat | 4.44 | - | 14.51 | 40.38 | 16.56 | 29.25 | - |
| ChatGPT | 91.60 | 29.39 | 98.33 | 64.75 | 56.62 | 54.84 | 12.95 |
| GPT4 | 95.11 | 42.61 | 99.78 | 96.06 | 63.61 | 86.33 | 30.45 |

Complexity 3

| Model | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path |
|---|---|---|---|---|---|---|---|
| Vicuna-13B v1.3 | 0.47 | - | 37.02 | 42.77 | 17.47 | 21.69 | - |
| LLaMA2-13B Chat | 2.20 | - | 17.18 | 28.78 | 9.42 | 27.38 | - |
| ChatGPT | 77.62 | 24.31 | 96.84 | 62.80 | 58.27 | 53.64 | 7.47 |
| GPT4 | 85.95 | 43.78 | 99.00 | 94.78 | 57.67 | 71.17 | 18.33 |

Complexity 4

| Model | Arithmetic | Linear Equation | Boolean Logic | Deductive Logic | Abductive Logic | Reachability | Max Sum Path |
|---|---|---|---|---|---|---|---|
| Vicuna-13B v1.3 | 0.09 | - | 29.58 | 36.29 | 0.0 | 21.18 | - |
| LLaMA2-13B Chat | 1.60 | - | 13.71 | 23.47 | 3.13 | 27.96 | - |
| ChatGPT | 71.51 | 16.60 | 95.11 | 56.73 | 53.29 | 53.62 | 5.98 |
| GPT4 | 79.44 | 36.67 | 98.56 | 90.39 | 54.78 | 71.33 | 13.11 |
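
The page does not say how the "All results" numbers relate to the per-complexity numbers. The figures suggest that each overall score may be the unweighted mean over the four complexity levels; this is an assumption drawn from the numbers, not something the leaderboard confirms. A minimal sketch that checks this for the Boolean Logic column, using values copied from the tables above (the names `per_complexity`, `reported_overall`, `mean_score`, and `matches_reported` are illustrative only):

```python
# Boolean Logic scores copied from the leaderboard, ordered complexity 1 to 4.
per_complexity = {
    "Vicuna-13B v1.3": [81.33, 55.11, 37.02, 29.58],
    "LLaMA2-13B Chat": [19.20, 14.51, 17.18, 13.71],
    "ChatGPT": [99.09, 98.33, 96.84, 95.11],
    "GPT4": [100.00, 99.78, 99.00, 98.56],
}

# Boolean Logic scores copied from the "All results" table.
reported_overall = {
    "Vicuna-13B v1.3": 50.76,
    "LLaMA2-13B Chat": 16.15,
    "ChatGPT": 97.34,
    "GPT4": 99.33,
}


def mean_score(scores):
    """Unweighted mean of a list of scores."""
    return sum(scores) / len(scores)


def matches_reported(model, tol=0.01):
    """True if the mean over complexity levels is within tol of the reported score.

    Assumption: the overall score is a plain average over the four levels.
    """
    return abs(mean_score(per_complexity[model]) - reported_overall[model]) <= tol


for model in per_complexity:
    mean = mean_score(per_complexity[model])
    print(f"{model}: mean={mean:.3f} reported={reported_overall[model]:.2f} "
          f"match={matches_reported(model)}")
```

The tolerance of 0.01 allows for the two-decimal rounding in the published tables. The same check can be repeated for other columns by swapping in their values.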