Dynamic Evaluation Benchmark

DyVal is a new dynamic evaluation protocol for large language models (LLMs). For details, see "DyVal: Graph-informed Dynamic Evaluation of Large Language Models".

Please contact us if you would like your model's results shown on this leaderboard.

All results

Model	Arithmetic	Linear Equation	Boolean Logic	Deductive Logic	Abductive Logic	Reachability	Max Sum Path
Vicuna-13B v1.3	0.79	-	50.76	37.04	21.10	21.58	-
LLaMA2-13B Chat	8.33	-	16.15	35.72	7.73	28.05	-
ChatGPT	84.50	26.63	97.34	66.56	52.49	56.09	13.63
GPT4	89.88	45.03	99.33	93.92	66.33	79.02	23.36

Model	Arithmetic	Linear Equation	Boolean Logic	Deductive Logic	Abductive Logic	Reachability	Max Sum Path
Vicuna-13B v1.3	1.89	-	81.33	25.73	44.51	21.60	-
LLaMA2-13B Chat	25.07	-	19.20	50.27	1.82	27.62	-
ChatGPT	95.27	36.22	99.09	81.96	41.78	62.27	28.14
GPT4	99.00	57.05	100.00	94.45	89.29	87.22	31.56

Model	Arithmetic	Linear Equation	Boolean Logic	Deductive Logic	Abductive Logic	Reachability	Max Sum Path
Vicuna-13B v1.3	0.73	-	55.11	43.87	22.42	21.84	-
LLaMA2-13B Chat	4.44	-	14.51	40.38	16.56	29.25	-
ChatGPT	91.60	29.39	98.33	64.75	56.62	54.84	12.95
GPT4	95.11	42.61	99.78	96.06	63.61	86.33	30.45

Model	Arithmetic	Linear Equation	Boolean Logic	Deductive Logic	Abductive Logic	Reachability	Max Sum Path
Vicuna-13B v1.3	0.47	-	37.02	42.77	17.47	21.69	-
LLaMA2-13B Chat	2.20	-	17.18	28.78	9.42	27.38	-
ChatGPT	77.62	24.31	96.84	62.80	58.27	53.64	7.47
GPT4	85.95	43.78	99.00	94.78	57.67	71.17	18.33

Model	Arithmetic	Linear Equation	Boolean Logic	Deductive Logic	Abductive Logic	Reachability	Max Sum Path
Vicuna-13B v1.3	0.09	-	29.58	36.29	0.00	21.18	-
LLaMA2-13B Chat	1.60	-	13.71	23.47	3.13	27.96	-
ChatGPT	71.51	16.60	95.11	56.73	53.29	53.62	5.98
GPT4	79.44	36.67	98.56	90.39	54.78	71.33	13.11
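
For readers who want to work with these numbers programmatically, the following is a minimal sketch of loading one of the tab-separated tables above and finding the top-scoring model per task. It uses the values from the last table. The helper names (`parse_score`, `load_table`, `best_per_task`) are hypothetical and not part of DyVal, and treating "-" as "no result reported" is an assumption about the leaderboard's notation.

```python
import csv
import io
from typing import Dict, Optional, Tuple

# Rows copied from the last table above; columns are tab-separated.
RAW = "\n".join([
    "Model\tArithmetic\tLinear Equation\tBoolean Logic\tDeductive Logic"
    "\tAbductive Logic\tReachability\tMax Sum Path",
    "Vicuna-13B v1.3\t0.09\t-\t29.58\t36.29\t0.0\t21.18\t-",
    "LLaMA2-13B Chat\t1.60\t-\t13.71\t23.47\t3.13\t27.96\t-",
    "ChatGPT\t71.51\t16.60\t95.11\t56.73\t53.29\t53.62\t5.98",
    "GPT4\t79.44\t36.67\t98.56\t90.39\t54.78\t71.33\t13.11",
])

Scores = Dict[str, Optional[float]]


def parse_score(cell: str) -> Optional[float]:
    """Convert a table cell to a float; "-" is assumed to mean 'no result'."""
    cell = cell.strip()
    return None if cell == "-" else float(cell)


def load_table(text: str) -> Dict[str, Scores]:
    """Parse a tab-separated table into {model: {task: score or None}}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    table: Dict[str, Scores] = {}
    for row in reader:
        model = row.pop("Model")
        table[model] = {task: parse_score(value) for task, value in row.items()}
    return table


def best_per_task(table: Dict[str, Scores]) -> Dict[str, Tuple[str, float]]:
    """Return the (model, score) pair with the highest score for each task."""
    best: Dict[str, Tuple[str, float]] = {}
    for model, scores in table.items():
        for task, score in scores.items():
            if score is None:
                continue  # skip cells with no reported result
            if task not in best or score > best[task][1]:
                best[task] = (model, score)
    return best


table = load_table(RAW)
best = best_per_task(table)
for task, (model, score) in best.items():
    print(f"{task}: {model} ({score:.2f})")
```

Keeping missing cells as `None` rather than 0 avoids counting an unreported result as a failure when comparing or averaging models.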