Prompt Engineering Benchmark

The Prompt Engineering Module collects a variety of prompting methods and evaluates their performance across multiple datasets. It currently supports models including GPT-3.5-Turbo and GPT-4-1106.

Please contact us if you would like your model's results shown on this leaderboard.
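As a rough sketch of what each cell in the tables below measures, the following Python snippet scores a single prompting method on a single dataset. The `query_model` function, the dataset format, and the answer check are hypothetical stand-ins for whatever client, loaders, and metric the module actually uses.

```python
from typing import Callable

# Illustrative only: `query_model` stands in for the chat-completion
# client the module wraps; each dataset item is assumed to look like
# {"question": "...", "answer": "..."}.
def evaluate(dataset: list[dict],
             build_prompt: Callable[[str], str],
             query_model: Callable[[str], str]) -> float:
    """Return the accuracy (%) of one prompting method on one dataset."""
    correct = 0
    for item in dataset:
        reply = query_model(build_prompt(item["question"]))
        # Simplified scoring: count a hit if the gold answer string
        # appears anywhere in the model's reply.
        if item["answer"] in reply:
            correct += 1
    return 100.0 * correct / len(dataset)

# The baseline method passes the question through unchanged.
baseline = lambda question: question
```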

All Results

All scores below are accuracy (%).

GPT-3.5-Turbo

| Method | gsm8k | bigbench_date | bigbench_object_tracking | csqa |
|---|---|---|---|---|
| baseline | 47.15 | 57.99 | 39.20 | 72.48 |
| CoT | 40.33 | 49.32 | 63.20 | 67.81 |
| CoT (zero-shot) | 18.50 | 80.49 | 66.00 | 65.85 |
| expert prompting | 21.15 | 61.79 | 56.53 | 74.45 |
| emotion prompt | 57.24 | 66.12 | 29.87 | 70.68 |

GPT-4-1106

| Method | gsm8k | bigbench_date | bigbench_object_tracking | csqa |
|---|---|---|---|---|
| baseline | 92.19 | 87.80 | 96.27 | 79.69 |
| CoT | 85.89 | 92.14 | 90.26 | 85.59 |
| CoT (zero-shot) | 87.34 | 87.53 | 99.07 | 79.85 |
| expert prompting | 88.70 | 87.26 | 98.93 | 79.85 |
| emotion prompt | 90.83 | 87.80 | 95.73 | 80.34 |
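The module's exact prompt templates are not reproduced on this page; the sketch below only illustrates the general shape of each method in the tables above. The zero-shot CoT trigger is the standard "Let's think step by step.", the demonstration and expert-persona strings are hypothetical placeholders, and the emotion prompt phrase is the one quoted in the note at the end of this page.

```python
# Illustrative templates only; the module's exact wording may differ.

def cot_few_shot(question: str, demos: list[str]) -> str:
    # Few-shot CoT prepends worked examples whose answers include
    # step-by-step reasoning chains.
    return "\n\n".join(demos) + f"\n\nQ: {question}\nA:"

def cot_zero_shot(question: str) -> str:
    # Zero-shot CoT appends the standard reasoning trigger.
    return f"{question}\nLet's think step by step."

def expert_prompting(question: str, expert: str) -> str:
    # Expert prompting conditions the model on a (pre-generated)
    # expert persona before asking the question.
    return f"You are {expert}. Answer the following question.\n{question}"

def emotion_prompt(question: str) -> str:
    # Emotion prompt appends the emotional stimulus quoted in the
    # note at the end of this page.
    return f"{question} This is very important to my career."
```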

In a separate comparison, least-to-most prompting is evaluated on gsm8k and last-letter-concat:

GPT-3.5-Turbo

| Method | gsm8k | last-letter-concat |
|---|---|---|
| baseline | 47.15 | 7.2 |
| least to most | 75.28 | 79.8 |

GPT-4-1106

| Method | gsm8k | last-letter-concat |
|---|---|---|
| baseline | 92.19 | 25.2 |
| least to most | 79.38 | 96.2 |
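Least-to-most prompting first asks the model to decompose a problem into simpler sub-questions, then answers them in order, feeding each earlier answer back as context. A minimal sketch, reusing the hypothetical `query_model` client from the snippet above (the one-sub-question-per-line decomposition format is also an assumption):

```python
from typing import Callable

def least_to_most(question: str,
                  query_model: Callable[[str], str]) -> str:
    # Stage 1: ask the model to break the problem into sub-questions.
    decomposition = query_model(
        "Break the following problem into simpler sub-questions, "
        f"one per line:\n{question}"
    )
    sub_questions = [s.strip() for s in decomposition.splitlines() if s.strip()]

    # Stage 2: answer the sub-questions in order, carrying earlier
    # answers forward so each step can build on the previous ones.
    context, answer = question, ""
    for sub in sub_questions:
        answer = query_model(f"{context}\n\nQ: {sub}\nA:")
        context += f"\n\nQ: {sub}\nA: {answer}"
    return answer  # the answer to the final sub-question
```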

Note: “This is very important to my career.” is the emotional stimulus used by the emotion prompt method.