Prompt Engineering Benchmark

The Prompt Engineering Module collects a variety of prompting methods and evaluates their performance across multiple datasets. It currently supports models including GPT-3.5-Turbo and GPT-4-1106.

Please contact us if you would like your model's results shown on this leaderboard.
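As a rough sketch of what each cell in the tables below measures, the following Python snippet scores a single prompting method on a single dataset. The `query_model` function, the dataset format, and the answer check are hypothetical stand-ins for whatever client, loaders, and metric the module actually uses.

```python
from typing import Callable

# Illustrative only: `query_model` stands in for the chat-completion
# client the module wraps; each dataset item is assumed to look like
# {"question": "...", "answer": "..."}.
def evaluate(dataset: list[dict],
             build_prompt: Callable[[str], str],
             query_model: Callable[[str], str]) -> float:
    """Return the accuracy (%) of one prompting method on one dataset."""
    correct = 0
    for item in dataset:
        reply = query_model(build_prompt(item["question"]))
        # Simplified scoring: count a hit if the gold answer string
        # appears anywhere in the model's reply.
        if item["answer"] in reply:
            correct += 1
    return 100.0 * correct / len(dataset)

# The baseline method passes the question through unchanged.
baseline = lambda question: question
```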

All Results

All scores below are accuracy (%).

GPT-3.5-Turbo

| Method | gsm8k | bigbench_date | bigbench_object_tracking | csqa |
|---|---|---|---|---|
| baseline | 47.15 | 57.99 | 39.20 | 72.48 |
| CoT | 40.33 | 49.32 | 63.20 | 67.81 |
| CoT (zero-shot) | 18.50 | 80.49 | 66.00 | 65.85 |
| expert prompting | 21.15 | 61.79 | 56.53 | 74.45 |
| emotion prompt | 57.24 | 66.12 | 29.87 | 70.68 |

GPT-4-1106

| Method | gsm8k | bigbench_date | bigbench_object_tracking | csqa |
|---|---|---|---|---|
| baseline | 92.19 | 87.80 | 96.27 | 79.69 |
| CoT | 85.89 | 92.14 | 90.26 | 85.59 |
| CoT (zero-shot) | 87.34 | 87.53 | 99.07 | 79.85 |
| expert prompting | 88.70 | 87.26 | 98.93 | 79.85 |
| emotion prompt | 90.83 | 87.80 | 95.73 | 80.34 |
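The module's exact prompt templates are not reproduced on this page; the sketch below only illustrates the general shape of each method in the tables above. The zero-shot CoT trigger is the standard "Let's think step by step.", the demonstration and expert-persona strings are hypothetical placeholders, and the emotion prompt phrase is the one quoted in the note at the end of this page.

```python
# Illustrative templates only; the module's exact wording may differ.

def cot_few_shot(question: str, demos: list[str]) -> str:
    # Few-shot CoT prepends worked examples whose answers include
    # step-by-step reasoning chains.
    return "\n\n".join(demos) + f"\n\nQ: {question}\nA:"

def cot_zero_shot(question: str) -> str:
    # Zero-shot CoT appends the standard reasoning trigger.
    return f"{question}\nLet's think step by step."

def expert_prompting(question: str, expert: str) -> str:
    # Expert prompting conditions the model on a (pre-generated)
    # expert persona before asking the question.
    return f"You are {expert}. Answer the following question.\n{question}"

def emotion_prompt(question: str) -> str:
    # Emotion prompt appends the emotional stimulus quoted in the
    # note at the end of this page.
    return f"{question} This is very important to my career."
```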

In a separate comparison, least-to-most prompting is evaluated on gsm8k and last-letter-concat:

GPT-3.5-Turbo

| Method | gsm8k | last-letter-concat |
|---|---|---|
| baseline | 47.15 | 7.2 |
| least to most | 75.28 | 79.8 |

GPT-4-1106

| Method | gsm8k | last-letter-concat |
|---|---|---|
| baseline | 92.19 | 25.2 |
| least to most | 79.38 | 96.2 |
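Least-to-most prompting first asks the model to decompose a problem into simpler sub-questions, then answers them in order, feeding each earlier answer back as context. A minimal sketch, reusing the hypothetical `query_model` client from the snippet above (the one-sub-question-per-line decomposition format is also an assumption):

```python
from typing import Callable

def least_to_most(question: str,
                  query_model: Callable[[str], str]) -> str:
    # Stage 1: ask the model to break the problem into sub-questions.
    decomposition = query_model(
        "Break the following problem into simpler sub-questions, "
        f"one per line:\n{question}"
    )
    sub_questions = [s.strip() for s in decomposition.splitlines() if s.strip()]

    # Stage 2: answer the sub-questions in order, carrying earlier
    # answers forward so each step can build on the previous ones.
    context, answer = question, ""
    for sub in sub_questions:
        answer = query_model(f"{context}\n\nQ: {sub}\nA:")
        context += f"\n\nQ: {sub}\nA: {answer}"
    return answer  # the answer to the final sub-question
```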

Note: “This is very important to my career.” is the emotional stimulus used by the emotion prompt method.