promptbench Introduction

PromptBench is a unified library for evaluating and understanding large language models.

What does promptbench currently contain?

Quick model performance assessment: We provide a user-friendly interface for quickly building models, loading datasets, and evaluating model performance.
Prompt Engineering:
Evaluating adversarial prompts: promptbench integrates prompt attacks [1], so researchers can simulate black-box adversarial prompt attacks on models and evaluate how their performance holds up.
Dynamic evaluation to mitigate potential test data contamination: promptbench integrates the dynamic evaluation framework DyVal [2], which generates evaluation samples on the fly with controlled complexity.
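
To make the idea of generating samples on the fly with controlled complexity concrete, here is a minimal sketch in plain Python. It is not DyVal's algorithm or promptbench's API: the names `Sample`, `build_expression`, and `generate_samples` are hypothetical, and the sketch uses simple binary expression trees, with tree depth as the only complexity knob.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    question: str  # natural-language question shown to the model
    answer: int    # ground-truth answer computed while generating
    depth: int     # complexity level used to build the expression


def build_expression(depth, rng):
    """Return (text, value) for a random arithmetic expression tree of the given depth."""
    if depth == 0:
        value = rng.randint(1, 9)
        return str(value), value
    left_text, left_value = build_expression(depth - 1, rng)
    right_text, right_value = build_expression(depth - 1, rng)
    op = rng.choice(["+", "-", "*"])
    if op == "+":
        value = left_value + right_value
    elif op == "-":
        value = left_value - right_value
    else:
        value = left_value * right_value
    return f"({left_text} {op} {right_text})", value


def generate_samples(n, depth, seed=None):
    """Generate n fresh samples; a larger depth yields longer, harder expressions."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        text, value = build_expression(depth, rng)
        samples.append(Sample(question=f"What is {text}?", answer=value, depth=depth))
    return samples


if __name__ == "__main__":
    for sample in generate_samples(3, depth=2):
        print(sample.question, "->", sample.answer)
```

Because every sample is built at evaluation time and its answer is computed during construction, no fixed test set exists that could leak into training data, and raising `depth` makes the task harder in a controlled way.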

If you want to:

evaluate your model on existing benchmarks: please refer to examples/basic.ipynb for constructing your evaluation pipeline. For a multi-modal evaluation pipeline, please refer to examples/multimodal.ipynb.
test the effects of different prompting techniques:
examine robustness to prompt attacks: please refer to examples/prompt_attack.ipynb to construct the attacks.
use DyVal for evaluation: please refer to examples/dyval.ipynb to construct DyVal datasets.
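
The notebooks above remain the authoritative reference. As a rough illustration of what an evaluation pipeline and a black-box prompt attack involve, the sketch below uses only the Python standard library. Everything in it is an assumption made for illustration: `DATASET`, `toy_model`, `introduce_typo`, and `evaluate` are hypothetical stand-ins, not promptbench functions, and the "attack" is a single hand-picked character swap rather than one of the attacks from [1].

```python
# Toy labelled dataset standing in for a real benchmark (hypothetical data).
DATASET = [
    {"content": "a wonderful, heartfelt film", "label": 1},
    {"content": "dull and far too long", "label": 0},
    {"content": "great acting and a great script", "label": 1},
    {"content": "a boring mess", "label": 0},
]

# Maps the model's text output to a class id; anything else counts as wrong.
LABEL_MAP = {"positive": 1, "negative": 0}


def toy_model(input_text):
    """Hypothetical stand-in for an LLM that is brittle to prompt wording."""
    lowered = input_text.lower()
    if "sentiment" not in lowered:
        # The toy model "misses" the task when the key word is misspelled.
        return "unsure"
    positive_words = ("wonderful", "great", "heartfelt")
    return "positive" if any(w in lowered for w in positive_words) else "negative"


def introduce_typo(prompt, word):
    """Character-level perturbation: swap the first two letters of `word` in the prompt."""
    typo = word[1] + word[0] + word[2:]
    return prompt.replace(word, typo)


def evaluate(prompt_template, dataset, model):
    """Format each example into the prompt, query the model, and return accuracy."""
    correct = 0
    for example in dataset:
        input_text = prompt_template.format(content=example["content"])
        prediction = LABEL_MAP.get(model(input_text).strip().lower(), -1)
        correct += int(prediction == example["label"])
    return correct / len(dataset)


if __name__ == "__main__":
    clean_prompt = "Classify the sentiment of this review as positive or negative: {content}"
    attacked_prompt = introduce_typo(clean_prompt, "sentiment")
    print("clean accuracy:   ", evaluate(clean_prompt, DATASET, toy_model))
    print("attacked accuracy:", evaluate(attacked_prompt, DATASET, toy_model))
```

The attack only edits the prompt and observes the model's outputs, which is what makes it black-box. Comparing accuracy under the clean and perturbed prompts gives a simple robustness measure.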