metrics.eval

class promptbench.metrics.eval.Eval

Bases: object

A utility class for computing various evaluation metrics.

This class provides static methods for computing classification accuracy, SQuAD V2 F1 score, BLEU score, CIDEr score, math accuracy, and VQA accuracy.

Methods:

compute_cls_accuracy(preds, gts): Computes classification accuracy.
compute_squad_v2_f1(preds, gts, dataset): Computes the F1 score for the SQuAD V2 dataset.
compute_bleu(preds, gts): Computes the BLEU score for translation tasks.
compute_cider(preds, gts): Computes the CIDEr score for image captioning tasks.
compute_math_accuracy(preds, gts): Computes accuracy for the math dataset.
compute_vqa_accuracy(preds, gts): Computes VQA accuracy for the VQAv2 dataset.

static compute_bleu(preds, gts)

Computes the BLEU score for translation tasks.

Returns: float

The BLEU score.
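
For intuition about what a BLEU score measures, the following is a minimal standalone sketch of corpus-level BLEU: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and scaled by a brevity penalty. It is not the library's implementation. The name `corpus_bleu_sketch`, whitespace tokenization, one reference per prediction, and a 0 to 1 score range are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu_sketch(preds, gts, max_n=4):
    """Illustrative corpus BLEU (assumed: whitespace tokens, one reference each)."""
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # n-grams present in the predictions, per order
    pred_len = ref_len = 0
    for pred, gt in zip(preds, gts):
        p_tokens, r_tokens = pred.split(), gt.split()
        pred_len += len(p_tokens)
        ref_len += len(r_tokens)
        for n in range(1, max_n + 1):
            p_counts = ngram_counts(p_tokens, n)
            r_counts = ngram_counts(r_tokens, n)
            # Counter intersection clips each n-gram to its reference count.
            matches[n - 1] += sum((p_counts & r_counts).values())
            totals[n - 1] += sum(p_counts.values())
    if min(matches) == 0:
        # Without smoothing, any order with zero matches gives a score of 0.
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize predictions shorter than the references.
    brevity = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return brevity * math.exp(log_precision)


perfect = corpus_bleu_sketch(["the cat sat on the mat"], ["the cat sat on the mat"])
disjoint = corpus_bleu_sketch(["hello world"], ["goodbye moon"])
```

Because this sketch applies no smoothing, any corpus with no matching 4-gram scores 0; production BLEU implementations typically differ in tokenization and may offer smoothing.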

static compute_cider(preds, gts)

Computes the CIDEr score for image captioning tasks.

Returns: float

The CIDEr score.

static compute_cls_accuracy(preds, gts)

Computes classification accuracy based on predictions and ground truths.

Returns: float

The classification accuracy.
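
As an illustration of the quantity involved, classification accuracy is usually the fraction of predictions that exactly equal their ground-truth labels. The sketch below assumes that definition and equal-length inputs; `accuracy_sketch` is a hypothetical name and this is not the library's code.

```python
def accuracy_sketch(preds, gts):
    """Fraction of positions where the prediction equals the ground truth."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        # Assumption: report 0.0 for empty input rather than dividing by zero.
        return 0.0
    return sum(p == g for p, g in zip(preds, gts)) / len(preds)


# Three of the four predictions match their labels.
score = accuracy_sketch([1, 0, 1, 1], [1, 0, 0, 1])
```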

static compute_math_accuracy(preds, gts)

Computes accuracy for the ‘math’ dataset.

Returns: float

The math accuracy.

static compute_squad_v2_f1(preds, gts, dataset)

Computes the F1 score for the SQuAD V2 dataset.

Returns: float

The F1 score for the SQuAD V2 dataset.
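
For readers unfamiliar with the metric, SQuAD-style F1 is a token-overlap score between a predicted answer and a reference answer, with a special case for unanswerable questions. The sketch below follows the normalization commonly used for SQuAD evaluation (lowercasing, stripping punctuation and English articles). The names `normalize_answer` and `token_f1_sketch` are hypothetical, a single reference per question is assumed, and how the `dataset` argument is used by the library is not shown here.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1_sketch(pred, gt):
    """Token-level F1 between one predicted answer and one reference answer."""
    pred_tokens = normalize_answer(pred).split()
    gt_tokens = normalize_answer(gt).split()
    if not pred_tokens or not gt_tokens:
        # Unanswerable case: full credit only if both sides are empty.
        return float(pred_tokens == gt_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)


exact = token_f1_sketch("the Eiffel Tower", "Eiffel Tower")
partial = token_f1_sketch("in Paris France", "Paris")
```

A dataset-level score would typically average this per-question value over all questions.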

static compute_vqa_accuracy(preds, gts)

Computes VQA accuracy for the VQAv2 dataset.

Returns: float

The VQA accuracy.
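
For context, the commonly cited simplified form of VQA accuracy gives a prediction full credit when at least three human annotators gave the same answer, and partial credit otherwise. The sketch below assumes that formula and assumes each ground-truth entry is a list of human answers; the library's actual input format and normalization may differ, and `vqa_accuracy_sketch` is a hypothetical name.

```python
def vqa_accuracy_sketch(preds, gts):
    """Mean over questions of min(number of matching human answers / 3, 1)."""
    if not preds:
        return 0.0
    scores = []
    for pred, human_answers in zip(preds, gts):
        # Assumption: compare case-insensitively after trimming whitespace.
        matches = sum(
            pred.strip().lower() == answer.strip().lower()
            for answer in human_answers
        )
        scores.append(min(matches / 3, 1.0))
    return sum(scores) / len(scores)


# First question: 4 annotators agree with "cat", so it earns full credit.
# Second question: only 1 annotator agrees with "dog", so it earns 1/3.
example = vqa_accuracy_sketch(
    ["cat", "dog"],
    [["cat"] * 4 + ["dog"] * 6, ["dog"] + ["cat"] * 9],
)
```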