metrics.eval

class promptbench.metrics.eval.Eval

Bases: object

A utility class for computing various evaluation metrics.

This class provides static methods to compute classification accuracy, SQuAD V2 F1 score, BLEU score, CIDEr score, math accuracy, and VQA accuracy.

Methods:

compute_cls_accuracy(preds, gts)

Computes classification accuracy.

compute_squad_v2_f1(preds, gts, dataset)

Computes the F1 score for the SQuAD V2 dataset.

compute_bleu(preds, gts)

Computes the BLEU score for translation tasks.

compute_math_accuracy(preds, gts)

Computes accuracy for the math dataset.

compute_cider(preds, gts)

Computes the CIDEr score for image captioning tasks.

compute_vqa_accuracy(preds, gts)

Computes VQA accuracy for the VQAv2 dataset.

static compute_bleu(preds, gts)

Computes the BLEU score for translation tasks.

Parameters:

preds : list

A list of predictions.

gts : list

A list of ground truth translations.

Returns:

float

The BLEU score.
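
To make the metric concrete, here is a minimal pure-Python sketch of corpus-level BLEU. It is an illustration only and not necessarily how `Eval.compute_bleu` is implemented. The function name `corpus_bleu_sketch`, the whitespace tokenization, the single reference per prediction, and the 0 to 1 output scale are all assumptions; some BLEU tools report scores on a 0 to 100 scale instead.

```python
import math
from collections import Counter


def _ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu_sketch(preds, gts, max_n=4):
    """Unsmoothed corpus-level BLEU with one reference per prediction (hypothetical helper)."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # predicted n-grams per order
    pred_len = ref_len = 0

    for pred, gt in zip(preds, gts):
        pred_tokens, ref_tokens = pred.split(), gt.split()
        pred_len += len(pred_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            pred_counts = _ngram_counts(pred_tokens, n)
            ref_counts = _ngram_counts(ref_tokens, n)
            # Counter intersection clips each match at the reference count.
            matches[n - 1] += sum((pred_counts & ref_counts).values())
            totals[n - 1] += sum(pred_counts.values())

    # Without smoothing, a zero precision at any order gives a score of zero.
    if pred_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize predictions shorter than the references.
    brevity = 1.0 if pred_len >= ref_len else math.exp(1.0 - ref_len / pred_len)
    return brevity * math.exp(log_precision)
```

A prediction that is a correct but truncated prefix of its reference keeps perfect n-gram precision and is scored down only by the brevity penalty.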

static compute_cider(preds, gts)

Computes the CIDEr score for image captioning tasks.

Parameters:

preds : list

A list of predictions.

gts : list

A list of ground truth captions.

Returns:

float

The CIDEr score.
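
The idea behind CIDEr can be sketched as TF-IDF weighted n-gram vectors compared by cosine similarity. The code below is a simplified CIDEr-style score, not the CIDEr-D variant and not necessarily what `Eval.compute_cider` computes. The name `cider_sketch`, the whitespace tokenization, and the assumption that each entry of `gts` is a list of reference captions for one image are all illustrative choices.

```python
import math
from collections import Counter


def _ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cider_sketch(preds, gts, max_n=4):
    """Simplified CIDEr-style score; gts[i] is assumed to be a list of captions."""
    num_images = len(gts)
    per_image = [0.0] * num_images

    for n in range(1, max_n + 1):
        # Document frequency: number of images whose references contain the n-gram.
        doc_freq = Counter()
        for refs in gts:
            seen = set()
            for ref in refs:
                seen.update(_ngram_counts(ref.split(), n))
            doc_freq.update(seen)

        def tfidf(counts):
            # n-grams unseen in any reference are treated as having frequency 1.
            return {g: c * math.log(num_images / max(doc_freq[g], 1))
                    for g, c in counts.items()}

        for i, (pred, refs) in enumerate(zip(preds, gts)):
            pred_vec = tfidf(_ngram_counts(pred.split(), n))
            sims = [_cosine(pred_vec, tfidf(_ngram_counts(ref.split(), n)))
                    for ref in refs]
            per_image[i] += sum(sims) / len(sims)

    # Average over n-gram orders and images, then apply the customary factor of 10.
    return 10.0 * sum(per_image) / (max_n * num_images)
```

Because the IDF term is `log(num_images / doc_freq)`, the score degenerates to zero when only one image is evaluated, so the sketch is meaningful only for a corpus of several images.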

static compute_cls_accuracy(preds, gts)

Computes classification accuracy based on predictions and ground truths.

Parameters:

preds : list

A list of predictions.

gts : list

A list of ground truths.

Returns:

float

The classification accuracy.
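
Classification accuracy is the fraction of predictions that exactly match their ground truth. The following is a minimal sketch of that definition; the function name `cls_accuracy_sketch` and the input validation are assumptions, and the library method may handle edge cases differently.

```python
def cls_accuracy_sketch(preds, gts):
    """Fraction of positions where the prediction equals the ground truth."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        raise ValueError("cannot compute accuracy on empty inputs")
    return sum(p == g for p, g in zip(preds, gts)) / len(preds)


# Two of the four predictions match their labels.
accuracy = cls_accuracy_sketch([1, 0, 1, 1], [1, 1, 1, 0])
```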

static compute_math_accuracy(preds, gts)

Computes accuracy for the ‘math’ dataset.

Parameters:

preds : list

A list of predictions.

gts : list

A list of ground truths.

Returns:

float

The math accuracy.
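
Accuracy on math-style answers usually needs some normalization before comparison, because model outputs and labels may differ in type or casing. The sketch below shows one possible approach; the name `math_accuracy_sketch` and the specific normalization (stringify, strip, lowercase, map yes/no to true/false) are assumptions and are not taken from this documentation.

```python
def math_accuracy_sketch(preds, gts):
    """Exact-match accuracy after a simple, hypothetical normalization step."""

    def normalize(value):
        text = str(value).strip().lower()
        # Assumed convention: treat yes/no answers as booleans.
        return {"yes": "true", "no": "false"}.get(text, text)

    if len(preds) != len(gts) or not gts:
        raise ValueError("preds and gts must be non-empty and of equal length")
    return sum(normalize(p) == normalize(g) for p, g in zip(preds, gts)) / len(gts)


# "Yes" matches True and "12" matches 12; "no" does not match True.
accuracy = math_accuracy_sketch(["Yes", "12", "no"], [True, 12, True])
```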

static compute_squad_v2_f1(preds, gts, dataset)

Computes the F1 score for the SQuAD V2 dataset.

Parameters:

preds : list

A list of predictions.

gts : list

A list of ground truth IDs.

dataset : list

The dataset containing the SQuAD V2 data.

Returns:

float

The F1 score for the SQuAD V2 dataset.
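
Since `gts` holds question IDs rather than answer strings, the gold answers have to be looked up in `dataset`. The sketch below illustrates that lookup together with the usual SQuAD-style token F1 (lowercasing, stripping punctuation and articles, taking the best score over all gold answers). The record layout (`"id"` and `"answers"["text"]`), the use of an empty string for "no answer", the 0 to 100 scale, and the name `squad_v2_f1_sketch` are all assumptions, not details confirmed by this documentation.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(pred, gold):
    """Token-level F1 between one prediction and one gold answer."""
    pred_tokens = normalize_answer(pred).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Unanswerable case: full credit only if both sides are empty.
        return float(pred_tokens == gold_tokens)
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def squad_v2_f1_sketch(preds, gt_ids, dataset):
    """Average best-match F1 (0 to 100), resolving IDs via an assumed record layout."""
    answers_by_id = {ex["id"]: ex["answers"]["text"] for ex in dataset}
    scores = []
    for pred, qid in zip(preds, gt_ids):
        golds = answers_by_id[qid] or [""]  # no gold answers means unanswerable
        scores.append(max(token_f1(pred, gold) for gold in golds))
    return 100.0 * sum(scores) / len(scores)
```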

static compute_vqa_accuracy(preds, gts)

Computes VQA accuracy for the VQAv2 dataset.

Parameters:

preds : list

A list of predictions.

gts : list

A list of answers.

Returns:

float

The VQA accuracy.
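
VQAv2 questions come with several human answers, and a prediction is commonly given full credit when at least three annotators agree with it. The sketch below implements that widely used simplified rule, `min(matches / 3, 1)`. It assumes each entry of `gts` is a list of human answer strings, and it only strips and lowercases text; the official evaluation applies further answer normalization and averages over annotator subsets. The name `vqa_accuracy_sketch` is hypothetical, and `Eval.compute_vqa_accuracy` may differ.

```python
def vqa_accuracy_sketch(preds, gts):
    """Simplified VQA accuracy; gts[i] is assumed to be a list of human answers."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and of equal length")
    scores = []
    for pred, answers in zip(preds, gts):
        normalized = pred.strip().lower()
        matches = sum(normalized == answer.strip().lower() for answer in answers)
        # Full credit once three annotators agree with the prediction.
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)


# First prediction matches 8 annotators (score 1.0), second matches 1 (score 1/3).
accuracy = vqa_accuracy_sketch(
    ["yes", "2"],
    [["yes"] * 8 + ["no"] * 2, ["2", "two"] + ["3"] * 8],
)
```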