dataload.dataset¶

class promptbench.dataload.dataset.AI2D¶

Bases: Dataset

AI2D is a dataset class for the AI2D dataset. This dataset is loaded from huggingface datasets: ai2d (test set).

Reference: https://huggingface.co/datasets/lmms-lab/ai2d A Diagram Is Worth A Dozen Images (https://arxiv.org/abs/1603.07396)

Example data format: {

‘question’: ‘which of these define dairy item’, ‘options’: [‘c’, ‘D’, ‘b’, ‘a’], ‘answer’: ‘1’, ‘image’: <PIL.PngImagePlugin.PngImageFile image mode=RGB size=600x449>

}

class promptbench.dataload.dataset.ARC(name)¶

Bases: Dataset

ARC is a dataset class for the AI2 Reasoning Challenge dataset. This dataset is loaded from huggingface datasets: arc (test set).

Reference: https://huggingface.co/datasets/ai2_arc AI2 Reasoning Challenge (ARC) (https://arxiv.org/abs/1803.05457)

Example data format: {

‘id’: ‘Mercury_7175875’, ‘question’: ‘An astronomer observes that a planet rotates faster after a meteorite impact. Which is the most likely effect of this increase in rotation?’, ‘choices’: {‘text’: [‘Planetary density will decrease.’, ‘Planetary years will become longer.’, ‘Planetary days will become shorter.’, ‘Planetary gravity will become stronger.’], ‘label’: [‘A’, ‘B’, ‘C’, ‘D’]}, ‘answerKey’: ‘C’

}
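As an illustration of the record layout above, the following minimal sketch renders a question with its labelled choices and looks up the gold answer text. The helper names `format_multiple_choice` and `gold_choice_text` are hypothetical and are not part of promptbench.

```python
def format_multiple_choice(example):
    """Render the question followed by one 'LABEL. text' line per choice."""
    choices = example["choices"]
    lines = [example["question"]]
    for label, text in zip(choices["label"], choices["text"]):
        lines.append(f"{label}. {text}")
    return "\n".join(lines)


def gold_choice_text(example):
    """Return the choice text whose label matches 'answerKey'."""
    choices = example["choices"]
    by_label = dict(zip(choices["label"], choices["text"]))
    return by_label[example["answerKey"]]


# Record copied from the example data format above.
example = {
    "id": "Mercury_7175875",
    "question": (
        "An astronomer observes that a planet rotates faster after a "
        "meteorite impact. Which is the most likely effect of this "
        "increase in rotation?"
    ),
    "choices": {
        "text": [
            "Planetary density will decrease.",
            "Planetary years will become longer.",
            "Planetary days will become shorter.",
            "Planetary gravity will become stronger.",
        ],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "C",
}

prompt = format_multiple_choice(example)
gold = gold_choice_text(example)
```

The same layout (parallel `label` and `text` lists plus an `answerKey`) appears in the CSQA and QASC examples, so the sketch applies to those records as well.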

class promptbench.dataload.dataset.BBH¶

Bases: Dataset

BBH is a dataset class for the BigBench Hard dataset. This dataset is loaded from huggingface datasets: lukaemon/bbh (test set).

Reference: https://huggingface.co/datasets/lukaemon/bbh https://github.com/suzgunmirac/BIG-Bench-Hard

Example data format: {‘input’: ‘not ( True ) and ( True ) is’, ‘target’: ‘False’, ‘task’: ‘boolean_expressions’}

class promptbench.dataload.dataset.BigBench(dataset_name)¶

Bases: Dataset

BigBench is a dataset class that loads questions and answers from two BigBench sub-datasets: date understanding and object tracking. The dataset is loaded from a JSON file containing questions and answers.

date: https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/date_understanding/task.json

object_tracking: https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/logical_deduction/three_objects/task.json

Example data format: [

{
“input”: “On a shelf, there are three books: a black book, an orange book, and a blue book. The blue book is to the right of the orange book. The orange book is to the right of the black book.”, “target_scores”: {

“The black book is the leftmost.”: 1, “The orange book is the leftmost.”: 0, “The blue book is the leftmost.”: 0

}

}, …]
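In the record above, `target_scores` maps each candidate statement to 1 (correct) or 0 (incorrect). A minimal sketch of reading the correct statement back out of such a record follows; `correct_targets` is a hypothetical helper, not a promptbench function.

```python
def correct_targets(example):
    """Return every candidate statement whose score is 1."""
    return [
        statement
        for statement, score in example["target_scores"].items()
        if score == 1
    ]


# Record copied from the example data format above.
example = {
    "input": (
        "On a shelf, there are three books: a black book, an orange book, "
        "and a blue book. The blue book is to the right of the orange book. "
        "The orange book is to the right of the black book."
    ),
    "target_scores": {
        "The black book is the leftmost.": 1,
        "The orange book is the leftmost.": 0,
        "The blue book is the leftmost.": 0,
    },
}

answers = correct_targets(example)
```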

extract_answer(output)¶

class promptbench.dataload.dataset.BoolLogic¶

Bases: Dataset

BoolLogic is a dataset class (obtained from BigBench) representing boolean logic expressions. Each entry in the dataset consists of a boolean logic expression and its corresponding label (either ‘true’ or ‘false’). The dataset is read from a JSON file containing questions and answers.

https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/boolean_expressions

Example data format: [{‘content’: ‘not ( not False and True ) or not True is ‘, ‘label’: ‘false’}, …]
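To make the relation between `content` and `label` concrete, the sketch below evaluates an expression of this form using only `not`, `and`, `or`, `True`, and `False`. It assumes the trailing word "is" is a prompt suffix that can be dropped; `evaluate_bool_expression` is a hypothetical helper and not how promptbench produces or checks labels.

```python
import ast


def _eval_node(node):
    """Evaluate a parsed node restricted to boolean constants and operators."""
    if isinstance(node, ast.Constant) and isinstance(node.value, bool):
        return node.value
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        return not _eval_node(node.operand)
    if isinstance(node, ast.BoolOp):
        values = [_eval_node(value) for value in node.values]
        return all(values) if isinstance(node.op, ast.And) else any(values)
    raise ValueError("unsupported expression element")


def evaluate_bool_expression(content):
    """Drop the trailing 'is' (assumed prompt suffix) and evaluate the rest."""
    expression = content.strip()
    if expression.endswith("is"):
        expression = expression[:-2].strip()
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)


example = {"content": "not ( not False and True ) or not True is ", "label": "false"}
predicted_label = str(evaluate_bool_expression(example["content"])).lower()
```

Walking the syntax tree instead of calling `eval` keeps the sketch from executing anything other than boolean operators.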

class promptbench.dataload.dataset.CSQA¶

Bases: Dataset

CSQA is a dataset class that loads questions and answers from the CommonsenseQA dataset, a challenging commonsense question-answering dataset. It comprises 12,247 questions with 5 multiple-choice answers each. The CSQA dataset is now loaded from Hugging Face datasets: commonsense_qa (validation set).

Reference: https://huggingface.co/datasets/commonsense_qa/viewer/default/validation

Example data format: {‘id’: “1afa02df02c908a558b4036e80242fac”, ‘question’: “A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?”, ‘question_concept’: “revolving door”, ‘choices’: { “label”: [ “A”, “B”, “C”, “D”, “E” ], “text”: [ “bank”, “library”, “department store”, “mall”, “new york” ] }, ‘answerKey’: “A” }

extract_answer(output)¶

class promptbench.dataload.dataset.ChartQA¶

Bases: Dataset

ChartQA is a dataset class for the ChartQA dataset. This dataset is loaded from huggingface datasets: chart_qa (test set).

Reference: https://huggingface.co/datasets/lmms-lab/ChartQA ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning (https://arxiv.org/abs/2203.10244)

Example data format: {

‘type’: ‘human_test’, ‘question’: ‘How many food item is shown in the bar graph?’, ‘answer’: ‘14’, ‘image’: <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=850x600>

}

class promptbench.dataload.dataset.DROP¶

Bases: Dataset

DROP is a dataset class for the DROP dataset. This dataset is loaded from huggingface datasets: drop (validation set).

Reference: https://huggingface.co/datasets/drop DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs (https://arxiv.org/abs/1903.00161)

Example data format: {

‘section_id’: ‘nfl_1184’, ‘query_id’: ‘f37e81fa-ef7b-4583-b671-762fc433faa9’, ‘passage’: ” Hoping to rebound from their loss to the Patriots, the Raiders stayed at home for a Week 16 duel with the Houston Texans. Oakland would get the early lead in the first quarter as quarterback JaMarcus Russell completed a 20-yard touchdown pass to rookie wide receiver Chaz Schilens. The Texans would respond with fullback Vonta Leach getting a 1-yard touchdown run, yet the Raiders would answer with kicker Sebastian Janikowski getting a 33-yard and a 30-yard field goal. Houston would tie the game in the second quarter with kicker Kris Brown getting a 53-yard and a 24-yard field goal. Oakland would take the lead in the third quarter with wide receiver Johnnie Lee Higgins catching a 29-yard touchdown pass from Russell, followed up by an 80-yard punt return for a touchdown. The Texans tried to rally in the fourth quarter as Brown nailed a 40-yard field goal, yet the Raiders’ defense would shut down any possible attempt.”, ‘question’: ‘Who scored the first touchdown of the game?’, ‘answers_spans’: {‘spans’: [‘Chaz Schilens’, ‘JaMarcus Russell’], ‘types’: [‘span’, ‘span’]}

}

class promptbench.dataload.dataset.Dataset(dataset_name)¶

Bases: object

extract_answer(output)¶

class promptbench.dataload.dataset.GLUE(task)¶

Bases: Dataset

GLUE class is a dataset class for the General Language Understanding Evaluation benchmark, supporting multiple natural language understanding tasks.

Examples: [{‘content’: “it ‘s a charming and often affecting journey . “, ‘label’: 1}, {‘content’: ‘unflinchingly bleak and desperate ‘, ‘label’: 0}, …]

class promptbench.dataload.dataset.GSM8K¶

Bases: Dataset

GSM8K is a dataset class that loads mathematical questions and answers from GSM8K, a collection of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset is loaded from Hugging Face datasets: gsm8k (main configuration, test set).

Reference: https://huggingface.co/datasets/gsm8k/viewer/main/test

Example data format: [{‘question’: “A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?”, ‘answer’: “It takes 2/2=<<2/2=1>>1 bolt of white fiber So the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric #### 3”}, …]
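In the example above, the reference `answer` ends with `#### ` followed by the final result, and intermediate calculations are wrapped in `<<...>>`. The sketch below separates those parts; both helper names are hypothetical, and this is not the class's own `extract_answer` method, whose behavior is not documented here.

```python
import re


def final_answer(answer_text):
    """Return the text after the last '####' marker, stripped of whitespace."""
    return answer_text.split("####")[-1].strip()


def strip_calculations(answer_text):
    """Remove the '<<...>>' calculation annotations from the rationale."""
    return re.sub(r"<<.*?>>", "", answer_text)


# Answer string copied from the example data format above.
answer = (
    "It takes 2/2=<<2/2=1>>1 bolt of white fiber So the total amount of "
    "fabric is 2+1=<<2+1=3>>3 bolts of fabric #### 3"
)

gold = final_answer(answer)
rationale = strip_calculations(answer)
```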

extract_answer(output)¶

class promptbench.dataload.dataset.IWSLT(supported_languages)¶

Bases: Dataset

IWSLT is a dataset class for the International Workshop on Spoken Language Translation. It includes pairs of sentences in different languages, intended for translation tasks. The dataset is loaded from Hugging Face datasets and supports multiple language pairs.

Reference: https://huggingface.co/datasets/iwslt

Example data format: [{‘source’: ‘قبل عدة سنوات، هنا في تيد، قدّم بيتر سكيلمان منافسة تصميم تسمى منافسة حلوى المارش مالو.’, ‘target’: ‘Several years ago here at TED, Peter Skillman introduced a design challenge called the marshmallow challenge.’, ‘soruce_lang’: ‘Arabic’, ‘target_lang’: ‘English’}, …]

class promptbench.dataload.dataset.LastLetterConcat¶

Bases: Dataset

LastLetterConcat is a dataset class that loads questions and answers from the Last Letter Concat dataset. The dataset is initialized by reading from a JSON file, with each entry containing a question and an answer.

Reference: https://arxiv.org/pdf/2201.11903.pdf (page 8)

Example data format: [{“question”: “Take the last letters of each words in “Whitney Erika Tj Benito” and concatenate them.”, “answer”: “yajo”}, …]
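The task itself is simple to state in code, which helps when checking records by hand. The sketch below recomputes the answer from the quoted words in a question; `last_letter_concat` is a hypothetical helper and assumes the words are enclosed in straight double quotes.

```python
import re


def last_letter_concat(question):
    """Concatenate the last letter of each word inside the quoted span."""
    match = re.search(r'"([^"]+)"', question)
    if match is None:
        raise ValueError("no quoted word list found in question")
    return "".join(word[-1] for word in match.group(1).split())


# Record copied from the example data format above.
example = {
    "question": (
        'Take the last letters of each words in "Whitney Erika Tj Benito" '
        "and concatenate them."
    ),
    "answer": "yajo",
}

computed = last_letter_concat(example["question"])
```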

class promptbench.dataload.dataset.MMLU¶

Bases: Dataset

MMLU is a dataset class for the Massive Multitask Language Understanding benchmark, covering various educational and professional fields. The MMLU dataset is loaded from Hugging Face datasets: lukaemon/mmlu (test set).

Reference: https://huggingface.co/datasets/lukaemon/mmlu/viewer/abstract_algebra/test

Example data format: [{‘input’: “This question refers to the following information.

Read the the following quotation to answer questions. The various modes of worship which prevailed in the Roman world were all considered by the people as equally true; by the philosopher as equally false; and by the magistrate as equally useful. Edward Gibbon, The Decline and Fall of the Roman Empire, 1776–1788 Gibbon’s interpretation of the state of religious worship in ancient Rome could be summarized as”, ‘A’: “In ancient Rome, religious worship was decentralized and tended to vary with one’s social position.”, ‘B’: ‘In ancient Rome, religious worship was the source of much social tension and turmoil.’, ‘C’: ‘In ancient Rome, religious worship was homogeneous and highly centralized.’, ‘D’: ‘In ancient Rome, religious worship was revolutionized by the introduction of Christianity.’, ‘target’: ‘A’, ‘task’: ‘high_school_european_history’}, …]

class promptbench.dataload.dataset.MMMU¶

Bases: Dataset

MMMU is a dataset class for the MMMU dataset. This dataset is loaded from Hugging Face datasets: lmms-lab/MMMU (validation set).

Reference: https://huggingface.co/datasets/lmms-lab/MMMU MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (https://arxiv.org/abs/2311.16502)

Example data format: {

‘id’: ‘validation_Accounting_1’, ‘question’: ‘<image 1> Baxter Company has a relevant range of production between 15,000 and 30,000 units. The following cost data represents average variable costs per unit for 25,000 units of production. If 30,000 units are produced, what are the per unit manufacturing overhead costs incurred?’, ‘options’: “[‘$6’, ‘$7’, ‘$8’, ‘$9’]”, ‘explanation’: ‘’, ‘image_1’: <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=733x237>, ‘image_2’: None, ‘image_3’: None, ‘image_4’: None, ‘image_5’: None, ‘image_6’: None, ‘image_7’: None, ‘img_type’: “[‘Tables’]”, ‘answer’: ‘B’, ‘topic_difficulty’: ‘Medium’, ‘question_type’: ‘multiple-choice’, ‘subfield’: ‘Managerial Accounting’

}
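Note that in the record above `options` is a string holding a Python-style list rather than a list itself, and `answer` is a letter. A minimal sketch of turning the two into the selected option text follows; `resolve_option` is a hypothetical helper, and the letter-to-index mapping (A is the first option) is an assumption.

```python
import ast


def resolve_option(example):
    """Parse the stringified option list and return the option for 'answer'."""
    options = ast.literal_eval(example["options"])
    index = ord(example["answer"]) - ord("A")  # assumes 'A' maps to index 0
    return options[index]


# Fields copied from the example record above (other fields omitted).
example = {
    "id": "validation_Accounting_1",
    "options": "['$6', '$7', '$8', '$9']",
    "answer": "B",
    "question_type": "multiple-choice",
}

selected = resolve_option(example)
```

`ast.literal_eval` only accepts Python literals, so it is a safer way to parse the field than `eval`.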

class promptbench.dataload.dataset.Math(task)¶

Bases: Dataset

Math is a dataset class that loads mathematical questions and answers from the Hugging Face datasets (math_dataset test set). This dataset covers various types of math questions, such as algebra, calculus, and arithmetic. It is initialized with a specific type of math question.

Reference: https://huggingface.co/datasets/math_dataset/

Example data format: [{‘question’: “Solve -282*d + 929 - 178 = -1223 for d.\n’”, ‘answer’: “b’7\n’”, ‘task’: ‘algebra__linear_1d’}, …]

class promptbench.dataload.dataset.MathVista¶

Bases: Dataset

MathVista is a dataset class for the MathVista dataset. This dataset is loaded from huggingface datasets: math_vista (testmini set).

Reference: https://huggingface.co/datasets/AI4Math/MathVista MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts (https://arxiv.org/abs/2310.02255)

Example data format: {

‘pid’: ‘1’, ‘question’: “When a spring does work on an object, we cannot find the work by simply multiplying the spring force by the object’s displacement. The reason is that there is no one value for the force-it changes. However, we can split the displacement up into an infinite number of tiny parts and then approximate the force in each as being constant. Integration sums the work done in all those parts. Here we use the generic result of the integration.

In Figure, a cumin canister of mass $m=0.40 \mathrm{~kg}$ slides across a horizontal frictionless counter with speed $v=0.50 \mathrm{~m} / \mathrm{s}$. It then runs into and compresses a spring of spring constant $k=750 \mathrm{~N} / \mathrm{m}$. When the canister is momentarily stopped by the spring, by what distance $d$ is the spring compressed?”, ‘image’: ‘images/1.jpg’, ‘decoded_image’: <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=1514x720>, ‘choices’: None, ‘unit’: None, ‘precision’: 1.0, ‘answer’: ‘1.2’, ‘question_type’: ‘free_form’, ‘answer_type’: ‘float’, ‘metadata’: {‘category’: ‘math-targeted-vqa’, ‘context’: ‘scientific figure’, ‘grade’: ‘college’, ‘img_height’: 720, ‘img_width’: 1514, ‘language’: ‘english’, ‘skills’: [‘scientific reasoning’], ‘source’: ‘SciBench’, ‘split’: ‘testmini’, ‘task’: ‘textbook question answering’}, ‘query’: “Hint: Please answer the question requiring a floating-point number with one decimal place and provide the final value, e.g., 1.2, 1.3, 1.4, at the end.

Question: When a spring does work on an object, we cannot find the work by simply multiplying the spring force by the object’s displacement. The reason is that there is no one value for the force-it changes. However, we can split the displacement up into an infinite number of tiny parts and then approximate the force in each as being constant. Integration sums the work done in all those parts. Here we use the generic result of the integration.

In Figure, a cumin canister of mass $m=0.40 \mathrm{~kg}$ slides across a horizontal frictionless counter with speed $v=0.50 \mathrm{~m} / \mathrm{s}$. It then runs into and compresses a spring of spring constant $k=750 \mathrm{~N} / \mathrm{m}$. When the canister is momentarily stopped by the spring, by what distance $d$ is the spring compressed?”

}

class promptbench.dataload.dataset.NoCaps¶

Bases: Dataset

NoCaps is a dataset class for the NoCaps dataset. This dataset is loaded from huggingface datasets: nocaps (validation set).

Reference: https://huggingface.co/datasets/HuggingFaceM4/NoCaps nocaps: novel object captioning at scale (https://arxiv.org/abs/1812.08658)

Example data format: {

‘image’: <PIL.JpegImagePlugin.JpegImageFile image mode=L size=732x1024>, ‘image_coco_url’: ‘https://s3.amazonaws.com/nocaps/val/0013ea2087020901.jpg’, ‘image_date_captured’: ‘2018-11-06 11:04:33’, ‘image_file_name’: ‘0013ea2087020901.jpg’, ‘image_height’: 1024, ‘image_width’: 732, ‘image_id’: 0, ‘image_license’: 0, ‘image_open_images_id’: ‘0013ea2087020901’, ‘annotations_ids’: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ‘annotations_captions’: [‘A baby is standing in front of a house.’, ‘A little girl in a white jacket and sandals.’, ‘A young child stands in front of a house.’, ‘A child is wearing a white shirt and standing on a side walk. ‘, ‘A little boy is standing in his diaper with a white shirt on.’, ‘A child wearing a diaper and shoes stands on the sidewalk.’, ‘A child is wearing a light-colored shirt during the daytime.’, ‘A little kid standing on the pavement in a shirt. ‘, ‘Black and white photo of a little girl smiling.’, ‘a cute baby is standing alone with white shirt’]

}

class promptbench.dataload.dataset.NumerSense¶

Bases: Dataset

NumerSense is a dataset class that loads questions and answers from the NumerSense dataset, a numerical commonsense reasoning probing task. The dataset is initialized by reading from a JSON file, with each entry containing a query and an answer (an English number word).

https://github.com/INK-USC/NumerSense/blob/main/data/validation.masked.tsv

Example data format: [{“query”: “you may take the subway back and forth to work <mask> days a week.”, “answer”: “five”}, …]

class promptbench.dataload.dataset.QASC¶

Bases: Dataset

QASC is a dataset class that loads questions and answers from the QASC (Question Answering via Sentence Composition) dataset. It consists of 9,980 8-way multiple-choice questions about grade school science (8,134 train, 926 dev, 920 test), and comes with a corpus of 17M sentences. The QASC dataset is now loaded from Hugging Face datasets: qasc (validation set).

Reference: https://huggingface.co/datasets/qasc/viewer/default/validation

Example data format: {‘id’: “3NGI5ARFTT4HNGVWXAMLNBMFA0U1PG”, ‘question’: “Climate is generally described in terms of what?”, ‘choices’: { “text”: [ “sand”, “occurs over a wide range”, “forests”, “Global warming”, “rapid changes occur”, “local weather conditions”, “measure of motion”, “city life” ], “label”: [ “A”, “B”, “C”, “D”, “E”, “F”, “G”, “H” ] }, ‘answerKey’: “F”, ‘fact1’: “Climate is generally described in terms of temperature and moisture.”, ‘fact2’: “Fire behavior is driven by local weather conditions such as winds, temperature and moisture.”, ‘combinedfact’: “Climate is generally described in terms of local weather conditions”, ‘formatted_question’: “Climate is generally described in terms of what? (A) sand (B) occurs over a wide range (C) forests (D) Global warming (E) rapid changes occur (F) local weather conditions (G) measure of motion (H) city life”}

extract_answer(output)¶

class promptbench.dataload.dataset.SQUAD_V2¶

Bases: Dataset

SQUAD_V2 is a dataset class for the Stanford Question Answering Dataset (SQuAD) version 2, which involves question-answering tasks. SQUAD_V2 dataset is loaded from huggingface datasets. Reference: https://huggingface.co/datasets/squad_v2

Example data format: [{‘id’: ‘56ddde6b9a695914005b9628’, ‘title’: ‘Normans’, ‘context’: ‘The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (“Norman” comes from “Norseman”) raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.’, ‘question’: ‘In what country is Normandy located?’, ‘answers’: {‘text’: [‘France’, ‘France’, ‘France’, ‘France’], ‘answer_start’: [159, 159, 159, 159]}}, …]
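In the `answers` field, each `answer_start` value is a character offset into `context` at which the corresponding `text` begins. The sketch below checks that relationship on the first sentence of the example context; `answer_spans` is a hypothetical helper and not part of promptbench.

```python
def answer_spans(example):
    """Slice the context at each answer_start for the length of its text."""
    context = example["context"]
    answers = example["answers"]
    return [
        context[start:start + len(text)]
        for text, start in zip(answers["text"], answers["answer_start"])
    ]


# First sentence of the example context above; the remainder is omitted
# because the answer offsets fall inside this sentence.
example = {
    "context": (
        "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) "
        "were the people who in the 10th and 11th centuries gave their name "
        "to Normandy, a region in France."
    ),
    "question": "In what country is Normandy located?",
    "answers": {"text": ["France"] * 4, "answer_start": [159] * 4},
}

spans = answer_spans(example)
```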

class promptbench.dataload.dataset.ScienceQA¶

Bases: Dataset

ScienceQA is a dataset class for the ScienceQA dataset. This dataset is loaded from huggingface datasets: science_qa (validation set).

Reference: https://huggingface.co/datasets/derek-thomas/ScienceQA Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering (https://arxiv.org/abs/2209.09513)

Example data format: {

‘image’: None, ‘question’: ‘Which figure of speech is used in this text?

Sing, O goddess, the anger of Achilles son of Peleus, that brought countless ills upon the Achaeans. —Homer, The Iliad’,

‘choices’: [‘chiasmus’, ‘apostrophe’], ‘answer’: 1, ‘hint’: ‘’, ‘task’: ‘closed choice’, ‘grade’: ‘grade11’, ‘subject’: ‘language science’, ‘topic’: ‘figurative-language’, ‘category’: ‘Literary devices’, ‘skill’: ‘Classify the figure of speech: anaphora, antithesis, apostrophe, assonance, chiasmus, understatement’, ‘lecture’: ‘Figures of speech are words or phrases that use language in a nonliteral or unusual way. They can make writing more expressive.

Anaphora is the repetition of the same word or words at the beginning of several phrases or clauses. We are united. We are powerful. We are winners. Antithesis involves contrasting opposing ideas within a parallel grammatical structure. I want to help, not to hurt. Apostrophe is a direct address to an absent person or a nonhuman entity. Oh, little bird, what makes you sing so beautifully? Assonance is the repetition of a vowel sound in a series of nearby words. Try to light the fire. Chiasmus is an expression in which the second half parallels the first but reverses the order of words. Never let a fool kiss you or a kiss fool you. Understatement involves deliberately representing something as less serious or important than it really is. As you know, it can get a little cold in the Antarctic.’,

‘solution’: ‘The text uses apostrophe, a direct address to an absent person or a nonhuman entity.

O goddess is a direct address to a goddess, a nonhuman entity.’

}

class promptbench.dataload.dataset.UnMulti¶

Bases: Dataset

UnMulti is a dataset class for multilingual translation tasks. It includes translations between multiple language pairs. The dataset is partially loaded from a JSON file due to its large size.

Example data format: [{‘source’: ‘4 - العميد بحري مرتضى سفاري، قائد القوات البحرية’, ‘target’: ‘Konteradmiral Morteza Safari (Kommandeur der Marine des Korps der Iranischen Revolutionsgarden)’, ‘soruce_lang’: ‘Arabic’, ‘target_lang’: ‘German’}, …]

class promptbench.dataload.dataset.VQAv2¶

Bases: Dataset

VQAv2 is a dataset class for the Visual Question Answering v2 dataset. This dataset is loaded from huggingface datasets: vqa_v2 (validation set).

Reference: https://huggingface.co/datasets/HuggingFaceM4/VQAv2 Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering (https://arxiv.org/abs/1612.00837)

Example data format: {

‘question_type’: ‘what is’, ‘multiple_choice_answer’: ‘picnic table’, ‘answers’: [{‘answer’: ‘table’, ‘answer_confidence’: ‘yes’, ‘answer_id’: 1}, {‘answer’: ‘table’, ‘answer_confidence’: ‘yes’, ‘answer_id’: 2}, {‘answer’: ‘table’, ‘answer_confidence’: ‘yes’, ‘answer_id’: 3}, {‘answer’: ‘picnic table’, ‘answer_confidence’: ‘yes’, ‘answer_id’: 4}, {‘answer’: ‘picnic table’, ‘answer_confidence’: ‘yes’, ‘answer_id’: 5}, {‘answer’: ‘picnic table’, ‘answer_confidence’: ‘yes’, ‘answer_id’: 6}, {‘answer’: ‘picnic table’, ‘answer_confidence’: ‘yes’, ‘answer_id’: 7}, {‘answer’: ‘picnic table’, ‘answer_confidence’: ‘yes’, ‘answer_id’: 8}, {‘answer’: ‘skateboard’, ‘answer_confidence’: ‘yes’, ‘answer_id’: 9}, {‘answer’: ‘picnic table’, ‘answer_confidence’: ‘yes’, ‘answer_id’: 10}], ‘image_id’: 262148, ‘answer_type’: ‘other’, ‘question_id’: 262148002, ‘question’: ‘What is he on top of?’, ‘image’: <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x512>

}
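In the record above, `answers` holds ten annotator responses, and `multiple_choice_answer` coincides with the most frequent one. The sketch below tallies the responses to show this; `most_common_answer` is a hypothetical helper, and treating the majority response as the reference answer is an assumption drawn from this single example.

```python
from collections import Counter


def most_common_answer(example):
    """Return the response given by the largest number of annotators."""
    counts = Counter(entry["answer"] for entry in example["answers"])
    return counts.most_common(1)[0][0]


# Annotator responses in the order shown in the example record above.
responses = [
    "table", "table", "table",
    "picnic table", "picnic table", "picnic table",
    "picnic table", "picnic table",
    "skateboard",
    "picnic table",
]

example = {
    "question": "What is he on top of?",
    "multiple_choice_answer": "picnic table",
    "answers": [
        {"answer": answer, "answer_confidence": "yes", "answer_id": answer_id}
        for answer_id, answer in enumerate(responses, start=1)
    ],
}

majority = most_common_answer(example)
```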

class promptbench.dataload.dataset.ValidParentheses¶

Bases: Dataset

ValidParentheses is a dataset class (obtained from BigBench) for the task of deciding whether a string of parentheses is valid or invalid. The dataset is initialized by reading from a JSON file, with each entry containing a string of parentheses and a label (‘valid’ or ‘invalid’).

https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/cs_algorithms

Example data format: [{‘content’: ‘( ] } (’, ‘label’: ‘invalid’}, …]
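For reference, the labels can be reproduced with a standard stack-based bracket check. The sketch below is a generic implementation, not the dataset's own labelling code; `label_parentheses` is a hypothetical helper, and the set of bracket types it handles (round, square, curly) is an assumption.

```python
# Maps each closing bracket to the opening bracket it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}


def label_parentheses(content):
    """Return 'valid' if every bracket is closed in order, else 'invalid'."""
    stack = []
    for char in content:
        if char in PAIRS.values():
            stack.append(char)
        elif char in PAIRS:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != PAIRS[char]:
                return "invalid"
        # Any other character (such as a space) is ignored.
    return "valid" if not stack else "invalid"


example = {"content": "( ] } (", "label": "invalid"}
predicted = label_parentheses(example["content"])
```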

promptbench.dataload.dataset.shuffleDict(d)¶