How Many Data Points is a Prompt Worth?

- 7 mins

Authors : Teven Le Scao, Alexander M. Rush / Hugging Face

arXiv 2021
Paper : https://arxiv.org/pdf/2103.08493.pdf
Code : -


Summary

Personal Thoughts


Abstract

When fine-tuning pretrained models for classification, researchers either use a generic model head or a task-specific prompt for prediction. Proponents of prompting have argued that prompts provide a method for injecting task-specific guidance, which is beneficial in low-data regimes. We aim to quantify this benefit through rigorous testing of prompts in a fair setting: comparing prompted and head-based fine-tuning in equal conditions across many tasks and data sizes. By controlling for many sources of advantage, we find that prompting does indeed provide a benefit, and that this benefit can be quantified per task. Results show that prompting is often worth 100s of data points on average across classification tasks.
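To make the head-vs-prompt contrast concrete, here is a minimal sketch of the two fine-tuning setups in Hugging Face transformers. This is not the authors' code; the model name, the pattern ("It was ___."), and the verbalizer words are illustrative assumptions.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text = "This movie was absolutely wonderful."

# Head-based: a randomly initialized classification head sits on top of
# the encoder and must be learned from scratch during fine-tuning.
head_model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2
)
head_logits = head_model(**tokenizer(text, return_tensors="pt")).logits

# Prompt-based: the task is rewritten as masked-language modelling with a
# pattern ("It was <mask>.") and a verbalizer mapping labels to words.
mlm_model = AutoModelForMaskedLM.from_pretrained("roberta-base")
inputs = tokenizer(f"{text} It was {tokenizer.mask_token}.", return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
verbalizer_ids = [
    tokenizer.convert_tokens_to_ids(tokenizer.tokenize(w)[0])
    for w in (" terrible", " great")  # leading space: RoBERTa BPE convention
]
prompt_logits = mlm_model(**inputs).logits[0, mask_pos, verbalizer_ids]
# prompt_logits already carries task-relevant signal from pretraining,
# whereas head_logits starts out random -- the hypothesized source of
# prompting's low-data advantage.
```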

1. Introduction

3. Comparison: Heads vs Prompts


4. Experimental Setting

5. Results

6. Analysis

Impact of Pattern vs Verbalizer

Impact of Different Prompts

Metric sensitivity

7. Conclusion

We investigate prompting through a systematic study of its data advantage. Across tasks, prompting consistently yields an improvement, of varying size, throughout the training process. Analysis shows that prompting is mostly robust to pattern choice, and can even learn without an informative verbalizer. On large datasets, prompting remains worth a similar number of data points, although the corresponding gain in performance is smaller. In future work, we hope to study the mechanism and training dynamics of the prompting benefits.
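The paper quantifies this advantage as the average horizontal distance between the head and prompt learning curves, i.e. how many extra training points the head setup needs to reach each performance level. Below is a rough sketch of that computation as I read it (not the authors' code), with made-up toy numbers:

```python
import numpy as np

def average_data_advantage(sizes_head, acc_head, sizes_prompt, acc_prompt,
                           n_levels=100):
    """Average horizontal gap between two monotone learning curves.

    Inputs are parallel arrays of dataset sizes and the accuracy reached
    at each size. A positive result means the head-based model needs
    that many more data points, on average, to match the prompted one.
    """
    lo = max(acc_head.min(), acc_prompt.min())
    hi = min(acc_head.max(), acc_prompt.max())
    levels = np.linspace(lo, hi, n_levels)
    # Invert each curve: data points needed to reach a given accuracy.
    data_head = np.interp(levels, acc_head, sizes_head)
    data_prompt = np.interp(levels, acc_prompt, sizes_prompt)
    return float(np.mean(data_head - data_prompt))

# Toy illustration (made-up numbers, not results from the paper):
sizes = np.array([10, 50, 100, 500, 1000])
acc_head = np.array([0.55, 0.62, 0.70, 0.80, 0.85])
acc_prompt = np.array([0.63, 0.71, 0.76, 0.82, 0.86])
print(average_data_advantage(sizes, acc_head, sizes, acc_prompt))
```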

8. Impact statement

Significant compute resources were used to run this paper's experiments. A single experiment (defined as one model run, at one data level, on one task) was quite light-weight, usually taking a little under an hour on a single Nvidia V100. However, as we computed a little under two thousand runs, this adds up to about 1800 GPU hours, to which one must add around 400 GPU hours of prototyping and hyper-parameter searching. Those 2200 GPU hours would usually have necessitated the release of about 400 kg of CO2, about 40% of a transatlantic flight for a single passenger, in the country where we ran the experiments, although we used a carbon-neutral cloud compute provider.

The main benefit of prompting, rather than compute efficiency, is data efficiency. Although we ran all of our experiments on English, we hope that this property will be especially helpful in low-resource language applications. In a sense, a practitioner could then remedy the lack of task-specific data in their language by introducing information through a prompt. However, this comes with the inherent risk of introducing human biases into the model. Prompt completion also suffers from biases already present within the language model (Sheng et al., 2019). This could cause a prompted model to repeat those biases in classification, especially in the few-shot setting where prompting mostly relies on the pretrained model.
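For reference, the arithmetic behind those figures; the per-GPU-hour emission factor is inferred from the stated totals, not given in the paper:

```python
runs = 2000                              # "a little under two thousand runs"
hours_per_run = 0.9                      # "a little under an hour" each
experiment_hours = runs * hours_per_run  # ~1800 GPU hours
total_hours = experiment_hours + 400     # + prototyping / hyper-parameter search
implied_kg_co2_per_hour = 400 / total_hours  # implied emission factor
print(total_hours, round(implied_kg_co2_per_hour, 2))  # 2200.0 0.18
```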

Dongju Park

Research Scientist / Engineer @ NAVER CLOVA
