How Can We Know What Language Models Know?

Authors : Zhengbao Jiang, Frank F. Xu, Jun Araki, Graham Neubig / CMU, Bosch Research North America

Transactions of the Association for Computational Linguistics 2020
Paper : https://arxiv.org/pdf/1911.12543.pdf
Code : https://github.com/jzbjyb/LPAQA


Summary

Personal Opinion


Abstract

Recent work has presented intriguing results examining the knowledge contained in language models (LM) by having the LM fill in the blanks of prompts such as “Obama is a __ by profession”. These prompts are usually manually created, and quite possibly sub-optimal; another prompt such as “Obama worked as a __” may result in more accurately predicting the correct profession. Because of this, given an inappropriate prompt, we might fail to retrieve facts that the LM does know, and thus any given prompt only provides a lower bound estimate of the knowledge contained in an LM. In this paper, we attempt to more accurately estimate the knowledge contained in LMs by automatically discovering better prompts to use in this querying process. Specifically, we propose mining-based and paraphrasing-based methods to automatically generate high-quality and diverse prompts, as well as ensemble methods to combine answers from different prompts. Extensive experiments on the LAMA benchmark for extracting relational knowledge from LMs demonstrate that our methods can improve accuracy from 31.1% to 39.6%, providing a tighter lower bound on what LMs know. We have released the code and the resulting LM Prompt And Query Archive (LPAQA) at https://github.com/jzbjyb/LPAQA.

1. Introduction

2. Knowledge Retrieval from LMs

\[\hat{y} = \underset{ y'\in{V}}{\text{argmax}}P_{\text{LM}}(y'\vert x,t_{r})\]
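Here x is the subject, t_r is the prompt for relation r with the object slot blanked out, and V is the model's vocabulary. As a minimal sketch of how such a cloze-style query can be run against BERT (one of the LMs probed in the paper), the snippet below uses the HuggingFace transformers API; the prompt string and the decoded answer are illustrative and not taken from the paper's released code.

```python
# Minimal cloze-style query sketch (not from the LPAQA code base).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

# Prompt t_r for a relation such as "place of birth", with the subject x filled in
# and the object slot replaced by [MASK] (LAMA-style, single-token objects only).
prompt = "Barack Obama was born in [MASK] ."

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**inputs).logits  # shape: [1, seq_len, vocab_size]

# y_hat = argmax over the vocabulary of P_LM(y' | x, t_r) at the masked position.
y_hat = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(y_hat))  # a single vocabulary item, e.g. "Hawaii"
```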

3. Prompt Generation

3. 1. Mining-based Generation

Middle-word Prompts

Dependency-based Prompts

Because neither technique relies on manually created prompts, they can be applied flexibly to any relation for which subject-object pairs are available, and they yield a diverse set of prompts.

However, prompts mined this way may not express the relation well, in which case they add noise rather than help.
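As a concrete, simplified illustration of middle-word prompt mining, the sketch below keeps the words between a subject and an object as the template, using the paper's [X]/[Y] placeholders. The helper function and the toy sentences are hypothetical; the actual pipeline mines Wikipedia sentences at scale and additionally derives dependency-based prompts from a parser's output.

```python
# Rough middle-word mining sketch (not the paper's implementation).
from collections import Counter

def mine_middle_word_prompts(sentences, subj_obj_pairs):
    """sentences: raw strings; subj_obj_pairs: (subject, object) pairs for one relation."""
    prompts = Counter()
    for sent in sentences:
        for subj, obj in subj_obj_pairs:
            s, o = sent.find(subj), sent.find(obj)
            # Only the simple case where the subject appears before the object.
            if s != -1 and o != -1 and s + len(subj) < o:
                middle = sent[s + len(subj):o].strip()
                prompts["[X] " + middle + " [Y]"] += 1
    # Frequent templates are more trustworthy; rare ones are the likely source of
    # the noise mentioned above.
    return prompts.most_common()

# Hypothetical toy example for the "place of birth" relation:
sents = ["Barack Obama was born in Hawaii in 1961.",
         "Charles Dickens was born in Portsmouth."]
pairs = [("Barack Obama", "Hawaii"), ("Charles Dickens", "Portsmouth")]
print(mine_middle_word_prompts(sents, pairs))  # [('[X] was born in [Y]', 2)]
```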

3. 2. Paraphrasing-based Generation

4. Prompt Selection and Ensembling

4. 1. Top-1 Prompt Selection

4. 2. Rank-based Ensemble

4. 3. Optimized Ensemble
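The abstract mentions ensembling the answers obtained from different prompts; the rank-based and optimized variants above differ mainly in how the per-prompt weights are chosen (averaging the top-ranked prompts vs. learning weights on training data). The sketch below shows the basic weighted combination; the function name and the candidate scores are made up for illustration, with uniform weights as the default.

```python
# Weighted combination of per-prompt answer distributions (illustrative sketch).
def ensemble_prompts(per_prompt_probs, weights=None):
    """per_prompt_probs: list of dicts mapping candidate object -> probability."""
    if weights is None:  # uniform weights = plain averaging
        weights = [1.0 / len(per_prompt_probs)] * len(per_prompt_probs)
    combined = {}
    for w, probs in zip(weights, per_prompt_probs):
        for y, p in probs.items():
            combined[y] = combined.get(y, 0.0) + w * p
    best = max(combined, key=combined.get)
    return best, combined

# Hypothetical scores from two prompts, "Obama is a [Y] by profession" and
# "Obama worked as a [Y]":
p1 = {"lawyer": 0.30, "politician": 0.25, "senator": 0.10}
p2 = {"politician": 0.40, "lawyer": 0.20, "president": 0.15}
best, dist = ensemble_prompts([p1, p2])
print(best)  # politician
```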

5. Main Experiments

5. 1. Experimental Settings

Dataset

Models

Evaluation Metrics

Method

Implementation Details

5. 2. Evaluation Results

Single Prompt Experiments

Prompt Ensembling

Mining vs. Paraphrasing

Middle-word vs. Dependency-based

Performance of Different LMs

LAMA-UHN Evaluation

Performance on Google-RE

5. 3. Analysis

Prediction Consistency by Prompt

POS-based Analysis

Cross-model Consistency

Linear vs. Log-linear Combination

6. Omitted Design Elements

6. 1. LM-aware Prompt Generation

6. 2. Forward and Backward Probabilities

7. Related Work

Much work has focused on understanding the internal representations in neural NLP models (Belinkov and Glass, 2019), either by using extrinsic probing tasks to examine whether certain linguistic properties can be predicted from those representations (Shi et al., 2016; Linzen et al., 2016; Belinkov et al., 2017), or by ablations to the models to investigate how behavior varies (Li et al., 2016b; Smith et al., 2017). For contextualized representations in particular, a broad suite of NLP tasks are used to analyze both syntactic and semantic properties, providing evidence that contextualized representations encode linguistic knowledge in different layers (Hewitt and Manning, 2019; Tenney et al., 2019a,b; Jawahar et al., 2019; Goldberg, 2019). Different from analyses probing the representations themselves, our work follows Petroni et al. (2019); Pörner et al. (2019) in probing for factual knowledge. They use manually defined prompts, which may be under-estimating the true performance obtainable by LMs. Concurrently to this work, Bouraoui et al. (2020) made a similar observation that using different prompts can help better extract relational knowledge from LMs, but they use models explicitly trained for relation extraction whereas our methods examine the knowledge included in LMs without any additional training. Orthogonally, some previous works integrate external knowledge bases so that the language generation process is explicitly conditioned on symbolic knowledge (Ahn et al., 2016; Yang et al., 2017; IV et al., 2019; Hayashi et al., 2020). Similar extensions have been applied to pre-trained LMs like BERT, where contextualized representations are enhanced with entity embeddings (Zhang et al., 2019; Peters et al., 2019; Pörner et al., 2019). In contrast, we focus on better knowledge retrieval through prompts from LMs as-is, without modifying them.

8. Conclusion

In this paper, we examined the importance of the prompts used in retrieving factual knowledge from language models. We propose mining-based and paraphrasing-based methods to systematically generate diverse prompts to query specific pieces of relational knowledge. Those prompts, when combined together, improve factual knowledge retrieval accuracy by 8%, outperforming manually designed prompts by a large margin. Our analysis indicates that LMs are indeed more knowledgeable than initially indicated by previous results, but they are also quite sensitive to how we query them. This indicates potential future directions such as (1) more robust LMs that can be queried in different ways but still return similar results, (2) methods to incorporate factual knowledge in LMs, and (3) further improvements in optimizing methods to query LMs for knowledge. Finally, we have released all our learned prompts to the community as the LM Prompt and Query Archive (LPAQA), available at: https://github.com/jzbjyb/LPAQA.

Dongju Park

Research Scientist / Engineer @ NAVER CLOVA
