Self-training Improves Pre-training for Natural Language Understanding


Authors : Jingfei Du, Edouard Grave, Beliz Gunel, Vishrav Chaudhary, Onur Celebi, Michael Auli, Ves Stoyanov, Alexis Conneau / Facebook AI, Stanford University

Arxiv
Paper : https://arxiv.org/pdf/2010.02194.pdf
Code : https://github.com/facebookresearch/SentAugment


Summary

Personal Thoughts


Abstract

Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach does not require in-domain unlabeled data and is therefore more generally applicable. Experiments show that self-training is complementary to strong RoBERTa baselines on a variety of tasks. Our augmentation approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks. Finally, we also show strong gains on knowledge-distillation and few-shot learning.

1. Introduction

2. Approach

For self-training, the SentAugment method retrieves task-specific in-domain unlabeled data from a large bank of sentences and assigns labels to it with a teacher model. RoBERTa-Large is used as the teacher model.

The data obtained this way is then used to train the student model, as sketched below.
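A minimal sketch of this self-training step, assuming a task-specific fine-tuned RoBERTa-Large checkpoint and a list of retrieved sentences (both placeholders here): the teacher scores each retrieved sentence, the most confident predictions are kept as synthetic labels, and a student model is then fine-tuned on them as usual.

```python
# Hypothetical sketch of the self-training loop: a fine-tuned teacher labels
# retrieved unlabeled sentences, and the student trains on the synthetic labels.
# The checkpoint path and `retrieved_sentences` are placeholders, not the paper's artifacts.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

teacher_name = "path/to/finetuned-roberta-large"  # assumption: teacher fine-tuned on the task
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForSequenceClassification.from_pretrained(teacher_name).eval()

retrieved_sentences = ["an unlabeled sentence retrieved by SentAugment", "..."]

# 1) Teacher annotates the retrieved in-domain sentences with labels and confidences.
synthetic_data = []
with torch.no_grad():
    for sent in retrieved_sentences:
        inputs = tokenizer(sent, return_tensors="pt", truncation=True)
        probs = teacher(**inputs).logits.softmax(dim=-1).squeeze(0)
        confidence, label = probs.max(dim=-1)
        synthetic_data.append((sent, int(label), float(confidence)))

# 2) Keep only confident predictions, then fine-tune a fresh student model
#    on the remaining (sentence, label) pairs.
synthetic_data = [x for x in synthetic_data if x[2] > 0.9]  # threshold is illustrative
```

The paper ranks the retrieved sentences by the teacher's score and keeps only the top ones; the fixed threshold above is just a stand-in for that filtering step.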

2. 1. SentAugment: data augmentation for semi-supervised learning

Most semi-supervised approaches rely on in-domain unlabeled data; here, such data is instead constructed from external data (a large bank of web sentences), retrieved as sketched below.
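A minimal sketch of the retrieval step, assuming a sentence encoder `embed()` (the paper uses its own SASE embeddings, but any sentence encoder illustrates the idea) and a precomputed, L2-normalized embedding matrix `bank_emb` for the sentence bank; the names and the single averaged query are illustrative.

```python
# Hypothetical embedding-based retrieval: build a task-specific query embedding
# from the labeled training sentences, then take the nearest sentences in the bank.
import numpy as np

def retrieve(labeled_sentences, bank, bank_emb, embed, k=1000):
    # Task-specific query embedding: average of the labeled sentences' embeddings
    # (the paper also considers per-label and per-example query embeddings).
    query = np.mean([embed(s) for s in labeled_sentences], axis=0)
    query = query / np.linalg.norm(query)

    # Cosine similarity against the L2-normalized sentence bank, keep the top-k sentences.
    scores = bank_emb @ query
    top = np.argsort(-scores)[:k]
    return [bank[i] for i in top]
```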




2. 2. Semi-supervised learning for natural language understanding

Gains are obtained by combining the data augmentation technique with self-training and knowledge distillation (KD).
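A minimal sketch of the distillation objective on the augmented data, assuming teacher and student logits for the same retrieved sentences; the temperature value and function names are illustrative rather than the paper's exact setup.

```python
# Hypothetical KD loss: the student matches the teacher's softened class
# distribution over the retrieved sentences (KL divergence with temperature).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)      # teacher's soft labels
    log_student = F.log_softmax(student_logits / t, dim=-1)   # student's log-probabilities
    # Scale by t^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (t * t)
```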




3. Experimental setup

3. 1. Large-scale bank of sentences

3. 2. Evaluation datasets

3. 3. Training details


4. Analysis and Results

4. 1. Self-training experiments

4. 2. Few-shot learning experiments

4. 3. Knowledge distillation experiments

4. 4. Ablation study of data augmentation


5. 1. Sentence embeddings (SASE)

6. Conclusion

Recent work in natural language understanding has focused on unsupervised pretraining. In this paper, we show that self-training is another effective method to leverage unlabeled data. We introduce SentAugment, a new data augmentation method for NLP that retrieves relevant sentences from a large web data corpus. Self-training is complementary to unsupervised pre-training for a range of natural language tasks and their combination leads to further improvements on top of a strong RoBERTa baseline. We also explore knowledge distillation and extend previous work on few-shot learning by showing that open domain data with SentAugment is sufficient for good accuracy.

Dongju Park

Research Scientist / Engineer @ NAVER CLOVA
