MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification

Authors : Jiaao Chen, Zichao Yang, Diyi Yang
Georgia Tech / CMU
ACL 2020
Paper : https://arxiv.org/pdf/2004.12239.pdf
Code : https://github.com/GT-SALT/MixText


Summary

Personal Thoughts


Abstract

This paper presents MixText, a semi-supervised learning method for text classification, which uses our newly designed data augmentation method called TMix. TMix creates a large amount of augmented training samples by interpolating text in hidden space. Moreover, we leverage recent advances in data augmentation to guess low-entropy labels for unlabeled data, hence making them as easy to use as labeled data. By mixing labeled, unlabeled and augmented data, MixText significantly outperformed current pre-trained and fine-tuned models and other state-of-the-art semi-supervised learning methods on several text classification benchmarks. The improvement is especially prominent when supervision is extremely limited. We have publicly released our code at https://github.com/GT-SALT/MixText.

1. Introduction

2. Related Work

2.1. Pre-training and Fine-tuning Framework

: Introduces pre-trained language models (PLMs) such as GPT and BERT

2.2. Semi-Supervised Learning on Text Data

: Introduces VAEs, adversarial training, Virtual Adversarial Training (VAT), Unsupervised Data Augmentation (UDA), and related methods

2.3. Interpolation-based Regularizers

: Introduces prior work that applies MixUp-based interpolation regularizers

2.4. Data Augmentations for Text

: Introduces various text augmentation techniques such as EDA (synonym replacement, random deletion, etc.) and back translation

3. TMix

Reference : https://hoya012.github.io/blog/Bag-of-Tricks-for-Image-Classification-with-Convolutional-Neural-Networks-Review/
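TMix takes two training samples, encodes both up to a randomly chosen BERT layer m, interpolates the hidden states with λ ~ Beta(α, α), runs the remaining layers on the mixed representation, and mixes the labels with the same λ. Below is a rough PyTorch sketch of the idea, assuming the Hugging Face transformers BERT implementation; the function name, the 0-indexed layer set, and the mask handling are my own illustrative choices, not the authors' code.

```python
# Rough TMix sketch (illustrative; see the official repo for the authors' version).
import numpy as np
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

def tmix_encode(ids_a, mask_a, ids_b, mask_b, alpha=2.0, mix_layer_set=(6, 8, 11)):
    """Encode two batches (padded to the same length), mixing hidden states
    at a randomly chosen encoder layer m. Returns the mixed hidden states and lambda,
    so the caller can also mix the labels: y_mix = lam * y_a + (1 - lam) * y_b."""
    lam = float(np.random.beta(alpha, alpha))          # lambda ~ Beta(alpha, alpha)
    m = int(np.random.choice(mix_layer_set))           # 0-indexed; roughly the {7, 9, 12} set in the paper
    ext_a = model.get_extended_attention_mask(mask_a, ids_a.shape, ids_a.device)
    ext_b = model.get_extended_attention_mask(mask_b, ids_b.shape, ids_b.device)
    h_a = model.embeddings(input_ids=ids_a)
    h_b = model.embeddings(input_ids=ids_b)
    for k, layer in enumerate(model.encoder.layer):
        if k < m:                                      # run the two sentences separately up to layer m
            h_a = layer(h_a, attention_mask=ext_a)[0]
            h_b = layer(h_b, attention_mask=ext_b)[0]
        else:
            if k == m:                                 # interpolate in hidden space once, at layer m
                h_a = lam * h_a + (1.0 - lam) * h_b
            h_a = layer(h_a, attention_mask=ext_a)[0]  # remaining layers see only the mixed states
    return h_a, lam
```

The mixed representation is then pooled and classified as usual, with the loss computed against the mixed label lam * y_a + (1 - lam) * y_b, as in standard MixUp.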

4. Semi-supervised MixText

Do not confuse the two: MixText is the semi-supervised learning framework for text, while TMix is the data augmentation technique used inside it.

The goal is to train a classifier using both labeled and unlabeled data.

Doing so requires a way to attach labels to the unlabeled data.

For this, MixText uses data augmentation, label guessing, and entropy minimization.

The overall flow: each unlabeled sample is augmented by back translation, its label is guessed by averaging and sharpening the model's predictions on the original and augmented versions, and TMix is then applied over the combined labeled, unlabeled, and augmented data (a sketch of the label-guessing step is given below).
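A minimal sketch of the label-guessing step as I understand it: the current model's predictions on the original unlabeled sentence and its K back-translations are averaged and then sharpened with a temperature T to produce a low-entropy soft label. The uniform weighting and the function names here are illustrative assumptions; the paper weights original and augmented predictions separately.

```python
import torch

def guess_label(model_predict, x_orig, x_augs, T=0.5, weights=None):
    """Guess a soft label for an unlabeled sample from the model's own predictions.

    model_predict: callable mapping a batch of token ids to class probabilities.
    x_augs: list of back-translated versions of x_orig (e.g. via German and Russian).
    """
    with torch.no_grad():
        probs = [model_predict(x) for x in [x_orig] + list(x_augs)]
    if weights is None:
        weights = [1.0] * len(probs)              # uniform weights here for simplicity
    p = sum(w * q for w, q in zip(weights, probs)) / sum(weights)
    p = p ** (1.0 / T)                            # temperature sharpening: lower T -> lower entropy
    return p / p.sum(dim=-1, keepdim=True)        # renormalize to a valid distribution
```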

4.1. Data Augmentation

4.2. Label Guessing

4.3. TMix on Labeled and Unlabeled Data

4.4. Entropy Minimization
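For reference, the standard entropy-minimization penalty on unlabeled predictions looks like the sketch below; the exact self-training term and its weight schedule in MixText should be taken from the paper and the released code.

```python
import torch

def entropy_penalty(probs, eps=1e-8):
    """Standard entropy-minimization term on unlabeled predictions:
    H(p) = -sum_c p_c * log(p_c), averaged over the batch. Adding this term
    with a small weight pushes the model toward confident, low-entropy
    predictions on unlabeled data."""
    return -(probs * torch.log(probs + eps)).sum(dim=-1).mean()
```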

5. Experiments

5.1. Dataset and Pre-processing

5.2. Baselines

5.3. Model Settings

We used the BERT-base-uncased tokenizer to tokenize the text and the bert-base-uncased model as our text encoder, with average pooling over the encoder output followed by a two-layer MLP with a 128-dimensional hidden layer and tanh activation to predict the labels. The maximum sentence length is set to 256; sentences exceeding the limit keep only their first 256 tokens. The learning rate is 1e-5 for the BERT encoder and 1e-3 for the MLP.

For α in the Beta distribution, when there are fewer than 100 labeled examples per class, α is set to 2 or 16, since a larger α is more likely to generate λ around 0.5 and thus creates "newer" data that acts as data augmentation; when there are more than 200 labeled examples per class, α is set to 0.2 or 0.4, since a smaller α is more likely to generate λ around 0.1 and thus creates "similar" data that acts as noise regularization.

For TMix, we use only the labeled data, with the same settings as the BERT baseline, and a batch size of 8. In MixText, we use both labeled and unlabeled data for training with the same settings as in UDA. We set K = 2, i.e., each unlabeled example gets two augmentations, specifically back translation via German and Russian. The batch size is 4 for labeled data and 8 for unlabeled data. Temperature T is tuned starting from 0.5; in our experiments we use 0.3 for AG News, 0.5 for DBpedia and Yahoo! Answers, and 1 for IMDB.
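A minimal sketch of the classifier setup described above, assuming the Hugging Face transformers BERT implementation. The pooling, MLP shape, and learning rates follow the paper; the choice of Adam as the optimizer is my assumption.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class MixTextClassifier(nn.Module):
    """bert-base-uncased encoder + average pooling + two-layer MLP head (128, tanh)."""
    def __init__(self, num_classes, hidden=128):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size, hidden),
            nn.Tanh(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state                     # (batch, seq_len, 768)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (h * mask).sum(1) / mask.sum(1)      # average pooling over non-padding tokens
        return self.head(pooled)

# Separate learning rates as in the paper: 1e-5 for the BERT encoder, 1e-3 for the MLP head.
# (Optimizer choice is an assumption, not stated here.)
model = MixTextClassifier(num_classes=4)              # e.g. AG News has 4 classes
optimizer = torch.optim.Adam([
    {"params": model.encoder.parameters(), "lr": 1e-5},
    {"params": model.head.parameters(), "lr": 1e-3},
])
```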

5.4. Results

5.5. Ablation Studies

6. Conclusion

To alleviate the dependence of supervised models on labeled data, this work presented a simple but effective semi-supervised learning method, MixText, for text classification, in which we also introduced TMix, an interpolation-based augmentation and regularization technique. Through experiments on four benchmark text classification datasets, we demonstrated the effectiveness of the proposed TMix technique and the MixText model, which achieve better test accuracy and a more stable loss trend compared with current pre-training and fine-tuning models and other state-of-the-art semi-supervised learning methods. As future work, we plan to explore the effectiveness of MixText on other NLP tasks such as sequence labeling, and in other real-world scenarios with limited labeled data.

Dongju Park

Research Scientist / Engineer @ NAVER CLOVA
