Do Not Have Enough Data? Deep Learning to the Rescue!


Authors: Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, Naama Zwerdling (IBM Research AI; University of Haifa, Israel; Technion - Israel Institute of Technology)

Paper: https://aaai.org/Papers/AAAI/2020GB/AAAI-AnabyA.4027.pdf
Code: None
AAAI 2020, poster session


Summary

Data augmentation by using GPT-2 to generate new samples, filtering the generated samples with a classifier trained on the original data, and adding the surviving samples back to the original dataset (see the sketch under 4.1 below).

Since neither implementation details nor code are available, some questions about the experimental method remain (listed below).


Abstract

Based on recent advances in natural language modeling and those in text generation capabilities, we propose a novel data augmentation method for text classification tasks. We use a powerful pre-trained neural network model to artificially synthesize new labeled data for supervised learning. We mainly focus on cases with scarce labeled data. Our method, referred to as language-model-based data augmentation (LAMBADA), involves fine-tuning a state-of-the-art language generator to a specific task through an initial training phase on the existing (usually small) labeled data. Using the fine-tuned model and given a class label, new sentences for the class are generated. Our process then filters these new sentences by using a classifier trained on the original data. In a series of experiments, we show that LAMBADA improves classifiers’ performance on a variety of datasets. Moreover, LAMBADA significantly improves upon the state-of-the-art techniques for data augmentation, specifically those applicable to text classification tasks with little data.

1. Introduction

3. Problem Definition

Definition of the text classification problem.

Details omitted.

Text classification is an instance of the supervised learning problem over textual data. …

4. LAMBADA Method

LAMBADA, named for its use of Language-Model-Based Data Augmentation, adds synthesized, weakly-labeled data samples to a given dataset.

LAMBADA has two key ingredients: a generative model (GPT-2) fine-tuned to synthesize labeled sentences, and a classifier trained on the original data that filters them.

Input: the (small) labeled training dataset, a classification algorithm, a pre-trained language model (GPT-2), and the number of sentences to synthesize per class.

4.1 LAMBADA Algorithm
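
The paper describes the algorithm only in prose and releases no code, so below is a minimal sketch of the four steps as I read them: train a baseline classifier on the original data, fine-tune GPT-2 on label-prefixed sentences, generate candidates conditioned on each label, and filter the candidates by classifier confidence. The `[SEP]` separator, the hyperparameters, and the scikit-learn-style classifier interface (`predict_proba`, `classes_`) are my assumptions, not the authors'.

```python
# Minimal LAMBADA sketch. Assumptions (the paper's code is unavailable):
# HuggingFace `transformers` for GPT-2 and a scikit-learn text pipeline
# (e.g. TfidfVectorizer + LogisticRegression) as the baseline classifier.
from collections import defaultdict

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

SEP = "[SEP]"  # hypothetical label/text separator


def finetune_gpt2(labeled_pairs, epochs=3, lr=5e-5):
    """Fine-tune GPT-2 on "label [SEP] text" strings (step 2)."""
    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for label, text in labeled_pairs:
            ids = tok.encode(f"{label} {SEP} {text}" + tok.eos_token,
                             return_tensors="pt")
            loss = model(ids, labels=ids).loss  # causal LM loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model, tok


def generate_candidates(model, tok, label, n=20):
    """Condition on "label [SEP]" and sample new sentences (step 3)."""
    model.eval()
    prompt = tok.encode(f"{label} {SEP}", return_tensors="pt")
    outs = model.generate(prompt, do_sample=True, top_p=0.9, max_length=40,
                          num_return_sequences=n,
                          pad_token_id=tok.eos_token_id)
    texts = [tok.decode(o, skip_special_tokens=True) for o in outs]
    return [(label, t.split(SEP, 1)[-1].strip()) for t in texts]


def filter_candidates(candidates, clf, keep_per_class=10):
    """Keep, per class, the candidates that the baseline classifier
    assigns the highest confidence to that class (step 4)."""
    classes = list(clf.classes_)
    ranked = defaultdict(list)
    for label, text in candidates:
        conf = clf.predict_proba([text])[0][classes.index(label)]
        ranked[label].append((conf, text))
    kept = []
    for label, scored in ranked.items():
        scored.sort(reverse=True)
        kept += [(label, t) for _, t in scored[:keep_per_class]]
    return kept
```

The kept pairs are appended to the original training set and the classifier is retrained on the augmented data.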

5. Experimental Results

Three classifiers (BERT, SVM, LSTM) were tested on three datasets (ATIS, TREC, WVA) with varying amounts of data per class.

LAMBADA was also compared against other data augmentation techniques (CVAE, EDA, CBERT).

5.1 Datasets

Each dataset is randomly split into train, validation, and test sets (80% / 10% / 10%).
Subsets are then built by randomly sampling 5, 10, 20, 50, and 100 examples per class from the training set, as sketched below.
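
A short sketch of that per-class subsampling, assuming a dataset represented as (label, text) pairs; the paper does not publish its splitting code.

```python
import random
from collections import defaultdict


def sample_per_class(examples, n_per_class, seed=0):
    """Randomly keep at most n_per_class (label, text) pairs per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for label, text in examples:
        by_class[label].append((label, text))
    subset = []
    for items in by_class.values():
        rng.shuffle(items)
        subset.extend(items[:n_per_class])
    return subset


# One subset per size used in the paper:
# subsets = {n: sample_per_class(train_set, n) for n in (5, 10, 20, 50, 100)}
```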

5.2 Classifiers

See the paper for details.

5.3 Generative Models

For a fair comparison, only conditional generative models, i.e., models that can generate sentences given a class label, were used.

5.4 Results

Number of Samples and Classifiers

We compared the LAMBADA approach with the baseline using three different classifiers over varied numbers of training samples: 5, 10, 20, 50, and 100 for each class. We used the ATIS dataset to discover for which sample size our approach is beneficial.

Applying LAMBADA improves all classifiers when the sample size is 50 or fewer per class.

At 100 samples per class, performance degrades for LSTM and SVM.

Datasets

We substantiate previous results by comparing the baseline to our LAMBADA approach over three datasets using five samples for each class. Table 4 shows that our approach significantly improves all classifiers over all datasets.

The gain is largest on the ATIS dataset, which suggests the method is especially effective on imbalanced datasets.

Comparison of Generative Models

Table 5 shows that our approach is statistically superior to all other generation algorithms in the ATIS and WVA datasets over all classifiers.

LAMBADA vs. Unlabeled Data

LAMBADA does not require unlabeled data.
The authors compare using unlabeled data against using LAMBADA (generation).
The unlabeled data is exploited with the semi-supervised approach of Ruder and Plank (2018).

To build the unlabeled dataset, samples are drawn from the original dataset with their label information discarded.
To apply the weak labeling approach, a classifier trained on the labeled dataset classifies these unlabeled samples (a sketch follows below).
Since LAMBADA helps more, generating new data is more effective than weak-labeling samples taken from the original dataset.
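
A hedged sketch of that weak-labeling baseline as I understand it; the confidence threshold and the pool construction are my assumptions, not the authors'.

```python
def weak_label(clf, unlabeled_texts, threshold=0.9):
    """Label an unlabeled pool with the classifier trained on the small
    labeled set; keep only confident predictions (assumed threshold)."""
    kept = []
    for text in unlabeled_texts:
        proba = clf.predict_proba([text])[0]
        if proba.max() >= threshold:
            kept.append((clf.classes_[proba.argmax()], text))
    return kept
```

The weakly-labeled pairs are then added to the labeled training set, mirroring how LAMBADA adds its filtered generated pairs.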

Two factors are behind the performance improvement.

To evaluate the importance of the “generated” labels, an experiment was run with them removed (Unlab. GPT).

❓ Open questions:
- Is the semi-supervised approach the same as the weak labeling approach?
- What proportion of the data is used as the unlabeled dataset vs. for training the classifier?
- What exactly do the “generated” labels mean?
- Does Unlab. GPT likewise apply the weak labeling approach to obtain labels?

6. Discussion and Future Work

We introduce LAMBADA for improving classifiers’ performance. It involves fine-tuning a language model, generating new label-conditioned sentences, and a filtering phase. We showed that our method statistically improves classifiers’ performance on small datasets. In addition, we showed that LAMBADA beats the state-of-the-art techniques in data augmentation.

Dongju Park

Research Scientist / Engineer @ NAVER CLOVA
