EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

- 7 mins

Authors: Jason Wei, Kai Zou
Protago Labs Research / Dartmouth College / Georgetown University
EMNLP 2019
Paper: https://arxiv.org/pdf/1901.11196.pdf
Code: https://github.com/jasonwei20/eda_nlp


Summary

Data augmentation is performed with four simple operations: synonym replacement, random insertion, random swap, and random deletion.
Unlike previous methods, EDA improves performance without requiring any external data or training an additional model.

Personal Thoughts


Abstract

We present EDA: easy data augmentation techniques for boosting performance on text classification tasks. EDA consists of four simple but powerful operations: synonym replacement, random insertion, random swap, and random deletion. On five text classification tasks, we show that EDA improves performance for both convolutional and recurrent neural networks. EDA demonstrates particularly strong results for smaller datasets; on average, across five datasets, training with EDA while using only 50% of the available training set achieved the same accuracy as normal training with all available data. We also performed extensive ablation studies and suggest parameters for practical use.

1. Introduction

2. EDA

2.1 Synonym Replacement (SR)

: Randomly select \(n\) words from the sentence that are not stop words, and replace each with one of its synonyms chosen at random.

All synonyms for synonym replacements and random insertions were generated using WordNet (Miller, 1995).
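
A minimal sketch of SR using NLTK's WordNet interface. This is an illustrative reimplementation, not the authors' released code; the stop-word set here is a small illustrative subset.

```python
import random

from nltk.corpus import wordnet  # requires nltk.download('wordnet')

# Small illustrative stop-word set; the reference code uses a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and", "to", "for"}

def get_synonyms(word):
    """Collect WordNet synonyms of `word`, excluding the word itself."""
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            name = lemma.name().replace("_", " ").lower()
            if name != word:
                synonyms.add(name)
    return list(synonyms)

def synonym_replacement(words, n):
    """Replace up to n randomly chosen non-stop words with a random synonym."""
    new_words = words.copy()
    candidates = [w for w in words if w not in STOP_WORDS]
    random.shuffle(candidates)
    replaced = 0
    for word in candidates:
        synonyms = get_synonyms(word)
        if synonyms:
            choice = random.choice(synonyms)
            new_words = [choice if w == word else w for w in new_words]
            replaced += 1
        if replaced >= n:
            break
    return new_words
```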

2.2 Random Insertion (RI)

: Find a synonym of a random non-stop word in the sentence and insert it at a random position; repeat \(n\) times.
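
A corresponding sketch of RI, reusing `get_synonyms` and `STOP_WORDS` from the SR snippet above (again an illustration, not the released code):

```python
def random_insertion(words, n):
    """Insert a synonym of a random non-stop word at a random position, n times."""
    new_words = words.copy()
    for _ in range(n):
        # Only non-stop words that actually have at least one synonym qualify.
        candidates = [w for w in new_words
                      if w not in STOP_WORDS and get_synonyms(w)]
        if not candidates:
            break
        synonym = random.choice(get_synonyms(random.choice(candidates)))
        new_words.insert(random.randrange(len(new_words) + 1), synonym)
    return new_words
```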

2.3 Random Swap (RS)

: Randomly choose two words in the sentence and swap their positions; repeat \(n\) times.
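
A sketch of RS. Positions are drawn uniformly and independently in this illustration, so a swap may occasionally pick the same index twice and be a no-op.

```python
def random_swap(words, n):
    """Swap the words at two randomly chosen positions, n times."""
    new_words = words.copy()
    if len(new_words) < 2:
        return new_words
    for _ in range(n):
        i = random.randrange(len(new_words))
        j = random.randrange(len(new_words))
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words
```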

2.4 Random Deletion (RD)

: Remove each word in the sentence independently with probability \(p\).
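
A sketch of RD. The guard that keeps at least one word when everything would be deleted is a choice made in this sketch so the augmented sentence is never empty.

```python
def random_deletion(words, p):
    """Remove each word independently with probability p."""
    if len(words) == 1:
        return words
    kept = [w for w in words if random.random() > p]
    # Guard: if every word was deleted, keep one at random so the
    # augmented sentence is never empty (a choice made in this sketch).
    return kept if kept else [random.choice(words)]
```

In the paper, the strength of all four operations is tied to the sentence length \(l\) through a single parameter \(\alpha\): \(n = \alpha l\) for SR, RI, and RS, and \(p = \alpha\) for RD.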

3. Experimental Setup

3.1 Benchmark Datasets

The statistics of each dataset are summarized in the paper; the five benchmarks are SST-2 (sentiment), CR (customer reviews), SUBJ (subjectivity/objectivity), TREC (question type), and PC (pro-con).

Since the hypothesis is that EDA helps most when the dataset is small, the experiments are also run on randomly drawn subsets of each dataset. Evaluation covers four training-set sizes in total: three subsets plus the full set.

Training set: \(N_{\text{train}} \in \{500, 2000, 5000, \text{all available data}\}\)
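
A trivial sketch of this subsampling step; `subsample` is a hypothetical helper, not taken from the paper's code.

```python
import random

def subsample(examples, n_train, seed=0):
    """Randomly draw n_train training examples (hypothetical helper)."""
    rng = random.Random(seed)
    if n_train >= len(examples):
        return list(examples)
    return rng.sample(examples, n_train)
```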

3.2 Text Classification Models

4. Results

4.1 EDA Makes Gains

4.2 Training Set Sizing

4.3 Does EDA conserve true labels?

4.4 Ablation Study: EDA Decomposed

4.5 How much augmentation?

6. Discussion and Limitations

7. Conclusion

We have shown that simple data augmentation operations can boost performance on text classification tasks. Although improvement is at times marginal, EDA substantially boosts performance and reduces overfitting when training on smaller datasets. Continued work on this topic could explore the theoretical underpinning of the EDA operations. We hope that EDA’s simplicity makes a compelling case for further thought.

Dongju Park

Research Scientist / Engineer @ NAVER CLOVA
