SSMBA: Self-Supervised Manifold Based Data Augmentation for Improving Out-of-Domain Robustness

- 12 mins

Authors: Nathan Ng (University of Toronto, Vector Institute), Kyunghyun Cho (New York University), Marzyeh Ghassemi (University of Toronto, Vector Institute)

EMNLP 2020
Paper: https://www.aclweb.org/anthology/2020.emnlp-main.97.pdf
Code: https://github.com/nng555/ssmba


Summary

Personal Thoughts


Abstract

Models that perform well on a training domain often fail to generalize to out-of-domain (OOD) examples. Data augmentation is a common method used to prevent overfitting and improve OOD generalization. However, in natural language, it is difficult to generate new examples that stay on the underlying data manifold. We introduce SSMBA, a data augmentation method for generating synthetic training examples by using a pair of corruption and reconstruction functions to move randomly on a data manifold. We investigate the use of SSMBA in the natural language domain, leveraging the manifold assumption to reconstruct corrupted text with masked language models. In experiments on robustness benchmarks across 3 tasks and 9 datasets, SSMBA consistently outperforms existing data augmentation methods and baseline models on both in-domain and OOD data, achieving gains of 0.8% accuracy on OOD Amazon reviews, 1.8% accuracy on OOD MNLI, and 1.4 BLEU on in-domain IWSLT14 German-English.
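To make the corruption-and-reconstruction loop concrete, below is a minimal sketch of the idea using a pretrained masked language model from HuggingFace Transformers. This is an illustrative reconstruction under my own assumptions (model choice, 15% masking rate, single-pass sampling), not the authors' released implementation; see the official repo linked above for the real code.

```python
# Illustrative SSMBA-style corruption + reconstruction sketch.
# Assumptions (not from the paper's released code): bert-base-uncased as the
# reconstruction model, a 15% masking rate, and single-pass token sampling.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def ssmba_augment(sentence, mask_prob=0.15, num_samples=4):
    """Generate synthetic variants by masking tokens and sampling replacements
    from a masked LM, so the samples stay near the underlying data manifold."""
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"]
    # Protect special tokens ([CLS], [SEP]) from corruption.
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(
            input_ids[0].tolist(), already_has_special_tokens=True
        ),
        dtype=torch.bool,
    )
    variants = []
    for _ in range(num_samples):
        # Corruption: randomly replace a fraction of tokens with [MASK].
        mask = (torch.rand(input_ids.shape) < mask_prob) & ~special
        corrupted = input_ids.clone()
        corrupted[mask] = tokenizer.mask_token_id
        # Reconstruction: sample (rather than argmax) from the MLM's output
        # distribution, so repeated calls take different random steps.
        with torch.no_grad():
            logits = model(input_ids=corrupted).logits[0]
        sampled = torch.multinomial(torch.softmax(logits, dim=-1), 1).squeeze(-1)
        restored = input_ids.clone()
        restored[0, mask[0]] = sampled[mask[0]]
        variants.append(tokenizer.decode(restored[0], skip_special_tokens=True))
    return variants

print(ssmba_augment("the movie was surprisingly good"))
```

For classification tasks, the simplest choice is to pair each synthetic example with the original example's label; the paper's Section 7.6 compares this against generating labels with a teacher model.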

1. Introduction

2. Background

2.1. Data Augmentation in NLP

2.2. VRM and the Manifold Assumption

2.3. Sampling from Denoising Autoencoders

2.4. Masked Language Models

3. SSMBA: Self-Supervised Manifold Based Augmentation

4. Datasets

4.1. Sentiment Analysis

4.2. Natural Language Inference

4.3. Machine Translation

5. Experimental Setup

5.1. Model Types

5.2. SSMBA Settings

5.3. Baselines

Sentiment analysis and NLI tasks

MT tasks

5.4. Evaluation Method

6. Results

6.1. Sentiment Analysis

6.2. Natural Language Inference

6.3. Machine Translation

7. Analysis and Discussion

Analyses of the Baby domain within the AR-Clothing dataset, under the characteristics and settings below.

7.1. Training Set Size

7.2. Reconstruction Model Capacity

7.3. Corruption Amount

7.4. Sample Generation Methods

7.5. Amount of Augmentation

7.6. Label Generation

8. Conclusion

In this paper, we introduce SSMBA, a method for generating synthetic data in settings where the underlying data manifold is difficult to characterize. In contrast to other data augmentation methods, SSMBA is applicable to any supervised task, requires no task-specific knowledge, and does not rely on dataset-specific fine-tuning. We demonstrate SSMBA’s effectiveness on three NLP tasks spanning classification and sequence modeling: sentiment analysis, natural language inference, and machine translation. We achieve gains of 0.8% accuracy on OOD Amazon reviews, 1.8% accuracy on OOD MNLI, and 1.4 BLEU on in-domain IWSLT14 de→en. Our analysis shows that SSMBA is robust to the initial dataset size, reconstruction model choice, and corruption amount, offering OOD robustness improvements in most settings. Future work will explore applying SSMBA to the target side manifold in structured prediction tasks, as well as other natural language tasks and settings where data augmentation is difficult.

Dongju Park

Research Scientist / Engineer @ NAVER CLOVA
