Industry Scale Semi-Supervised Learning for Natural Language Understanding

- 4 mins

Authors : Luoxin Chen, Francisco Garcia, Varun Kumar, He Xie, Jianhua Lu - Alexa AI (Amazon)

NAACL 2021
Paper : https://www.aclweb.org/anthology/2021.naacl-industry.39.pdf
Code : -


Problem Definition

Whether SSL techniques actually help in "real-world" NLU applications is still an open question.

  1. Having high-quality labeled data is key to improving accuracy.
    • However, obtaining human annotations is an expensive and time-consuming process.
  2. A common practice for evaluating SSL algorithms is to take an existing labeled dataset, use only a small fraction of its training data as labeled data, and treat the rest as an unlabeled dataset.
    • Such evaluation, often constrained to cases where labeled data is scarce, raises questions about the usefulness of different SSL algorithms in a real-world setting.
  3. How much unlabeled data should we use for SSL, and how should we select it from a large pool of unlabeled data?
  4. Most SSL benchmarks assume that unlabeled datasets come from the same distribution as the labeled datasets.
    • This assumption is often violated: by design, labeled training datasets also contain synthetic and crowd-sourced data representing anticipated usage of a functionality, while unlabeled data often contain a lot of out-of-domain data.

Contribution

  1. Design of a production SSL pipeline that intelligently selects unlabeled data for training SSL models.
  2. Experimental comparison of four SSL techniques, Pseudo-Labeling (PL), Knowledge Distillation (KD), Cross-View Training (CVT), and Virtual Adversarial Training (VAT), in a real-world setting using data from Amazon Alexa.
  3. Operational recommendations for NLP practitioners who would like to employ SSL in a production setting.

Methods

Data Selection Approaches

  1. The first stage uses a classifier’s confidence score to filter domain-specific unlabeled data from a very large pool that may contain data from multiple domains.
    • Train a binary classifier on the labeled data and use it to select in-domain unlabeled data.
    • Use a confidence score of 0.5 as the threshold for data selection (a minimal sketch of this filter appears at the end of this section).
  2. The goal of the second-stage filtering is to find a subset of the data that yields better performance in SSL training.

    Selection by Submodular Optimization

    (Paper : Submodularity for Data Selection in Machine Translation, Kirchhoff & Bilmes, EMNLP 2014)

    • For SSL data selection, we use n-grams of order 1-4 as features and the logarithm as the concave function.
    • We filter out any n-gram feature that appears fewer than 30 times in Dl ∪ Du.
    • The algorithm starts with Dl as the selected set and greedily adds the utterance from the candidate pool Du that provides the maximum marginal gain (sketched below).
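
A minimal Python sketch of this greedy selection, assuming whitespace tokenization and f(S) = Σ_f log(1 + c_f(S)) as the submodular objective (the +1 keeps the logarithm defined at zero counts; the paper only specifies the logarithm as the concave function, so this exact form and all names below are illustrative):

```python
import math
from collections import Counter
from itertools import chain

def ngrams(tokens, n_max=4):
    # All n-grams of order 1..n_max, matching the 1-4 gram features above.
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def greedy_submodular_select(labeled, unlabeled, k, min_count=30):
    tokenize = str.split  # assumption: whitespace tokenization
    # Keep only n-gram features appearing >= min_count times in Dl ∪ Du.
    all_counts = Counter(chain.from_iterable(
        ngrams(tokenize(u)) for u in labeled + unlabeled))
    vocab = {f for f, c in all_counts.items() if c >= min_count}

    # Feature counts over the current selected set S, initialized to Dl.
    counts = Counter(chain.from_iterable(
        (f for f in ngrams(tokenize(u)) if f in vocab) for u in labeled))
    feats = [Counter(f for f in ngrams(tokenize(u)) if f in vocab)
             for u in unlabeled]

    pool, selected = set(range(len(unlabeled))), []
    for _ in range(min(k, len(unlabeled))):
        def gain(i):
            # Marginal gain of adding utterance i under f(S) = sum_f log(1 + c_f(S)).
            return sum(math.log1p(counts[f] + c) - math.log1p(counts[f])
                       for f, c in feats[i].items())
        best = max(pool, key=gain)
        pool.remove(best)
        selected.append(unlabeled[best])
        counts.update(feats[best])
    return selected
```

This naive pass evaluates every candidate's gain at each step; at production scale one would switch to lazy greedy evaluation, which the submodularity of f makes safe.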

    Selection by Committee

    • To detect data points on which the model is not reliable, we train a committee of n teacher models (n = 4 in the paper).
    • For each unlabeled utterance x, we compute the average entropy of the committee’s predicted label distributions (sketched below).
    • We then identify an entropy threshold with an acceptable mis-annotation error rate (e.g., 20%) based on a held-out dataset.
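
A sketch of the committee scoring, assuming "average entropy" means the per-teacher entropies H(p_i(·|x)) averaged over the n committee members (averaging the distributions first and then taking the entropy is another plausible reading); the teacher models and their predicted distributions are taken as given:

```python
import numpy as np

def committee_entropy(member_probs):
    # member_probs: array of shape (n_models, n_classes), one teacher's
    # predicted label distribution per row for a single utterance x.
    p = np.clip(member_probs, 1e-12, 1.0)       # guard against log(0)
    per_model = -(p * np.log(p)).sum(axis=1)    # entropy of each teacher
    return per_model.mean()                     # average over the committee

def select_reliable(unlabeled, teacher_probs, threshold):
    """Keep utterances whose average committee entropy falls below the
    threshold tuned on held-out data (e.g., for a ~20% mis-annotation rate)."""
    return [u for u, probs in zip(unlabeled, teacher_probs)
            if committee_entropy(np.asarray(probs)) <= threshold]
```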
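Finally, a sketch of the first-stage in-domain filter described at the top of this section. The paper does not specify the classifier, so this assumes a simple TF-IDF + logistic regression binary classifier; select_in_domain and all parameters are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def select_in_domain(labeled_texts, domain_labels, unlabeled_texts,
                     threshold=0.5):
    # domain_labels: 1 = in-domain, 0 = out-of-domain.
    vec = TfidfVectorizer(ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(labeled_texts), domain_labels)

    # Keep unlabeled utterances the classifier scores as in-domain with
    # confidence >= 0.5, the threshold used in the paper's first stage.
    scores = clf.predict_proba(vec.transform(unlabeled_texts))[:, 1]
    return [t for t, s in zip(unlabeled_texts, scores) if s >= threshold]
```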

Semi-Supervised Learning Approaches

Results

Diversity of Selected Data

Recommendations

  1. Prefer the VAT and CVT SSL techniques over PL and KD.
  2. Use data selection to select a subset of the unlabeled data.
    • Submodular Optimization based data selection is recommended in light of its lower cost and performance similar to the committee-based method.
    • Optimizing data selection when the unlabeled data pool has a drastically different distribution from the labeled data remains a challenge and could benefit from further research.
