Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models

Cheolhyoung Lee; Kyunghyun Cho; Wanmo Kang

Keywords: dropout, generalization, nlp, optimization, regularization, stability, transformer

Thurs Session 1 (05:00-07:00 GMT)

Thurs Session 3 (12:00-14:00 GMT)

Abstract: In natural language processing, it has been observed recently that generalization could be greatly improved by finetuning a large-scale language model pretrained on a large unlabeled corpus. Despite its recent success and wide adoption, finetuning a large pretrained language model on a downstream task is prone to degenerate performance when there are only a small number of training instances available. In this paper, we introduce a new regularization technique, to which we refer as “mixout”, motivated by dropout. Mixout stochastically mixes the parameters of two models. We show that our mixout technique regularizes learning to minimize the deviation from one of the two models and that the strength of regularization adapts along the optimization trajectory. We empirically evaluate the proposed mixout and its variants on finetuning a pretrained language model on downstream tasks. More specifically, we demonstrate that the stability of finetuning and the average accuracy greatly increase when we use the proposed approach to regularize finetuning of BERT on downstream tasks in GLUE.
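The mixing step described in the abstract can be illustrated with a minimal sketch. It assumes details the abstract does not spell out: each parameter is swapped for its pretrained counterpart independently with probability p, and the result is rescaled so that its expectation equals the current parameter (the same idea as inverted dropout). The function name `mixout` and its arguments `w`, `u`, `p`, and `rng` are hypothetical names chosen for this illustration, not the authors' released code.

```python
import random


def mixout(w, u, p, rng):
    """Stochastically mix current parameters w with pretrained parameters u.

    Assumed formulation (not confirmed by the abstract): each entry of w is
    replaced by the matching entry of u with probability p, then the result
    is shifted and rescaled so that its expected value equals w.
    """
    if len(w) != len(u):
        raise ValueError("w and u must have the same length")
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")

    mixed_params = []
    for w_i, u_i in zip(w, u):
        # With probability p, fall back to the pretrained value.
        mixed = u_i if rng.random() < p else w_i
        # E[mixed] = p * u_i + (1 - p) * w_i, so this correction
        # restores an expected value of w_i.
        mixed_params.append((mixed - p * u_i) / (1.0 - p))
    return mixed_params


if __name__ == "__main__":
    rng = random.Random(0)
    current = [1.0, -2.0, 0.5]      # parameters being finetuned
    pretrained = [0.9, -1.8, 0.4]   # parameters of the pretrained model
    print(mixout(current, pretrained, 0.5, rng))
```

Under this assumption, setting the pretrained vector to all zeros reduces the sketch to ordinary inverted dropout on the parameters, which matches the abstract's statement that mixout is motivated by dropout. In practice the mixing would be applied only during training, with the unmixed parameters used at evaluation time.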

Similar Papers

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut

Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model

Wenhan Xiong, Jingfei Du, William Yang Wang, Veselin Stoyanov

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, Cho-Jui Hsieh

Spectral Embedding of Regularized Block Models

Nathan De Lara, Thomas Bonald