A Systematic Study of Knowledge Distillation for Natural Language Generation with Pseudo-Target Training

Published in ACL, 2023

Recommended citation: Nitay Calderon, Subhabrata Mukherjee, Roi Reichart, and Amir Kantor. 2023. A Systematic Study of Knowledge Distillation for Natural Language Generation with Pseudo-Target Training. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14632–14659. https://doi.org/10.18653/v1/2023.acl-long.818

Abstract: Modern Natural Language Generation (NLG) models come with massive computational and storage requirements. In this work, we study the potential of compressing them, which is crucial for real-world applications serving millions of users. We focus on Knowledge Distillation (KD) techniques, in which a small student model learns to imitate a large teacher model, enabling knowledge transfer from the teacher to the student. In contrast to much of the previous work, our goal is to optimize the model for a specific NLG task and a specific dataset. Typically, in real-world applications, in addition to labeled data there is abundant unlabeled task-specific data, which is crucial for attaining high compression rates via KD. We conduct a systematic study of task-specific KD techniques for various NLG tasks under realistic assumptions. We discuss the special characteristics of NLG distillation, particularly the exposure bias problem. Building on this discussion, we derive a family of Pseudo-Target (PT) augmentation methods, substantially extending prior work on sequence-level KD. We propose the Joint-Teaching method for NLG distillation, which applies word-level KD to multiple PTs generated by both the teacher and the student. Our study provides practical model design observations and demonstrates the effectiveness of PT training for task-specific KD in NLG.
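The core operation in the Joint-Teaching recipe described above is word-level KD over pseudo-targets: the student is trained to match the teacher's per-token output distribution at every position of a generated pseudo-target. Below is a minimal, self-contained sketch of that loss (not the authors' implementation); the PyTorch usage, tensor shapes, and toy logits are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       temperature: float = 1.0) -> torch.Tensor:
    """Word-level KD: KL(teacher || student) at every pseudo-target position.

    Both tensors are assumed to have shape (batch, pt_length, vocab_size),
    i.e. one next-token distribution per position of the pseudo-target.
    `batchmean` sums over positions and vocabulary and normalizes by batch size.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Toy example: a batch of 2 pseudo-targets, 5 tokens each, vocabulary of 100.
teacher_logits = torch.randn(2, 5, 100)                       # stand-in for teacher scores on a PT
student_logits = torch.randn(2, 5, 100, requires_grad=True)   # stand-in for student scores on the same PT
loss = word_level_kd_loss(student_logits, teacher_logits)
loss.backward()   # gradients flow only into the student
print(loss.item())
```

In the Joint-Teaching setup, this loss would be applied to pseudo-targets decoded from the unlabeled task inputs by both the teacher and the student, alongside the standard supervised objective on the labeled data.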
```bibtex
@inproceedings{calderon2023systematic,
  author       = {Nitay Calderon and
                  Subhabrata Mukherjee and
                  Roi Reichart and
                  Amir Kantor},
  editor       = {Anna Rogers and
                  Jordan L. Boyd{-}Graber and
                  Naoaki Okazaki},
  title        = {A Systematic Study of Knowledge Distillation for Natural Language
                  Generation with Pseudo-Target Training},
  booktitle    = {Proceedings of the 61st Annual Meeting of the Association for Computational
                  Linguistics (Volume 1: Long Papers), {ACL} 2023, Toronto, Canada,
                  July 9-14, 2023},
  pages        = {14632--14659},
  publisher    = {Association for Computational Linguistics},
  year         = {2023},
  url          = {https://doi.org/10.18653/v1/2023.acl-long.818},
  doi          = {10.18653/v1/2023.acl-long.818},
  timestamp    = {Thu, 10 Aug 2023 12:36:02 +0200},
  biburl       = {https://dblp.org/rec/conf/acl/CalderonMRK23.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
```