| Paper ID | SPE-48.1 |
| Paper Title | MIXSPEECH: DATA AUGMENTATION FOR LOW-RESOURCE AUTOMATIC SPEECH RECOGNITION |
| Authors | Linghui Meng, Institute of Automation, Chinese Academy of Sciences, China; Jin Xu, Institute for Interdisciplinary Information Sciences, Tsinghua University, China; Xu Tan, Jindong Wang, Tao Qin, Microsoft Research Asia, China; Bo Xu, Institute of Automation, Chinese Academy of Sciences, China |
| Session | SPE-48: Speech Recognition 18: Low Resource ASR |
| Location | Gather.Town |
| Session Time | Friday, 11 June, 11:30 - 12:15 |
| Presentation Time | Friday, 11 June, 11:30 - 12:15 |
| Presentation | Poster |
| Topic | Speech Processing: [SPE-GASR] General Topics in Speech Recognition |
| Abstract | In this paper, we propose MixSpeech, a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR). MixSpeech trains an ASR model by taking a weighted combination of the speech features (e.g., mel-spectrograms or MFCCs) of two different utterances as the input and recognizing both text sequences, where the two recognition losses use the same combination weight. We apply MixSpeech to two popular end-to-end speech recognition models, LAS (Listen, Attend and Spell) and Transformer, and conduct experiments on several low-resource datasets including TIMIT, WSJ, and HKUST. Experimental results show that MixSpeech achieves better accuracy than baseline models trained without data augmentation and outperforms SpecAugment, a strong data augmentation method, on these recognition tasks. Specifically, MixSpeech surpasses SpecAugment with a relative PER improvement of 10.6% on the TIMIT dataset and achieves a strong WER of 4.7% on the WSJ dataset. |
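The following is a minimal PyTorch sketch of the mixup-style training loss the abstract describes, assuming the two utterances' feature tensors have already been padded to the same shape. The names `model` and `loss_fn` are illustrative placeholders for an end-to-end ASR model (e.g., LAS or Transformer) and its sequence-level loss; this is not the authors' released implementation.

```python
import torch

def mixspeech_loss(model, loss_fn, feats_a, target_a, feats_b, target_b, alpha=0.5):
    """Mix two utterances' features and combine both recognition losses.

    feats_a / feats_b: speech features (e.g., mel-spectrograms), padded to
    a common shape. target_a / target_b: the corresponding transcripts in
    whatever form loss_fn expects. alpha parameterizes the Beta distribution
    from which the mixing weight is drawn, as in standard mixup.
    """
    # Draw the combination weight lambda ~ Beta(alpha, alpha).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    # Weighted combination of the two inputs' speech features.
    mixed = lam * feats_a + (1.0 - lam) * feats_b
    output = model(mixed)
    # Recognize both text sequences; the two losses share the same weight.
    return lam * loss_fn(output, target_a) + (1.0 - lam) * loss_fn(output, target_b)
```

In this sketch a single forward pass over the mixed features serves both targets, so the only overhead relative to standard training is the second loss evaluation.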