| Paper ID | SS-11.2 |
| Paper Title | IMPROVED MASK-CTC FOR NON-AUTOREGRESSIVE END-TO-END ASR |
| Authors | Yosuke Higuchi, Waseda University, Japan; Hirofumi Inaguma, Kyoto University, Japan; Shinji Watanabe, Johns Hopkins University, United States; Tetsuji Ogawa, Tetsunori Kobayashi, Waseda University, Japan |
| Session | SS-11: On-device AI for Audio and Speech Applications |
| Location | Gather.Town |
| Session Time | Thursday, 10 June, 14:00 - 14:45 |
| Presentation Time | Thursday, 10 June, 14:00 - 14:45 |
| Presentation | Poster |
| Topic | Special Sessions: On-device AI for Audio and Speech Applications |
| Abstract | For real-world deployment of automatic speech recognition (ASR), a system should be capable of fast inference while keeping its computational requirements modest. The recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC), Mask-CTC, meets this demand by generating tokens in a non-autoregressive fashion. While Mask-CTC achieves remarkably fast inference, its recognition performance falls behind that of conventional autoregressive (AR) systems. To boost the performance of Mask-CTC, we first propose to enhance the encoder network by adopting the recently proposed Conformer architecture. Next, we propose new training and decoding methods that introduce an auxiliary objective to predict the length of a partial target sequence, which allows the model to delete or insert tokens during inference. Experimental results on different ASR tasks show that the proposed approaches improve Mask-CTC significantly, outperforming a standard CTC model (15.5% -> 9.1% WER on WSJ). Moreover, Mask-CTC now achieves results competitive with AR models with no degradation in inference speed (< 0.1 RTF on a CPU). We also show a potential application of Mask-CTC to end-to-end speech translation. |
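The sketch below is a rough, self-contained illustration of the non-autoregressive decoding flow the abstract refers to: take the CTC greedy output as an initial hypothesis, mask its low-confidence tokens, and iteratively refill the masks with a conditional masked-LM decoder. All function names and the random "decoder" here are hypothetical stand-ins, not the authors' implementation, and the paper's length-prediction extension (which lets the model insert or delete tokens) is only noted in a comment.

```python
# Toy sketch of Mask-CTC-style inference (illustrative only; the "model"
# below is a random stand-in, not the authors' code).
import numpy as np

MASK = "<mask>"
VOCAB = ["a", "b", "c", "d"]          # toy token inventory
BLANK = len(VOCAB)                     # CTC blank is the last class


def ctc_greedy_with_confidence(frame_logits):
    """Greedy CTC decoding: collapse repeats, drop blanks, and keep the
    frame posterior of each emitted token as its confidence."""
    probs = np.exp(frame_logits)
    probs /= probs.sum(axis=1, keepdims=True)
    best = probs.argmax(axis=1)
    tokens, confs, prev = [], [], BLANK
    for t, k in enumerate(best):
        if k != BLANK and k != prev:
            tokens.append(VOCAB[k])
            confs.append(probs[t, k])
        prev = k
    return tokens, np.array(confs)


def mlm_fill(tokens, rng):
    """Hypothetical conditional masked-LM decoder: replaces every masked
    position with a predicted token (random here, just to run end-to-end)."""
    return [rng.choice(VOCAB) if t == MASK else t for t in tokens]


def mask_ctc_decode(frame_logits, threshold=0.9, iterations=2, seed=0):
    rng = np.random.default_rng(seed)
    # 1) CTC greedy output fixes the initial hypothesis and its length.
    tokens, confs = ctc_greedy_with_confidence(frame_logits)
    # 2) Mask tokens whose CTC confidence falls below the threshold.
    tokens = [t if c >= threshold else MASK for t, c in zip(tokens, confs)]
    # 3) Iteratively re-predict the masked tokens with the decoder.
    #    (The paper's extension additionally predicts the length of each
    #    masked span, so tokens can be inserted or deleted; omitted here.)
    for _ in range(iterations):
        if MASK not in tokens:
            break
        tokens = mlm_fill(tokens, rng)
    return tokens


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    frame_logits = rng.normal(size=(20, len(VOCAB) + 1))  # T x (|V| + blank)
    print(mask_ctc_decode(frame_logits))
```

Because the output length is fixed by the CTC hypothesis and all masked positions are filled in parallel, inference cost stays roughly constant in the number of refinement iterations rather than growing with the output length, which is the property behind the reported sub-0.1 RTF CPU decoding.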