||Pengcheng Guo, Northwestern Polytechnical University; Johns Hopkins University, China; Florian Boyer, LaBRI, University of Bordeaux; Airudit, France; Xuankai Chang, Johns Hopkins University, United States; Tomoki Hayashi, Nagoya University; Human Dataware Lab. Co., Ltd., Japan; Yosuke Higuchi, Waseda University, Japan; Hirofumi Inaguma, Kyoto University, Japan; Naoyuki Kamo, NTT Corporation, Japan; Chenda Li, Shanghai Jiao Tong University, China; Daniel Garcia-Romero, Johns Hopkins University, United States; Jiatong Shi, Johns Hopkins University, United States; Jing Shi, Institute of Automation, Chinese Academy of Sciences, China and Johns Hopkins University, United States; Shinji Watanabe, Johns Hopkins University, United States; Kun Wei, Northwestern Polytechnical University, China; Wangyou Zhang, Shanghai Jiao Tong University, China; Yuekai Zhang, Johns Hopkins University, United States|
|| In this study, we present recent developments in ESPnet, the end-to-end speech processing toolkit, focusing on a recently proposed architecture called the Conformer, a convolution-augmented Transformer. This paper shows results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS), and text-to-speech (TTS). Our experiments reveal various training tips and significant performance benefits obtained with the Conformer across these tasks. The results are competitive with, or even outperform, those of current state-of-the-art Transformer models. We are preparing to release all-in-one recipes with pre-trained models for all of the above tasks, built on open-source and publicly available corpora. With this work, we aim to contribute to the research community by reducing the burden of preparing state-of-the-art research environments, which usually requires substantial resources.