||Pengcheng Guo, Northwestern Polytechnical University; Johns Hopkins University, China; Florian Boyer, LaBRI, University of Bordeaux; Airudit, France; Xuankai Chang, Johns Hopkins University, United States; Tomoki Hayashi, Nagoya University; Human Dataware Lab. Co., Ltd., Japan; Yosuke Higuchi, Waseda University, Japan; Hirofumi Inaguma, Kyoto University, Japan; Naoyuki Kamo, NTT Corporation, Japan; Chenda Li, Shanghai Jiao Tong University, China; Daniel Garcia-Romero, Johns Hopkins University, United States; Jiatong Shi, Johns Hopkins University, United States; Jing Shi, Institute of Automation, Chinese Academy of Sciences, China and Johns Hopkins University, United States; Shinji Watanabe, Johns Hopkins University, United States; Kun Wei, Northwestern Polytechnical University, China; Wangyou Zhang, Shanghai Jiao Tong University, China; Yuekai Zhang, Johns Hopkins University, United States|
|| In this study, we present recent developments in ESPnet, the end-to-end speech processing toolkit, focusing on a recently proposed architecture called the Conformer, a convolution-augmented Transformer. This paper shows results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS), and text-to-speech (TTS). Our experiments reveal various training tips and significant performance benefits obtained with the Conformer across these tasks. The results are competitive with, or even outperform, those of current state-of-the-art Transformer models. We are preparing to release all-in-one recipes with pre-trained models for all of the above tasks, built on open-source and publicly available corpora. With this work, we aim to contribute to the research community by reducing the burden of preparing state-of-the-art research environments, which usually requires substantial resources.