IEEE ICASSP 2021 || Toronto, Ontario, Canada || 6-11 June 2021

Paper Detail

Paper ID

SPE-22.5

Paper Title

DECOUPLING PRONUNCIATION AND LANGUAGE FOR END-TO-END CODE-SWITCHING AUTOMATIC SPEECH RECOGNITION

Authors

Shuai Zhang, School of Artificial Intelligence, University of Chinese Academy of Sciences, China; Jiangyan Yi, Institute of Automation, Chinese Academy of Sciences, China; Zhengkun Tian, Ye Bai, Jianhua Tao, Zhengqi Wen, School of Artificial Intelligence, University of Chinese Academy of Sciences, China

Session

SPE-22: Speech Recognition 8: Multilingual Speech Recognition

Location

Gather.Town

Session Time:

Wednesday, 09 June, 15:30 - 16:15

Presentation Time:

Wednesday, 09 June, 15:30 - 16:15

Presentation

Poster

Topic

Speech Processing: [SPE-MULT] Multilingual Recognition and Identification

Abstract

Despite recent significant advances in end-to-end (E2E) ASR systems for code-switching, their need for large amounts of paired audio-text data limits further improvement in model performance. In this paper, we propose a decoupled transformer model that uses monolingual paired data and unpaired text data to alleviate the shortage of code-switching data. The model is decoupled into two parts: an audio-to-phoneme (A2P) network and a phoneme-to-text (P2T) network. The A2P network learns acoustic patterns from large-scale monolingual paired data. During training, it also generates multiple candidate phoneme sequences for each audio sample in real time. The resulting phoneme-text pairs are then used to train the P2T network, which can be pre-trained on large amounts of external unpaired text data. By using monolingual data and unpaired text data, the decoupled transformer model reduces, to a certain extent, the E2E model's heavy dependence on code-switching paired training data. Finally, the two networks are optimized jointly through attention fusion. We evaluate the proposed method on a public Mandarin-English code-switching dataset. Compared with our transformer baseline, the proposed method achieves an 18.14% relative reduction in mix error rate.
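
The abstract reports its result as a relative mix error rate reduction without defining the metric. The sketch below shows one common way such a metric is computed for Mandarin-English speech, under the assumption (not stated in the abstract) that Mandarin is scored per character and English per word. The function names and the tokenization rule are hypothetical and are not taken from the paper.

```python
import re

# Assumption: one token per CJK character, one token per run of Latin letters.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z']+")


def tokenize_mixed(text):
    """Split a mixed Mandarin-English string into characters and words."""
    return [tok.lower() for tok in _TOKEN.findall(text)]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def mix_error_rate(refs, hyps):
    """Total token errors divided by total reference tokens."""
    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens = tokenize_mixed(ref)
        errors += edit_distance(ref_tokens, tokenize_mixed(hyp))
        total += len(ref_tokens)
    return errors / total


def relative_reduction(baseline, proposed):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - proposed) / baseline
```

With this definition, a reported relative reduction is the difference between the baseline and proposed error rates divided by the baseline error rate, so it is not the same as the absolute difference in percentage points.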

2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information
