2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information


Paper Detail

Paper ID: CHLG-3.4
Paper Title: PROSODY AND VOICE FACTORIZATION FOR FEW-SHOT SPEAKER ADAPTATION IN THE CHALLENGE M2VOC 2021
Authors: Tao Wang, Ruibo Fu, Jiangyan Yi, Jianhua Tao, Zhengqi Wen, Chunyu Qiang, Shiming Wang, Institute of Automation, Chinese Academy of Sciences, China
Session: CHLG-3: Multi-Speaker Multi-Style Voice Cloning Challenge (M2VoC)
Location: Zoom
Session Time: Monday, 07 June, 15:30 - 17:45
Presentation Time: Monday, 07 June, 15:30 - 17:45
Presentation: Poster
Topic: Grand Challenge: Multi-Speaker Multi-Style Voice Cloning Challenge (M2VoC)
Abstract: This paper describes the CASIA speech synthesis system entered in the M2VoC 2021 challenge. The low similarity and naturalness of synthesized speech remain a challenging problem for speaker adaptation with few resources: because the end-to-end acoustic model is too complex to interpret, it overfits when trained on only a small amount of data. To prevent overfitting, this paper proposes a novel speaker adaptation framework that factorizes the prosody and voice characteristics of the end-to-end model. First, a prosody control attention is proposed to control the phoneme durations of different speakers. To put the attention under the control of prosody information, a set of phoneme-level transition tokens is learned automatically from the prosody encoder, and these tokens determine the duration of each phoneme within the attention mechanism. Second, when only a small dataset is available for speaker adaptation, only the speaker-related prosody model and the decoder are adapted, which keeps the model from overfitting. Further, a data purification model is used to automatically optimize dataset quality. Experiments demonstrate the effectiveness of speaker adaptation based on this method, and our team (identifier T03) achieved top-three results in the M2VoC competition using this framework.
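To make the adaptation recipe in the abstract concrete, below is a minimal PyTorch sketch of the two ideas it describes: phoneme-level transition tokens predicted by a speaker-related prosody encoder, and few-shot adaptation that updates only the prosody modules and decoder while the rest of the model stays frozen. All module names, shapes, and hyperparameters here are illustrative assumptions; the paper's actual architecture is not specified at this level of detail in the abstract.

    # Minimal sketch (assumed architecture, not the paper's exact model) of
    # prosody/voice factorization with transition tokens and partial
    # fine-tuning for few-shot speaker adaptation.
    import torch
    import torch.nn as nn

    class FactorizedTTS(nn.Module):
        def __init__(self, n_phonemes=100, d_model=256, n_mels=80):
            super().__init__()
            # Speaker-independent text encoder: frozen during adaptation.
            self.phoneme_embedding = nn.Embedding(n_phonemes, d_model)
            self.text_encoder = nn.GRU(d_model, d_model, batch_first=True)
            # Speaker-related prosody encoder: emits one transition token per
            # phoneme; a token near 1 signals the attention to move on to the
            # next phoneme, so the tokens implicitly set phoneme durations.
            self.prosody_encoder = nn.GRU(d_model, d_model, batch_first=True)
            self.transition_head = nn.Linear(d_model, 1)
            # Speaker-related decoder: maps encoded phonemes to mel frames.
            self.decoder = nn.GRU(d_model, n_mels, batch_first=True)

        def transition_tokens(self, phonemes):
            # phonemes: (batch, seq) integer ids -> (batch, seq) tokens in (0, 1)
            x = self.phoneme_embedding(phonemes)
            h, _ = self.prosody_encoder(x)
            return torch.sigmoid(self.transition_head(h)).squeeze(-1)

    def freeze_for_adaptation(model):
        # Few-shot adaptation step from the abstract: only the speaker-related
        # prosody model and decoder receive gradients; everything else is frozen.
        for p in model.parameters():
            p.requires_grad = False
        for m in (model.prosody_encoder, model.transition_head, model.decoder):
            for p in m.parameters():
                p.requires_grad = True

    model = FactorizedTTS()
    freeze_for_adaptation(model)
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)

    # Example: transition tokens for a batch of two 5-phoneme utterances.
    tokens = model.transition_tokens(torch.randint(0, 100, (2, 5)))
    print(tokens.shape)  # torch.Size([2, 5])

In this factorization, the frozen text encoder carries speaker-independent content, while the adapted prosody path and decoder capture speaker-specific timing and voice; restricting gradient updates to those modules is what limits overfitting on few-shot data.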