| Paper ID | SPE-33.5 | ||
| Paper Title | UNSUPERVISED LEARNING FOR MULTI-STYLE SPEECH SYNTHESIS WITH LIMITED DATA | ||
| Authors | Shuang Liang, Chenfeng Miao, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao, Ping An Technology, China | ||
| Session | SPE-33: Speech Synthesis 5: Prosody & Style | ||
| Location | Gather.Town | ||
| Session Time | Thursday, 10 June, 13:00 - 13:45 | ||
| Presentation Time | Thursday, 10 June, 13:00 - 13:45 | ||
| Presentation | Poster | ||
| Topic | Speech Processing: [SPE-SYNT] Speech Synthesis and Generation | ||
| Abstract | Existing multi-style speech synthesis methods require either style labels or large amounts of unlabeled training data, making data acquisition difficult. In this paper, we present an unsupervised multi-style speech synthesis method that can be trained with limited data. We leverage an instance discriminator to guide a style encoder to learn meaningful style representations from a multi-style dataset. Furthermore, we employ an information bottleneck to filter out style-irrelevant information in the representations, which can improve speech quality and style similarity. Our method is able to produce desirable speech using a fairly small dataset, where the baseline GST-Tacotron fails. ABX tests show that our model significantly outperforms GST-Tacotron in both the emotional speech synthesis task and the multi-speaker speech synthesis task. In addition, we demonstrate that our method is able to learn meaningful style features with only 50 training samples per style. | ||
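
The abstract does not say how the instance discriminator is formulated. As a rough illustration only, the sketch below assumes the common non-parametric softmax form of instance discrimination, in which every training utterance is treated as its own class and a style embedding is scored against a memory bank of per-utterance embeddings. The function names, the memory bank, and the temperature value are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def instance_discrimination_loss(embedding, memory_bank, index, temperature=0.1):
    """Negative log-probability that `embedding` is recognised as instance `index`.

    Assumed formulation (not confirmed by the abstract): a softmax over
    temperature-scaled cosine similarities between the style embedding and
    one stored embedding per training utterance.
    """
    v = l2_normalize(embedding)
    logits = [
        sum(a * b for a, b in zip(v, l2_normalize(m))) / temperature
        for m in memory_bank
    ]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[index]


# Toy memory bank holding one 2-D style embedding per utterance (hypothetical data).
bank = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [0.9, 0.1]
loss_correct = instance_discrimination_loss(query, bank, index=0)
loss_wrong = instance_discrimination_loss(query, bank, index=1)
```

In this toy setup the query lies closest to the first stored embedding, so `loss_correct` is smaller than `loss_wrong`. Minimising such a loss would push a style encoder to give each utterance a distinguishable embedding without any style labels. The information bottleneck mentioned in the abstract is not modelled here.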