| Paper ID | SPE-3.4 |
| Paper Title | END-TO-END TEXT-TO-SPEECH USING LATENT DURATION BASED ON VQ-VAE |
| Authors | Yusuke Yasuda, Xin Wang, Junichi Yamagishi, National Institute of Informatics, Japan |
| Session | SPE-3: Speech Synthesis 1: Architecture |
| Location | Gather.Town |
| Session Time | Tuesday, 08 June, 13:00 - 13:45 |
| Presentation Time | Tuesday, 08 June, 13:00 - 13:45 |
| Presentation | Poster |
| Topic | Speech Processing: [SPE-SYNT] Speech Synthesis and Generation |
| Abstract | Explicit duration modeling is key to achieving robust and efficient alignment in text-to-speech synthesis (TTS). We propose a new TTS framework with explicit duration modeling that incorporates duration into TTS as a discrete latent variable and enables joint optimization of all modules from scratch. We formulate our method as a conditional VQ-VAE to handle discrete duration within a variational autoencoder and provide a theoretical justification for our approach. In our framework, a connectionist temporal classification (CTC)-based forced aligner acts as the approximate posterior, and a text-to-duration model serves as the prior in the variational autoencoder. We evaluated the proposed method with a listening test, comparing it against other TTS methods based on soft attention or explicit duration modeling. The results showed that our systems were rated between the soft-attention-based methods (Transformer-TTS, Tacotron2) and the explicit-duration-modeling-based methods (FastSpeech). |
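The abstract describes a conditional VAE in which the CTC-based forced aligner plays the role of the approximate posterior over durations and the text-to-duration model plays the role of the prior. The sketch below writes out a generic conditional evidence lower bound (ELBO) consistent with those roles; the notation (x for speech, y for text, d for the discrete duration sequence) is introduced here purely for illustration, and the paper's exact objective, which would additionally involve VQ-VAE codebook and commitment terms, may differ.

```latex
% Illustrative conditional ELBO with a discrete duration latent d,
% text y, and speech x. Notation is a sketch, not the paper's own.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
\begin{align*}
\log p_\theta(x \mid y)
  \;\ge\; \mathbb{E}_{q_\phi(d \mid x, y)}\!\left[ \log p_\theta(x \mid d, y) \right]
  \;-\; \mathrm{KL}\!\left( q_\phi(d \mid x, y) \,\middle\|\, p_\psi(d \mid y) \right),
\end{align*}
where $q_\phi(d \mid x, y)$ corresponds to the CTC-based forced aligner
(approximate posterior), $p_\psi(d \mid y)$ to the text-to-duration model
(prior), and $p_\theta(x \mid d, y)$ to the speech decoder. At inference
time, durations are drawn from the prior $p_\psi(d \mid y)$, so no
reference speech is needed.
\end{document}
```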