2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Technical Program

Paper Detail

Paper IDSPE-33.3
Paper Title PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH
Authors Sri Karlapati, Ammar Abbas, Amazon, United Kingdom; Zack Hodari, University of Edinburgh, United Kingdom; Alexis Moinet, Arnaud Joly, Penny Karanasou, Thomas Drugman, Amazon, United Kingdom
SessionSPE-33: Speech Synthesis 5: Prosody & Style
LocationGather.Town
Session Time:Thursday, 10 June, 13:00 - 13:45
Presentation Time:Thursday, 10 June, 13:00 - 13:45
Presentation Poster
Topic Speech Processing: [SPE-SYNT] Speech Synthesis and Generation
IEEE Xplore Open Preview  Click here to view in IEEE Xplore
Virtual Presentation  Click here to watch in the Virtual Conference
Abstract In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of 13.2% in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.