2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Paper Detail

Paper ID: SPE-50.6
Paper Title: STABLE CHECKPOINT SELECTION AND EVALUATION IN SEQUENCE TO SEQUENCE SPEECH SYNTHESIS
Authors: Slava Shechtman, David Haws, Raul Fernandez, IBM Research, Israel
Session: SPE-50: Voice Conversion & Speech Synthesis: Singing Voice & Other Topics
Location: Gather.Town
Session Time: Friday, 11 June, 11:30 - 12:15
Presentation Time: Friday, 11 June, 11:30 - 12:15
Presentation: Poster
Topic: Speech Processing: [SPE-SYNT] Speech Synthesis and Generation
Abstract: Autoregressive Attentive Sequence-to-Sequence (S2S) speech synthesis is considered state-of-the-art in terms of speech quality and naturalness, as evaluated on a finite set of testing utterances. However, it can occasionally suffer from stability issues at inference time, such as local intelligibility problems or utterance incompletion. Frequently, a model's stability varies from one checkpoint to another, even after the training loss shows signs of convergence, making the selection of a stable model a tedious and time-consuming task. In this work we propose a novel stability metric designed for automatic checkpoint selection based on incomplete utterance counts within a validation set. The metric is based solely on attention matrix analysis in inference mode and requires no ground-truth output targets. The proposed method runs 125 times faster than real time on a GPU (Tesla K80), so it can conveniently be incorporated into training to filter out unstable checkpoints. We demonstrate, via objective and perceptual metrics, its effectiveness in selecting a robust model that attains a good trade-off between stability and quality.
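
The abstract does not give the exact definition of the stability metric, so the following is only a rough sketch of the general idea, not the authors' method. It assumes that an utterance counts as incomplete when the attention peak never reaches the last few input symbols during the final decoder steps. The function names, the `tail` and `last_steps` parameters, and the checkpoint labels are all hypothetical.

```python
from typing import Dict, Iterable, Sequence

# One attention matrix per synthesized utterance:
# rows are decoder steps, columns are input symbols.
Attention = Sequence[Sequence[float]]


def is_incomplete(attention: Attention, tail: int = 2, last_steps: int = 3) -> bool:
    """Assumed heuristic: the utterance is incomplete if, over the final
    `last_steps` decoder steps, the attention peak never lands within the
    last `tail` input positions."""
    if not attention or not attention[0]:
        return True  # nothing was decoded, treat as incomplete
    n_inputs = len(attention[0])
    threshold = max(0, n_inputs - tail)
    for row in attention[-last_steps:]:
        # Index of the most attended input symbol at this decoder step.
        peak = max(range(n_inputs), key=lambda j: row[j])
        if peak >= threshold:
            return False
    return True


def incomplete_count(attentions: Iterable[Attention]) -> int:
    """Count incomplete utterances over a validation set.
    No ground-truth output targets are needed, only attention matrices."""
    return sum(1 for a in attentions if is_incomplete(a))


def select_checkpoint(counts: Dict[str, int]) -> str:
    """Pick the checkpoint with the fewest incomplete utterances
    (the first one wins on ties)."""
    return min(counts, key=lambda name: counts[name])
```

In use, one would run inference on the validation texts for each saved checkpoint, collect the attention matrices, call `incomplete_count` per checkpoint, and pass the resulting dictionary to `select_checkpoint`. A real selection procedure would presumably also weigh quality, as the abstract's stability/quality trade-off suggests.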