| Paper ID | SPE-20.4 |
| Paper Title | DEEPTALK: VOCAL STYLE ENCODING FOR SPEAKER RECOGNITION AND SPEECH SYNTHESIS |
| Authors | Anurag Chowdhury, Arun Ross, Prabu David, Michigan State University, United States |
| Session | SPE-20: Speaker Recognition 4: Applications |
| Location | Gather.Town |
| Session Time | Wednesday, 09 June, 14:00 - 14:45 |
| Presentation Time | Wednesday, 09 June, 14:00 - 14:45 |
| Presentation | Poster |
| Topic | Speech Processing: [SPE-SPKR] Speaker Recognition and Characterization |
| Abstract | Automatic speaker recognition algorithms typically characterize speech audio using short-term spectral features that encode the physiological and anatomical aspects of speech production. Such algorithms do not fully capitalize on speaker-dependent characteristics present in behavioral speech features. In this work, we propose a prosody encoding network called DeepTalk for extracting vocal style features directly from raw audio data. The DeepTalk method outperforms several state-of-the-art speaker recognition systems across multiple challenging datasets. The speaker recognition performance is further improved by combining DeepTalk with a state-of-the-art physiological speech feature-based speaker recognition system. We also integrate DeepTalk into a current state-of-the-art speech synthesizer to generate synthetic speech. A detailed analysis of the synthetic speech shows that DeepTalk captures the F0 contours essential for vocal style modeling. Furthermore, DeepTalk-based synthetic speech is shown to be almost indistinguishable from real speech in the context of speaker recognition. |