2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

IEEE Signal Processing Society

Institute of Electrical and Electronics Engineers (IEEE)

Technical Program

Paper Detail

Paper ID	SPE-4.6
Paper Title	EMOTION CONTROLLABLE SPEECH SYNTHESIS USING EMOTION-UNLABELED DATASET WITH THE ASSISTANCE OF CROSS-DOMAIN SPEECH EMOTION RECOGNITION
Authors	Xiong Cai, Dongyang Dai, Zhiyong Wu, Xiang Li, Jingbei Li, Tsinghua University, China; Helen Meng, Chinese University of Hong Kong, Hong Kong SAR China
Session	SPE-4: Speech Synthesis 2: Controllability
Location	Gather.Town
Session Time	Tuesday, 08 June, 13:00 - 13:45
Presentation Time	Tuesday, 08 June, 13:00 - 13:45
Presentation	Poster
Topic	Speech Processing: [SPE-SYNT] Speech Synthesis and Generation
Abstract	Neural text-to-speech (TTS) approaches generally require a large amount of high-quality speech data, and it is difficult to obtain such a dataset with additional emotion labels. In this paper, we propose a novel approach for emotional TTS synthesis on a TTS dataset without emotion labels. Specifically, our proposed method consists of a cross-domain speech emotion recognition (SER) model and an emotional TTS model. First, we train the cross-domain SER model on both SER and TTS datasets. Then, we use the soft labels that the trained SER model predicts for the TTS dataset to build an auxiliary SER task that is jointly trained with the TTS model. Experimental results show that our proposed method can generate speech with the specified emotional expressiveness and almost no degradation in speech quality.
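
The joint training step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formulation, so everything here is an assumption made for illustration: the use of cross-entropy against soft labels, the additive combination with the TTS loss, and the names `soft_label_cross_entropy`, `joint_loss`, and `ser_weight` are all hypothetical and not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def soft_label_cross_entropy(pred_logits, soft_labels):
    """Cross-entropy between a soft target distribution and predicted logits.

    soft_labels stands in for the emotion posterior that a trained SER
    model would output for one TTS utterance (assumed to sum to 1).
    """
    log_probs = log_softmax(pred_logits)
    return -sum(q * lp for q, lp in zip(soft_labels, log_probs))


def joint_loss(tts_loss, pred_logits, soft_labels, ser_weight=1.0):
    """Hypothetical joint objective: TTS loss plus a weighted auxiliary SER loss.

    ser_weight is an assumed hyperparameter, not a value from the paper.
    """
    return tts_loss + ser_weight * soft_label_cross_entropy(pred_logits, soft_labels)


if __name__ == "__main__":
    # Toy example with two emotion classes and an arbitrary TTS loss value.
    print(joint_loss(tts_loss=1.0, pred_logits=[0.0, 0.0],
                     soft_labels=[0.5, 0.5], ser_weight=0.5))
```

In this sketch the soft labels act as targets rather than hard class indices, so an utterance that the SER model finds ambiguous contributes a correspondingly spread-out target to the auxiliary task.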