| Paper ID | MLSP-10.6 |
| Paper Title | EFFICIENT SPEECH EMOTION RECOGNITION USING MULTI-SCALE CNN AND ATTENTION |
| Authors | Zixuan Peng, Yu Lu, Shengfeng Pan, Yunfeng Liu, Zhuiyi Technology, China |
| Session | MLSP-10: Deep Learning for Speech and Audio |
| Location | Gather.Town |
| Session Time | Tuesday, 08 June, 16:30 - 17:15 |
| Presentation Time | Tuesday, 08 June, 16:30 - 17:15 |
| Presentation | Poster |
| Topic | Machine Learning for Signal Processing: [MLR-LMM] Learning from multimodal data |
| Abstract | Emotion recognition from speech is a challenging task. Recent advances in deep learning have established the bi-directional recurrent neural network (Bi-RNN) with an attention mechanism as a standard approach for speech emotion recognition: multi-modal features (audio and text) are extracted, attended over, and then fused for downstream emotion classification. In this paper, we propose a simple yet efficient neural network architecture to exploit both acoustic and lexical information from speech. The proposed framework uses multi-scale convolutional layers (MSCNN) to obtain hidden representations of both audio and text. A statistical pooling unit (SPU) then further extracts features within each modality. In addition, an attention module built on top of the MSCNN-SPU (audio) and MSCNN (text) outputs further improves performance. Extensive experiments show that the proposed model outperforms previous state-of-the-art methods on the IEMOCAP dataset with four emotion categories (i.e., angry, happy, sad and neutral) in both weighted accuracy (WA) and unweighted accuracy (UA), with improvements of 5.0% and 5.2% respectively under the ASR setting. |
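The abstract names three building blocks: parallel convolutions at several kernel scales (MSCNN), a statistical pooling unit (SPU) that summarizes each feature map over time, and an attention module over the two branches. Below is a minimal PyTorch sketch of that pipeline for orientation only; the kernel sizes, channel counts, the mean/max statistics inside the SPU, and the simple soft-attention pooling over text frames are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class MSCNN(nn.Module):
    """Multi-scale CNN: parallel 1-D convolutions with different kernel
    sizes over a (time, feature) sequence, concatenated along channels."""

    def __init__(self, in_dim, channels=64, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(in_dim, channels, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):               # x: (batch, time, in_dim)
        x = x.transpose(1, 2)           # -> (batch, in_dim, time)
        # One feature map per kernel scale; odd kernels + padding keep the time length.
        return torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)


class SPU(nn.Module):
    """Statistical pooling unit: summarizes each channel over time.
    Mean and max are assumed statistics, chosen here for illustration."""

    def forward(self, h):               # h: (batch, channels, time)
        return torch.cat([h.mean(dim=2), h.max(dim=2).values], dim=1)


class MSCNNSPUClassifier(nn.Module):
    """Audio branch: MSCNN -> SPU. Text branch: MSCNN -> soft-attention
    pooling (a simplification of the paper's attention module). The two
    utterance-level vectors are concatenated and classified."""

    def __init__(self, audio_dim, text_dim, n_classes=4):
        super().__init__()
        self.audio_enc = MSCNN(audio_dim)
        self.text_enc = MSCNN(text_dim)
        self.audio_spu = SPU()
        d = 64 * 3                      # channels x number of kernel scales
        self.attn = nn.Linear(d, 1)     # one attention score per text frame
        self.fc = nn.Linear(2 * d + d, n_classes)

    def forward(self, audio, text):     # (B, T_a, audio_dim), (B, T_t, text_dim)
        a = self.audio_spu(self.audio_enc(audio))       # (B, 2d) audio statistics
        t = self.text_enc(text).transpose(1, 2)         # (B, T_t, d)
        w = torch.softmax(self.attn(t), dim=1)          # (B, T_t, 1) frame weights
        t = (t * w).sum(dim=1)                          # (B, d) attended text vector
        return self.fc(torch.cat([a, t], dim=1))        # (B, n_classes)
```

A forward pass on dummy inputs, e.g. `model = MSCNNSPUClassifier(audio_dim=40, text_dim=300)` followed by `model(torch.randn(2, 120, 40), torch.randn(2, 24, 300))`, yields 4-way logits matching the four IEMOCAP emotion categories; the input dimensions here are placeholders for whatever acoustic features and word embeddings are used.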