| Paper ID | MLSP-10.6 |
| Paper Title | EFFICIENT SPEECH EMOTION RECOGNITION USING MULTI-SCALE CNN AND ATTENTION |
| Authors | Zixuan Peng, Yu Lu, Shengfeng Pan, Yunfeng Liu, Zhuiyi Technology, China |
| Session | MLSP-10: Deep Learning for Speech and Audio |
| Location | Gather.Town |
| Session Time | Tuesday, 08 June, 16:30 - 17:15 |
| Presentation Time | Tuesday, 08 June, 16:30 - 17:15 |
| Presentation | Poster |
| Topic | Machine Learning for Signal Processing: [MLR-LMM] Learning from multimodal data |
| Abstract | Emotion recognition from speech is a challenging task. Recent advances in deep learning have established the bi-directional recurrent neural network (Bi-RNN) with an attention mechanism as a standard approach for speech emotion recognition: multi-modal features (audio and text) are extracted, attended over, and then fused for downstream emotion classification. In this paper, we propose a simple yet efficient neural network architecture to exploit both acoustic and lexical information from speech. The proposed framework uses multi-scale convolutional layers (MSCNN) to obtain hidden representations of both audio and text. A statistical pooling unit (SPU) then further extracts features within each modality. In addition, an attention module built on top of the MSCNN-SPU (audio) and MSCNN (text) outputs further improves performance. Extensive experiments show that the proposed model outperforms previous state-of-the-art methods on the IEMOCAP dataset with four emotion categories (i.e., angry, happy, sad and neutral) in both weighted accuracy (WA) and unweighted accuracy (UA), with improvements of 5.0% and 5.2% respectively under the ASR setting. |
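The abstract names three building blocks: parallel convolutions at several kernel scales (MSCNN), a statistical pooling unit (SPU) that summarizes each feature map over time, and an attention module over the two branches. Below is a minimal PyTorch sketch of that pipeline for orientation only; the kernel sizes, channel counts, the mean/max statistics inside the SPU, and the simple soft-attention pooling over text frames are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class MSCNN(nn.Module):
    """Multi-scale CNN: parallel 1-D convolutions with different kernel
    sizes over a (time, feature) sequence, concatenated along channels."""

    def __init__(self, in_dim, channels=64, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(in_dim, channels, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):               # x: (batch, time, in_dim)
        x = x.transpose(1, 2)           # -> (batch, in_dim, time)
        # One feature map per kernel scale; odd kernels + padding keep the time length.
        return torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)


class SPU(nn.Module):
    """Statistical pooling unit: summarizes each channel over time.
    Mean and max are assumed statistics, chosen here for illustration."""

    def forward(self, h):               # h: (batch, channels, time)
        return torch.cat([h.mean(dim=2), h.max(dim=2).values], dim=1)


class MSCNNSPUClassifier(nn.Module):
    """Audio branch: MSCNN -> SPU. Text branch: MSCNN -> soft-attention
    pooling (a simplification of the paper's attention module). The two
    utterance-level vectors are concatenated and classified."""

    def __init__(self, audio_dim, text_dim, n_classes=4):
        super().__init__()
        self.audio_enc = MSCNN(audio_dim)
        self.text_enc = MSCNN(text_dim)
        self.audio_spu = SPU()
        d = 64 * 3                      # channels x number of kernel scales
        self.attn = nn.Linear(d, 1)     # one attention score per text frame
        self.fc = nn.Linear(2 * d + d, n_classes)

    def forward(self, audio, text):     # (B, T_a, audio_dim), (B, T_t, text_dim)
        a = self.audio_spu(self.audio_enc(audio))       # (B, 2d) audio statistics
        t = self.text_enc(text).transpose(1, 2)         # (B, T_t, d)
        w = torch.softmax(self.attn(t), dim=1)          # (B, T_t, 1) frame weights
        t = (t * w).sum(dim=1)                          # (B, d) attended text vector
        return self.fc(torch.cat([a, t], dim=1))        # (B, n_classes)
```

A forward pass on dummy inputs, e.g. `model = MSCNNSPUClassifier(audio_dim=40, text_dim=300)` followed by `model(torch.randn(2, 120, 40), torch.randn(2, 24, 300))`, yields 4-way logits matching the four IEMOCAP emotion categories; the input dimensions here are placeholders for whatever acoustic features and word embeddings are used.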