| Paper ID | SPE-36.2 |
| Paper Title | VSET: A Multimodal Transformer for Visual Speech Enhancement |
| Authors | Karthik Ramesh, Chao Xing, Wupeng Wang, Huawei, Canada; Dong Wang, Tsinghua University, China; Xiao Chen, Huawei, Hong Kong SAR China |
| Session | SPE-36: Speech Enhancement 6: Multi-modal Processing |
| Location | Gather.Town |
| Session Time | Thursday, 10 June, 14:00 - 14:45 |
| Presentation Time | Thursday, 10 June, 14:00 - 14:45 |
| Presentation | Poster |
| Topic | Speech Processing: [SPE-ENHA] Speech Enhancement and Separation |
| Abstract | The transformer architecture has shown great capability in learning long-term dependencies and works well across multiple domains. However, the transformer has received less attention in audio-visual speech enhancement (AVSE) research, partly due to the convention of treating speech enhancement as a short-time signal processing task. In this paper, we challenge this common belief and show that an audio-visual transformer can significantly improve AVSE performance by learning long-term dependencies both within each modality (intra-modality) and across modalities (inter-modality). We evaluate this transformer-based AVSE model on the GRID and AVSpeech datasets and show that it outperforms several state-of-the-art models by a large margin. |
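The abstract's core idea is a transformer that models intra-modality context (within the audio stream and within the video stream) and inter-modality context (audio attending to video). The sketch below is a minimal illustration of that idea, not the authors' VSET implementation: the module name `AudioVisualFusionBlock`, the dimensions, and the fusion scheme are all illustrative assumptions.

```python
# Minimal sketch (assumed design, not the paper's code): self-attention within each
# modality captures intra-modality long-term dependency; cross-attention from audio
# queries to visual keys/values captures inter-modality dependency.
import torch
import torch.nn as nn


class AudioVisualFusionBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_c = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, audio, video):
        # audio: (batch, T_audio, dim) noisy-speech frame embeddings
        # video: (batch, T_video, dim) lip/face frame embeddings
        a, _ = self.audio_self(audio, audio, audio)   # intra-modality: audio
        a = self.norm_a(audio + a)
        v, _ = self.video_self(video, video, video)   # intra-modality: visual
        v = self.norm_v(video + v)
        c, _ = self.cross(a, v, v)                    # inter-modality: audio attends to video
        c = self.norm_c(a + c)
        return c + self.ffn(c)                        # fused features for downstream enhancement


# Toy usage: 100 audio frames and 25 video frames; cross-attention does not
# require the two sequences to have the same length or frame rate.
block = AudioVisualFusionBlock()
fused = block(torch.randn(2, 100, 256), torch.randn(2, 25, 256))
print(fused.shape)  # torch.Size([2, 100, 256])
```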