| Paper ID | SPE-17.1 |
| Paper Title |
TIME-DOMAIN SPEECH EXTRACTION WITH SPATIAL INFORMATION AND MULTI SPEAKER CONDITIONING MECHANISM |
| Authors |
Jisi Zhang, University of Sheffield, United Kingdom; Cătălin Zorilă, Rama Doddipatla, Toshiba Cambridge Research Laboratory, United Kingdom; Jon Barker, University of Sheffield, United Kingdom |
| Session | SPE-17: Speech Enhancement 3: Target Speech Extraction |
| Location | Gather.Town |
| Session Time: | Wednesday, 09 June, 14:00 - 14:45 |
| Presentation Time: | Wednesday, 09 June, 14:00 - 14:45 |
| Presentation |
Poster |
| Topic |
Speech Processing: [SPE-ENHA] Speech Enhancement and Separation |
| Abstract |
In this paper, we present a novel multi-channel speech extraction system that simultaneously extracts multiple clean individual sources from a mixture in noisy and reverberant environments. The proposed method is built on an improved multi-channel time-domain speech separation network that employs speaker embeddings to identify and extract multiple targets without label permutation ambiguity. To convey speaker information to the extraction model efficiently, we propose a new speaker conditioning mechanism that adds a dedicated speaker branch for receiving external speaker embeddings. Experiments on 2-channel WHAMR! data show that the proposed system improves source separation performance by 9% relative over a strong multi-channel baseline and increases speech recognition accuracy by more than 16% relative over the same baseline. |