2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

IEEE Signal Processing Society

Institute of Electrical and Electronics Engineers (IEEE)

Technical Program

Paper Detail

Paper ID	SAM-13.5
Paper Title	DATA FUSION FOR AUDIOVISUAL SPEAKER LOCALIZATION: EXTENDING DYNAMIC STREAM WEIGHTS TO THE SPATIAL DOMAIN
Authors	Julio Wissing, Benedikt Boenninghoff, Dorothea Kolossa, Ruhr University Bochum, Germany; Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, NTT Corporation, Japan; Christopher Schymura, Ruhr University Bochum, Germany
Session	SAM-13: Multi-Channel Data Fusion and Processing
Location	Gather.Town
Session Time	Friday, 11 June, 14:00 - 14:45
Presentation Time	Friday, 11 June, 14:00 - 14:45
Presentation	Poster
Topic	Sensor Array and Multichannel Signal Processing: [SAM-DOAE] Direction of arrival estimation and source localization
Abstract	Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. Both applications benefit from a known speaker position when, for instance, applying beamforming or assigning unique speaker identities. Recently, several approaches utilizing acoustic signals augmented with visual data have been proposed for this task. However, both the acoustic and the visual modality may be corrupted in specific spatial regions, for instance due to poor lighting conditions or to the presence of background noise. This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions in the localization space. This fusion is achieved via a neural network, which combines the predictions of individual audio and video trackers based on their time- and location-dependent reliability. A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
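
The fusion idea in the abstract can be illustrated with a minimal sketch. It assumes the localization space is discretized into a small grid of cells, that each tracker outputs a probability per cell, and that fusion is a log-linear combination with one weight per cell. This is a common form for stream weights, but it is not confirmed by the abstract as the authors' exact formulation. The function name, the grid size, and the hand-picked weights are hypothetical; in the paper, a neural network supplies the time- and location-dependent weights.

```python
import math


def fuse_spatial_stream_weights(audio_post, video_post, weights):
    """Fuse two discrete posteriors defined over the same spatial grid.

    Assumed (hypothetical) fusion rule, applied per grid cell i:
        fused_i is proportional to audio_i ** w_i * video_i ** (1 - w_i)
    where w_i in [0, 1] expresses how much the audio stream is trusted
    in cell i. The result is renormalized to sum to one.
    """
    if not (len(audio_post) == len(video_post) == len(weights)):
        raise ValueError("all inputs must cover the same grid")
    eps = 1e-12  # guards against log(0)
    log_fused = [
        w * math.log(a + eps) + (1.0 - w) * math.log(v + eps)
        for a, v, w in zip(audio_post, video_post, weights)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_fused)
    unnormalized = [math.exp(value - peak) for value in log_fused]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical example with four azimuth cells.
audio = [0.1, 0.6, 0.2, 0.1]       # audio tracker favors cell 1
video = [0.25, 0.25, 0.25, 0.25]   # video tracker is uninformative
weights = [0.9, 0.9, 0.2, 0.2]     # audio trusted in cells 0-1, video in 2-3

fused = fuse_spatial_stream_weights(audio, video, weights)
best_cell = max(range(len(fused)), key=fused.__getitem__)
```

Because the weights vary across cells, a modality that is corrupted in one region (for example, video under poor lighting) can be down-weighted there while remaining influential elsewhere.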