2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Technical Program

Paper Detail

Paper ID: SPE-39.6
Paper Title: HYPOTHESIS STITCHER FOR END-TO-END SPEAKER-ATTRIBUTED ASR ON LONG-FORM MULTI-TALKER RECORDINGS
Authors: Xuankai Chang, Johns Hopkins University, United States; Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka, Microsoft Corporation, United States
Session: SPE-39: Speech Recognition 13: Acoustic Modeling 1
Location: Gather.Town
Session Time: Thursday, 10 June, 15:30 - 16:15
Presentation Time: Thursday, 10 June, 15:30 - 16:15
Presentation: Poster
Topic: Speech Processing: [SPE-GASR] General Topics in Speech Recognition
Abstract: Recently, an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed as a joint model of speaker counting, speech recognition, and speaker identification. The E2E SA-ASR model has shown significant improvements in speaker-attributed word error rate (SA-WER) for monaural overlapped speech consisting of various numbers of speakers. However, E2E models are known to suffer from degradation due to mismatches between training and testing conditions. In particular, it has not yet been investigated whether the E2E SA-ASR model works well for very long recordings, i.e., recordings longer than those in the training data. In this paper, we first explore E2E SA-ASR for long-form multi-talker recordings by applying a known long-form decoding algorithm developed for single-speaker ASR. We then propose a novel method, called hypothesis stitcher, that takes multiple hypotheses obtained from short audio segments and fuses them into a single output hypothesis. We propose several variants of model architectures for the hypothesis stitcher and evaluate them against conventional decoding methods. In our evaluation with the LibriSpeech and LibriCSS corpora, we show that the proposed method significantly improves SA-WER, especially for long-form multi-talker recordings.
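The segment-then-fuse flow described in the abstract can be sketched in outline. This is a minimal illustration only: the function names (`segment_audio`, `decode_segment`, `stitch`), the separator token, and the window/hop values are assumptions for exposition, not the paper's actual API — in the real system, per-segment decoding is done by the E2E SA-ASR model and fusion by a trained sequence-to-sequence stitcher model.

```python
# Hypothetical sketch of the hypothesis-stitcher data flow.
# All names and parameters here are illustrative assumptions.

SEP = "<sep>"  # assumed separator token between per-segment hypotheses


def segment_audio(samples, seg_len, hop):
    """Split a long recording into fixed-length (possibly overlapping) windows."""
    return [samples[i:i + seg_len]
            for i in range(0, max(1, len(samples) - seg_len + 1), hop)]


def decode_segment(segment):
    """Stand-in for running the E2E SA-ASR model on one short segment."""
    # The real model emits speaker-attributed token sequences; we return a
    # placeholder string so the data flow is visible.
    return f"hyp({len(segment)})"


def stitch(hypotheses):
    """Stand-in for the stitcher model: fuse per-segment hypotheses into one."""
    # The paper trains a sequence-to-sequence model for this step; joining
    # with a separator here only shows the input/output contract.
    return SEP.join(hypotheses)


samples = list(range(100))                       # toy "long-form recording"
segments = segment_audio(samples, seg_len=40, hop=30)
fused = stitch([decode_segment(s) for s in segments])
```

The key design point, as the abstract states, is that short segments keep the E2E model within the conditions it was trained on, while the stitcher resolves the resulting per-segment hypotheses into one long-form output.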