2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Technical Program

Paper Detail

Paper ID: SPE-39.6
Paper Title: HYPOTHESIS STITCHER FOR END-TO-END SPEAKER-ATTRIBUTED ASR ON LONG-FORM MULTI-TALKER RECORDINGS
Authors: Xuankai Chang, Johns Hopkins University, United States; Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka, Microsoft Corporation, United States
Session: SPE-39: Speech Recognition 13: Acoustic Modeling 1
Location: Gather.Town
Session Time: Thursday, 10 June, 15:30 - 16:15
Presentation Time: Thursday, 10 June, 15:30 - 16:15
Presentation: Poster
Topic: Speech Processing: [SPE-GASR] General Topics in Speech Recognition
Abstract: Recently, an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed as a joint model of speaker counting, speech recognition, and speaker identification. The E2E SA-ASR model has shown significant improvements in speaker-attributed word error rate (SA-WER) for monaural overlapped speech consisting of various numbers of speakers. However, E2E models are known to suffer from degradation due to mismatches between training and testing conditions. In particular, it has not yet been investigated whether the E2E SA-ASR model works well for very long recordings, i.e., recordings longer than those in the training data. In this paper, we first explore E2E SA-ASR for long-form multi-talker recordings by applying a known long-form decoding algorithm developed for single-speaker ASR. We then propose a novel method, called hypothesis stitcher, that takes multiple hypotheses obtained from short audio segments and fuses them into a single output hypothesis. We propose several variants of model architectures for the hypothesis stitcher and evaluate them against conventional decoding methods. In our evaluation with the LibriSpeech and LibriCSS corpora, we show that the proposed method significantly improves SA-WER, especially for long-form multi-talker recordings.
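The segment-then-fuse flow described in the abstract can be sketched in outline. This is a minimal illustration only: the function names (`segment_audio`, `decode_segment`, `stitch`), the separator token, and the window/hop values are assumptions for exposition, not the paper's actual API — in the real system, per-segment decoding is done by the E2E SA-ASR model and fusion by a trained sequence-to-sequence stitcher model.

```python
# Hypothetical sketch of the hypothesis-stitcher data flow.
# All names and parameters here are illustrative assumptions.

SEP = "<sep>"  # assumed separator token between per-segment hypotheses


def segment_audio(samples, seg_len, hop):
    """Split a long recording into fixed-length (possibly overlapping) windows."""
    return [samples[i:i + seg_len]
            for i in range(0, max(1, len(samples) - seg_len + 1), hop)]


def decode_segment(segment):
    """Stand-in for running the E2E SA-ASR model on one short segment."""
    # The real model emits speaker-attributed token sequences; we return a
    # placeholder string so the data flow is visible.
    return f"hyp({len(segment)})"


def stitch(hypotheses):
    """Stand-in for the stitcher model: fuse per-segment hypotheses into one."""
    # The paper trains a sequence-to-sequence model for this step; joining
    # with a separator here only shows the input/output contract.
    return SEP.join(hypotheses)


samples = list(range(100))                       # toy "long-form recording"
segments = segment_audio(samples, seg_len=40, hop=30)
fused = stitch([decode_segment(s) for s in segments])
```

The key design point, as the abstract states, is that short segments keep the E2E model within the conditions it was trained on, while the stitcher resolves the resulting per-segment hypotheses into one long-form output.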