2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

Paper ID AUD-24.2
Paper Title TEACHER-STUDENT LEARNING FOR LOW-LATENCY ONLINE SPEECH ENHANCEMENT USING WAVE-U-NET
Authors Sotaro Nakaoka, Li Li, Shota Inoue, Shoji Makino, University of Tsukuba, Japan
Session AUD-24: Signal Enhancement and Restoration 1: Deep Learning
Location Gather.Town
Session Time: Thursday, 10 June, 16:30 - 17:15
Presentation Time: Thursday, 10 June, 16:30 - 17:15
Presentation Poster
Topic Audio and Acoustic Signal Processing: [AUD-SEN] Signal Enhancement and Restoration
Abstract This paper proposes a low-latency online extension of Wave-U-Net for single-channel speech enhancement, which uses teacher-student learning to reduce the system latency while maintaining high enhancement performance. Wave-U-Net is a recently proposed end-to-end source separation method that has achieved remarkable performance in singing voice separation and speech enhancement tasks. Since the enhancement is performed in the time domain, Wave-U-Net can efficiently model phase information and avoids the limitations of the domain transformation that arise when, as is usual, the time-frequency domain is adopted. We intend to apply Wave-U-Net to face-to-face applications such as hearing aids and in-car communication systems, where a strict latency of less than 10 ms is required. To this end, we investigate online versions of Wave-U-Net and propose using teacher-student learning to avoid the performance degradation caused by reducing the input segment length so that the system delay on a CPU is less than 10 ms. The experimental results show that the proposed model can run in real time with low latency while achieving a signal-to-distortion ratio improvement of about 8.35 dB.
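The quantities named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the 16 kHz sampling rate, the simplified SDR definition (energy ratio of the clean signal to the residual error, not a full BSS Eval computation), and the hypothetical `student_loss` that mixes a clean-target term with a teacher-output term using an assumed weight `alpha`. Note also that segment buffering is only one component of the system delay; processing time on the CPU adds to it.

```python
import math


def segment_latency_ms(num_samples, sample_rate_hz):
    """Time needed to buffer one input segment, in milliseconds."""
    return 1000.0 * num_samples / sample_rate_hz


def energy(signal):
    """Sum of squared samples."""
    return sum(v * v for v in signal)


def sdr_db(reference, estimate):
    """Simplified SDR (assumption): 10*log10(||s||^2 / ||s - s_hat||^2)."""
    error = [r - e for r, e in zip(reference, estimate)]
    return 10.0 * math.log10(energy(reference) / energy(error))


def sdr_improvement_db(reference, noisy, enhanced):
    """SDR of the enhanced signal minus SDR of the unprocessed noisy input."""
    return sdr_db(reference, enhanced) - sdr_db(reference, noisy)


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def student_loss(student_out, teacher_out, clean, alpha=0.5):
    """Hypothetical teacher-student loss (assumed form, not the paper's):
    the short-segment student is pulled toward both the clean target and
    the output of a long-segment teacher."""
    return alpha * mse(student_out, clean) + (1.0 - alpha) * mse(student_out, teacher_out)


if __name__ == "__main__":
    # Assumed 16 kHz sampling rate: a 160-sample segment takes 10 ms to buffer.
    print(segment_latency_ms(160, 16000))

    clean = [1.0, -1.0, 1.0, -1.0]
    noisy = [1.5, -0.5, 1.5, -0.5]
    enhanced = [1.25, -0.75, 1.25, -0.75]
    print(round(sdr_improvement_db(clean, noisy, enhanced), 2))
```

Under the assumed sampling rate, the buffering term alone already consumes the full 10 ms budget at 160 samples, which is why shortening the input segment matters for the latency target described above.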