2021 IEEE International Conference on Acoustics, Speech and Signal Processing

Technical Program

Paper ID	HLT-12.2
Paper Title	REPLACING HUMAN AUDIO WITH SYNTHETIC AUDIO FOR ON-DEVICE UNSPOKEN PUNCTUATION PREDICTION
Authors	Daria Soboleva, Ondrej Skopek, Márius Šajgalík, Victor Cărbune, Felix Weissenberger, Julia Proskurnia, Bogdan Prisacari, Daniel Valcarce, Justin Lu, Rohit Prabhavalkar, Balint Miklos, Google, Switzerland
Session	HLT-12: Language Understanding 4: Semantic Understanding
Location	Gather.Town
Session Time	Thursday, 10 June, 13:00 - 13:45
Presentation Time	Thursday, 10 June, 13:00 - 13:45
Presentation	Poster
Topic	Human Language Technology: [HLT-UNDE] Spoken Language Understanding and Computational Semantics
Abstract	We present a novel multi-modal unspoken punctuation prediction system for English that combines acoustic and text features. We demonstrate, for the first time, that by relying exclusively on synthetic data generated using a prosody-aware text-to-speech system, we can outperform a model trained with expensive human audio recordings on the unspoken punctuation prediction problem. Our model architecture is well suited for on-device use: it feeds hash-based embeddings of automatic speech recognition text output, together with acoustic features, into a quasi-recurrent neural network, which keeps the model size small and latency low.
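As a rough illustration of the input side described in the abstract, the following minimal sketch shows one common way to build hash-based embeddings (the plain hashing trick) and concatenate them with per-token acoustic features. It is not the authors' implementation: the hashing scheme, the table size, the embedding width, the helper names (`bucket`, `embed`, `build_inputs`) and the two acoustic features (pause length and pitch) are all assumptions made for this example, and the quasi-recurrent network that would consume these vectors is not shown.

```python
import random
import zlib

# Hypothetical sizes, chosen only for illustration.
NUM_BUCKETS = 64   # number of rows in the embedding table
EMBED_DIM = 4      # width of each embedding vector

# A fixed-size table: its memory footprint does not grow with the vocabulary.
# Values are random here; in a real model they would be learned.
_rng = random.Random(0)
TABLE = [[_rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
         for _ in range(NUM_BUCKETS)]


def bucket(token):
    """Map a token to a table row using a stable hash (CRC32)."""
    return zlib.crc32(token.encode("utf-8")) % NUM_BUCKETS


def embed(token):
    """Look up the embedding for a token; unseen words still get a row."""
    return TABLE[bucket(token)]


def build_inputs(tokens, acoustic):
    """Concatenate each token's hashed embedding with its acoustic features."""
    if len(tokens) != len(acoustic):
        raise ValueError("need one acoustic feature vector per token")
    return [embed(t) + list(a) for t, a in zip(tokens, acoustic)]


# Made-up ASR output and acoustic features (pause in seconds, pitch in Hz).
tokens = ["how", "are", "you"]
acoustic = [[0.12, 180.0], [0.05, 175.0], [0.40, 210.0]]
inputs = build_inputs(tokens, acoustic)
```

Because tokens index a fixed number of buckets rather than a full vocabulary table, the embedding memory stays constant regardless of vocabulary size, which is one reason hashing is attractive on-device. The trade-off is that distinct words can collide in the same bucket.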