| Paper ID | HLT-12.2 |
| Paper Title | REPLACING HUMAN AUDIO WITH SYNTHETIC AUDIO FOR ON-DEVICE UNSPOKEN PUNCTUATION PREDICTION |
| Authors | Daria Soboleva, Ondrej Skopek, Márius Šajgalík, Victor Cărbune, Felix Weissenberger, Julia Proskurnia, Bogdan Prisacari, Daniel Valcarce, Justin Lu, Rohit Prabhavalkar, Balint Miklos, Google, Switzerland |
| Session | HLT-12: Language Understanding 4: Semantic Understanding |
| Location | Gather.Town |
| Session Time | Thursday, 10 June, 13:00 - 13:45 |
| Presentation Time | Thursday, 10 June, 13:00 - 13:45 |
| Presentation | Poster |
| Topic | Human Language Technology: [HLT-UNDE] Spoken Language Understanding and Computational Semantics |
| Abstract | We present a novel multi-modal unspoken punctuation prediction system for English that combines acoustic and text features. We demonstrate, for the first time, that by relying exclusively on synthetic data generated with a prosody-aware text-to-speech system, we can outperform a model trained on expensive human audio recordings for unspoken punctuation prediction. Our model architecture is well suited for on-device use: it leverages hash-based embeddings of automatic speech recognition text output in conjunction with acoustic features as input to a quasi-recurrent neural network, keeping the model size small and latency low. |
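
The abstract names the architecture only at a high level. As a rough illustration of the pieces it mentions (hash-based token embeddings concatenated with acoustic features, fed to a quasi-recurrent neural network), below is a minimal PyTorch sketch. This is not the authors' implementation: the bucket count, feature dimensions, punctuation class set, and all identifiers (`hash_token`, `QRNNLayer`, `PunctuationPredictor`) are illustrative assumptions.

```python
import zlib
import torch
import torch.nn as nn


def hash_token(token: str, num_buckets: int = 4096) -> int:
    """Map a token to a bucket id without storing a vocabulary table;
    hashing like this is one way to keep on-device embedding tables small."""
    return zlib.crc32(token.encode("utf-8")) % num_buckets


class QRNNLayer(nn.Module):
    """Quasi-recurrent layer (Bradbury et al., 2017): a causal convolution
    computes all gates in parallel; only the pooling step is sequential."""
    def __init__(self, input_dim: int, hidden_dim: int, kernel_size: int = 2):
        super().__init__()
        # One convolution produces update (z), forget (f), output (o) gates.
        self.conv = nn.Conv1d(input_dim, 3 * hidden_dim, kernel_size,
                              padding=kernel_size - 1)
        self.hidden_dim = hidden_dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim); trim the conv output to keep causality.
        g = self.conv(x.transpose(1, 2))[..., :x.size(1)]
        z, f, o = g.transpose(1, 2).chunk(3, dim=-1)
        z, f, o = torch.tanh(z), torch.sigmoid(f), torch.sigmoid(o)
        c = torch.zeros(x.size(0), self.hidden_dim, device=x.device)
        hs = []
        for t in range(x.size(1)):  # fo-pooling: the only sequential part
            c = f[:, t] * c + (1 - f[:, t]) * z[:, t]
            hs.append(o[:, t] * c)
        return torch.stack(hs, dim=1)  # (batch, time, hidden_dim)


class PunctuationPredictor(nn.Module):
    """Hypothetical multi-modal punctuation model: hashed token ids are
    embedded, concatenated with per-token acoustic features, and passed
    through a QRNN to a per-token punctuation classifier."""
    def __init__(self, num_hash_buckets: int = 4096, embed_dim: int = 64,
                 acoustic_dim: int = 16, hidden_dim: int = 128,
                 num_classes: int = 4):  # e.g. none, comma, period, question
        super().__init__()
        self.embed = nn.Embedding(num_hash_buckets, embed_dim)
        self.qrnn = QRNNLayer(embed_dim + acoustic_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, hashed_tokens: torch.Tensor,
                acoustic: torch.Tensor) -> torch.Tensor:
        # hashed_tokens: (batch, time) int64; acoustic: (batch, time, acoustic_dim)
        x = torch.cat([self.embed(hashed_tokens), acoustic], dim=-1)
        return self.out(self.qrnn(x))  # per-token punctuation logits


# Example: a batch of 2 utterances, 10 tokens each, 16-dim acoustic features.
model = PunctuationPredictor()
tokens = torch.randint(0, 4096, (2, 10))
acoustic = torch.randn(2, 10, 16)
logits = model(tokens, acoustic)  # shape: (2, 10, num_classes)
```

The design choice the sketch tries to reflect is the one the abstract emphasizes: hashing replaces a large vocabulary embedding table, and the QRNN's convolutional gating keeps most computation parallel, both of which help keep model size and latency low on device.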