2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information


Paper Detail

Paper ID SPE-2.3
Paper Title SIMPLEFLAT: A SIMPLE WHOLE-NETWORK PRE-TRAINING APPROACH FOR RNN TRANSDUCER-BASED END-TO-END SPEECH RECOGNITION
Authors Takafumi Moriya, Takanori Ashihara, Tomohiro Tanaka, Tsubasa Ochiai, Hiroshi Sato, Atsushi Ando, Yusuke Ijima, Ryo Masumura, Yusuke Shinohara, NTT Corporation, Japan
Session SPE-2: Speech Recognition 2: Neural transducer Models 2
Location Gather.Town
Session Time: Tuesday, 08 June, 13:00 - 13:45
Presentation Time: Tuesday, 08 June, 13:00 - 13:45
Presentation Poster
Topic Speech Processing: [SPE-LVCR] Large Vocabulary Continuous Recognition/Search
Abstract The recurrent neural network-transducer (RNN-T) is promising for building time-synchronous end-to-end automatic speech recognition (ASR) systems, in part because it does not need frame-wise alignment between input features and target labels during training. Although training without alignment is beneficial, it makes it difficult for the model to discern the relation between input features and output token sequences, which in effect degrades RNN-T performance. Our solution is SimpleFlat (SF), a novel and simple whole-network pre-training approach for RNN-T. SF extracts frame-wise alignments on the fly from the training dataset and does not require any external resources. We repeat each target token so that the tokens are distributed evenly across the frames, matching the RNN-T encoder output length. The frame-wise tokens created in this way are also shifted and used as the prediction network inputs. SF can therefore be implemented with a cross-entropy loss computation, as in autoregressive model training. Experiments on Japanese and English ASR tasks demonstrate that SF can effectively improve various RNN-T architectures.
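
The alignment step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (flat_alignment, shift_right, cross_entropy) are hypothetical, the handling of frame counts that do not divide evenly by the number of tokens is an assumption (the abstract does not say how the remainder is distributed), and the start symbol ID of 0 is likewise assumed.

```python
import math


def flat_alignment(tokens, num_frames):
    """Repeat each target token so the result has one token per frame.

    Assumption: when num_frames is not a multiple of len(tokens), frames
    are mapped to tokens by integer scaling, so repeat counts differ by
    at most one. The abstract does not specify this detail.
    """
    if not tokens or num_frames < len(tokens):
        raise ValueError("need at least one frame per target token")
    n = len(tokens)
    return [tokens[t * n // num_frames] for t in range(num_frames)]


def shift_right(frame_tokens, start_id=0):
    """Shift frame-wise tokens by one step to form prediction network inputs.

    Assumption: start_id (0 here) marks the beginning of the sequence.
    """
    return [start_id] + frame_tokens[:-1]


def cross_entropy(logits, targets):
    """Mean frame-wise cross entropy from unnormalized per-frame scores."""
    total = 0.0
    for row, y in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[y]
    return total / len(targets)


# Three target tokens spread over seven encoder output frames.
targets = flat_alignment([5, 9, 4], num_frames=7)  # [5, 5, 5, 9, 9, 4, 4]
inputs = shift_right(targets)                      # [0, 5, 5, 5, 9, 9, 4]
```

With frame-wise targets and shifted inputs of equal length, each frame has exactly one label, so an ordinary per-frame cross-entropy loss suffices during pre-training. In this sketch, logits would hold one score vector per frame.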