| Paper ID | SPE-39.5 |
| Paper Title | Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks |
| Authors | Shoukang Hu, Xurong Xie, Shansong Liu, Mingyu Cui, Mengzhe Geng, Xunying Liu, Helen Meng, The Chinese University of Hong Kong, Hong Kong SAR China |
| Session | SPE-39: Speech Recognition 13: Acoustic Modeling 1 |
| Location | Gather.Town |
| Session Time | Thursday, 10 June, 15:30 - 16:15 |
| Presentation Time | Thursday, 10 June, 15:30 - 16:15 |
| Presentation | Poster |
| Topic | Speech Processing: [SPE-RECO] Acoustic Modeling for Automatic Speech Recognition |
  
	
  
  
  
| Abstract | Deep neural network (DNN) based automatic speech recognition (ASR) systems are often designed using expert knowledge and empirical evaluation. In this paper, a range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of state-of-the-art factored time delay neural networks (TDNNs): i) the left and right splicing context offsets; and ii) the dimensionality of the bottleneck linear projection at each hidden layer. These techniques include the DARTS method, which integrates architecture selection with lattice-free MMI (LF-MMI) TDNN training; Gumbel-Softmax and pipelined DARTS, which reduce the confusion over candidate architectures and improve the generalization of architecture selection; and Penalized DARTS, which incorporates resource constraints to adjust the trade-off between performance and system complexity. Parameter sharing among candidate architectures allows efficient search over up to $7^{28}$ different TDNN systems. Experiments conducted on the 300-hour Switchboard corpus suggest that the auto-configured systems consistently outperform the baseline LF-MMI TDNN systems using manual network design or random architecture search after LHUC speaker adaptation and RNNLM rescoring. Absolute word error rate (WER) reductions of up to 1.0% and a relative model size reduction of 28% were obtained. Consistent performance improvements were also obtained on a UASpeech disordered speech recognition task using the proposed NAS approaches. |
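
The selection mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the generic DARTS formulation, in which a softmax over learnable architecture weights mixes the candidate choices, a Gumbel-Softmax variant that adds Gumbel noise and a temperature to sharpen that distribution, and a penalty term of the form `task_loss + eta * expected_cost`. The candidate dimensions, the cost values, `eta`, and all function names are hypothetical. The last helper only reproduces the arithmetic behind a search space of size $7^{28}$, on the assumption that it arises from 28 independent decisions with 7 candidates each.

```python
import math
import random

# Hypothetical candidate bottleneck dimensions for one hidden layer
# (illustrative values, not taken from the paper).
CANDIDATE_DIMS = [25, 50, 80, 100, 120, 160, 200]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, temperature, rng):
    """Softmax over logits perturbed with Gumbel(0, 1) noise."""
    noisy = []
    for x in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        noisy.append(x - math.log(-math.log(u)))
    return softmax(noisy, temperature)


def expected_cost(weights, costs):
    """Expected resource cost under the architecture distribution."""
    return sum(w * c for w, c in zip(weights, costs))


def penalized_loss(task_loss, weights, costs, eta):
    """Task loss plus a resource penalty scaled by eta (assumed form)."""
    return task_loss + eta * expected_cost(weights, costs)


def search_space_size(choices_per_decision, num_decisions):
    """Number of distinct systems if every decision is independent."""
    return choices_per_decision ** num_decisions


if __name__ == "__main__":
    rng = random.Random(0)
    # Architecture weights for one layer; index 3 is slightly preferred.
    logits = [0.0] * len(CANDIDATE_DIMS)
    logits[3] = 1.5

    plain = softmax(logits)
    sampled = gumbel_softmax(logits, temperature=0.5, rng=rng)
    chosen = CANDIDATE_DIMS[plain.index(max(plain))]

    print("softmax weights:", [round(w, 3) for w in plain])
    print("gumbel-softmax sample:", [round(w, 3) for w in sampled])
    print("selected bottleneck dimension:", chosen)
    print("penalized loss:", penalized_loss(1.0, plain, CANDIDATE_DIMS, 0.001))
    print("search space size:", search_space_size(7, 28))
```

In this sketch, lowering the temperature pushes the weights toward a one-hot choice, which is one way to read "reducing the confusion over candidate architectures", while a larger `eta` shifts the preferred candidate toward cheaper options.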