| Paper ID | SPE-41.5 |
| Paper Title | MARBLENET: DEEP 1D TIME-CHANNEL SEPARABLE CONVOLUTIONAL NEURAL NETWORK FOR VOICE ACTIVITY DETECTION |
| Authors | Fei Jia, Somshubra Majumdar, Boris Ginsburg, NVIDIA Corporation, United States |
| Session | SPE-41: Voice Activity and Disfluency Detection |
| Location | Gather.Town |
| Session Time | Thursday, 10 June, 15:30 - 16:15 |
| Presentation Time | Thursday, 10 June, 15:30 - 16:15 |
| Presentation | Poster |
| Topic | Speech Processing: [SPE-VAD] Voice Activity Detection and End-pointing |
  
	
| Abstract | We present MarbleNet, an end-to-end neural network for Voice Activity Detection (VAD). MarbleNet is a deep residual network composed of blocks of 1D time-channel separable convolution, batch normalization, ReLU, and dropout layers. Compared to a state-of-the-art VAD model, MarbleNet achieves similar performance with roughly one-tenth the parameter cost. We further conduct extensive ablation studies on different training methods and parameter choices to study the robustness of MarbleNet in real-world VAD tasks. |
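
The abstract's "1D time-channel separable convolution" can be read as a depthwise convolution along time followed by a pointwise (kernel size 1) convolution across channels. The sketch below compares the weight count of such a layer with that of a standard 1D convolution. The function names, channel counts, and kernel size are illustrative assumptions, not values from the paper, and biases are ignored. It does not reproduce the one-tenth figure in the abstract, which compares MarbleNet against a different model rather than against a standard convolution.

```python
def standard_conv1d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights in a standard 1D convolution (bias ignored)."""
    # Every output channel has one kernel of length kernel_size per input channel.
    return in_channels * out_channels * kernel_size


def separable_conv1d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights in a depthwise + pointwise 1D convolution pair (bias ignored)."""
    # Depthwise step: one kernel of length kernel_size per input channel (time axis).
    depthwise = in_channels * kernel_size
    # Pointwise step: a kernel-size-1 convolution that mixes channels.
    pointwise = in_channels * out_channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer shape chosen only for illustration.
    c_in, c_out, k = 64, 128, 5
    standard = standard_conv1d_params(c_in, c_out, k)
    separable = separable_conv1d_params(c_in, c_out, k)
    print(f"standard:  {standard} weights")
    print(f"separable: {separable} weights")
    print(f"ratio:     {standard / separable:.2f}x")
```

For wide layers the pointwise term dominates the separable count, so the saving over a standard convolution approaches a factor equal to the kernel size.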