2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information

2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information
Login Paper Search My Schedule Paper Index Help

My ICASSP 2021 Schedule

Note: Your custom schedule will not be saved unless you create a new account or login to an existing account.
  1. Create a login based on your email (takes less than one minute)
  2. Perform 'Paper Search'
  3. Select papers that you desire to save in your personalized schedule
  4. Click on 'My Schedule' to see the current list of selected papers
  5. Click on 'Printable Version' to create a separate window suitable for printing (the header and menu will appear, but will not actually print)

Paper Detail

Paper IDSPE-19.2
Paper Title GRAPH ATTENTION NETWORKS FOR SPEAKER VERIFICATION
Authors Jee-weon Jung, University of Seoul, South Korea; Hee-Soo Heo, Naver Corporation, South Korea; Ha-Jin Yu, University of Seoul, South Korea; Joon Son Chung, Naver Corporation, South Korea
SessionSPE-19: Speaker Recognition 3: Attention and Adversarial
LocationGather.Town
Session Time:Wednesday, 09 June, 14:00 - 14:45
Presentation Time:Wednesday, 09 June, 14:00 - 14:45
Presentation Poster
Topic Speech Processing: [SPE-SPKR] Speaker Recognition and Characterization
IEEE Xplore Open Preview  Click here to view in IEEE Xplore
Abstract This work presents a novel back-end framework for speaker verification using graph attention networks. Segment-wise speaker embeddings extracted from multiple crops within an utterance are interpreted as node representations of a graph. The proposed framework inputs segment-wise speaker embeddings from an enrollment and a test utterance and directly outputs a similarity score. We first construct a graph using segment-wise speaker embeddings and then input these to graph attention networks. After a few graph attention layers with residual connections, each node is projected into a one-dimensional space using affine transform, followed by a readout operation resulting in a scalar similarity score. To enable successful adaptation for speaker verification, we propose techniques such as separating trainable weights for attention map calculations between segment-wise speaker embeddings from different utterances. The effectiveness of the proposed framework is validated using three different speaker embedding extractors trained with different architectures and objective functions. Experimental results demonstrate consistent improvement over various baseline back-end classifiers, with an average equal error rate improvement of 20% over the cosine similarity back-end without test time augmentation.