2021 IEEE International Conference on Acoustics, Speech and Signal Processing

6-11 June 2021 • Toronto, Ontario, Canada

Extracting Knowledge from Information



Paper Detail

Paper ID: SPE-38.5
Paper Title: IMPROVING RECONSTRUCTION LOSS BASED SPEAKER EMBEDDING IN UNSUPERVISED AND SEMI-SUPERVISED SCENARIOS
Authors: Jaejin Cho, Piotr Zelasko, Center for Language Speech Processing at Johns Hopkins University, United States; Jesús Villalba, Johns Hopkins University, United States; Najim Dehak, Center for Language Speech Processing at Johns Hopkins University, United States
Session: SPE-38: Speaker Recognition 6: Self-supervised and Unsupervised Learning
Location: Gather.Town
Session Time: Thursday, 10 June, 14:00 - 14:45
Presentation Time: Thursday, 10 June, 14:00 - 14:45
Presentation: Poster
Topic: Speech Processing: [SPE-SPKR] Speaker Recognition and Characterization
Abstract: Text-to-speech (TTS) models trained to minimize the spectrogram reconstruction loss can learn speaker embeddings without explicit speaker identity supervision, unlike x-vector speaker identification (SID) systems. Leveraging this way of learning speaker embeddings can be useful in unsupervised/semi-supervised scenarios where none, or only some, of the training data have speaker labels. Thus, in this paper, we evaluate speaker embeddings learned by training the spectrogram prediction network under unsupervised/semi-supervised scenarios. We experimented with different data sampling strategies. The best one samples two different segments from the same utterance, namely A and B, where the spectrogram of B is predicted given the B phone sequence and the speaker embedding extracted from A. This method improved EER by 3.4% relative, compared to using the same utterance for both A and B without segmenting. In the unsupervised scenario, the best speaker embedding outperformed i-vectors, the state-of-the-art unsupervised speaker embedding, by 12.9% relative in EER on speaker verification. We observed a high correlation between reconstruction loss and speaker embedding quality. In the semi-supervised scenario, having more unlabeled data in training led to better speaker verification performance: adding 5314 unlabeled speakers to 800 labeled speakers improved EER by 10.8% relative.
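
The segment-pairing strategy described in the abstract can be sketched as follows. This is a minimal illustration only, assuming frame-level phone labels and an in-memory log-mel spectrogram per utterance; the function names (sample_segment_pair, reconstruction_loss_step, speaker_encoder, tts_decoder) are hypothetical stand-ins, not taken from the authors' code.

```python
import numpy as np

def sample_segment_pair(spectrogram, phones, min_frames=100, rng=None):
    """Sample two different segments (A, B) from one utterance.

    Segment A supplies only the reference spectrogram from which the
    speaker embedding is extracted; segment B supplies the phone-aligned
    target spectrogram that the spectrogram prediction network must
    reconstruct. Frame-level phone labels are assumed (one per frame).
    """
    rng = rng or np.random.default_rng()
    n_frames = spectrogram.shape[0]
    if n_frames < 2 * min_frames:
        # Utterance too short to split: fall back to using the whole
        # utterance for both A and B (the weaker baseline in the abstract).
        return spectrogram, spectrogram, phones

    # Cut the utterance at a random point so A and B never overlap.
    cut = int(rng.integers(min_frames, n_frames - min_frames + 1))
    seg_a, seg_b, phones_b = spectrogram[:cut], spectrogram[cut:], phones[cut:]

    # Randomly swap which half plays the role of A (reference) vs. B (target).
    if rng.random() < 0.5:
        seg_a, seg_b, phones_b = spectrogram[cut:], spectrogram[:cut], phones[:cut]
    return seg_a, seg_b, phones_b


def reconstruction_loss_step(speaker_encoder, tts_decoder, spectrogram, phones):
    """One hypothetical training step: the speaker encoder sees segment A,
    while the decoder reconstructs segment B from B's phone sequence."""
    seg_a, seg_b, phones_b = sample_segment_pair(spectrogram, phones)
    spk_embedding = speaker_encoder(seg_a)
    predicted_b = tts_decoder(phones_b, spk_embedding)
    return np.mean((predicted_b - seg_b) ** 2)  # spectrogram reconstruction loss
```

Because the speaker embedding is extracted from a segment other than the one being reconstructed, the encoder is pushed to capture speaker characteristics rather than segment-specific content, which is consistent with the relative EER improvement the abstract reports over using the same unsegmented utterance for both roles.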