From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint

Paper: Accepted by Interspeech 2020

Code: github

Authors: Zexin Cai, Chuxiong Zhang, Ming Li

Abstract: High-fidelity speech can be synthesized by end-to-end text-to-speech models in recent years. However, accessing and controlling speech attributes such as speaker identity, prosody, and emotion in a text-to-speech system remains a challenge. This paper presents a system involving feedback constraint for multispeaker speech synthesis. We manage to enhance the knowledge transfer from the speaker verification to the speech synthesis by engaging the speaker verification network. The constraint is taken by an added loss related to the speaker identity, which is centralized to improve the speaker similarity between the synthesized speech and its natural reference audio. The model is trained and evaluated on publicly available datasets. Experimental results, including visualization on speaker embedding space, show significant improvement in terms of speaker identity cloning in the spectrogram level.
Ground truth Baseline Baseline - Text Indep FC - Text Dep FC - Text Indep
VCTK Test Set
p225: we have seen a copy of the report.
p231: tourism is a vital industry for scotland.
p340: He had been surprised by the local reaction.
p360: There is an ongoing problem with this company.
p361: Cost is a factor, but not the only one.
p363: We are not commenting at the moment.
p364: I love you so very much.
p376: It is a crisis of human rights, a crisis of leadership.
Visualization on speaker embedding space

Speaker embedding visualization for VCTK test set. (a) baseline & text-dependent speaker embedding; (b) FC & text-dependent speaker embedding; (c) baseline & text-independent speaker embedding; (d) FC & text-independent speaker embedding;

Ground truth Baseline Baseline - Text Indep FC - Text Dep FC - Text Indep
VCTK Validation set
p259: It was not hard to feel some sympathy for Baxter yesterday.
p288: However, Dundee deserved to win this game.
p314: We have no choice but to shut down.
p374: But every two minutes it was exploding.
Ground truth Baseline Baseline - Text Indep FC - Text Dep FC - Text Indep
Cross-domain: Librispeech
1322: This Damosel loved another knight that held her to paramour
1343: Those lawyers have given up all idea of proceeding against her.
2053: If so, Kate also should be included in the punishment.
2085: And he succeeded in selling them to a newly arrived German.
2769: He led the way to the front room suddenly he stopped.
6115: The fact that in some places successive terraces are found does not disprove the theory.
7245: But we stood whispering in the house and all we said was saved.

Speaker embedding visualization for Librispeech test set. (a) baseline & text-dependent speaker embedding; (b) FC & text-dependent speaker embedding; (c) baseline & text-independent speaker embedding; (d) FC & text-independent speaker embedding;

Cross-lingual Voice Cloning
Finnish Female Finnish Male German Female German Male Mandarin Female Mandarin Male
Reference audios
Synthesized Results
The problems I mentioned are not insurmountable.
There is a connection between all of this.
We'll get to the new product line later.