Cross-lingual Speech Synthesis

Modeling voices for multiple speakers and multiple languages

References

2023

Cross-Lingual Multi-Speaker Speech Synthesis with Limited Bilingual Training Data

Zexin Cai, Yaogen Yang, and Ming Li

Computer Speech & Language, 2023


Modeling voices for multiple speakers and multiple languages with one speech synthesis system has long been a challenge, especially in low-resource cases. This paper presents two approaches to cross-lingual multi-speaker text-to-speech (TTS) and code-switching synthesis under two training scenarios: (1) cross-lingual synthesis with sufficient data and (2) cross-lingual synthesis with limited data per speaker. Accordingly, a novel TTS model and a non-autoregressive multi-speaker voice conversion model are proposed. The TTS model, designed for the sufficient-data case, has a Tacotron-based structure that uses shared phonemic representations associated with numeric language ID codes. For the data-limited scenario, we adopt a framework that cascades several speech modules. In particular, we propose a non-autoregressive many-to-many voice conversion module to address multi-speaker synthesis in data-insufficient cases. Experimental results on speaker similarity show that the proposed voice conversion module maintains voice characteristics well in data-limited cases. Both approaches use limited bilingual data and demonstrate impressive performance in cross-lingual synthesis, delivering fluent foreign speech and even code-switching speech for monolingual speakers.
@article{CAI2023cross,
  title = {Cross-Lingual Multi-Speaker Speech Synthesis with Limited Bilingual Training Data},
  journal = {Computer Speech & Language},
  volume = {77},
  pages = {101427},
  year = {2023},
  issn = {0885-2308},
  doi = {https://doi.org/10.1016/j.csl.2022.101427},
  author = {Cai, Zexin and Yang, Yaogen and Li, Ming},
}