Abstract
Modeling voices for multiple speakers and multiple languages in a single speech synthesis system has long been a challenge, especially in low-resource cases. This paper presents two approaches to cross-lingual multi-speaker text-to-speech (TTS) and code-switching synthesis under two training scenarios: 1) cross-lingual synthesis with sufficient data, and 2) cross-lingual synthesis with limited data per speaker. Accordingly, a novel TTS model and a non-autoregressive multi-speaker voice conversion model are proposed. The TTS model, designed for the sufficient-data scenario, has a Tacotron-based structure that uses shared phonemic representations associated with language tokens. For the data-limited scenario, we adopt a framework that cascades several speech modules. In particular, we propose a non-autoregressive many-to-many voice conversion module to address multi-speaker synthesis when data are insufficient. Both approaches use limited bilingual data and demonstrate impressive performance in cross-lingual synthesis: they can generate fluent foreign speech and even code-switching speech for monolingual speakers. Moreover, experimental results on speaker similarity show that the proposed voice conversion module maintains voice characteristics well in data-limited cases.