Abstract


Modeling the voices of multiple speakers across multiple languages in a single speech synthesis system has long been a challenge, especially in low-resource cases. This paper presents two approaches to cross-lingual multi-speaker text-to-speech (TTS) and code-switching synthesis, one for each of two training scenarios: 1) cross-lingual synthesis with sufficient data, and 2) cross-lingual synthesis with limited data per speaker. Accordingly, we propose a novel TTS model and a non-autoregressive multi-speaker voice conversion model. The TTS model, designed for the data-sufficient scenario, has a Tacotron-based structure that uses shared phonemic representations associated with language tokens. For the data-limited scenario, we adopt a framework that cascades several speech modules; in particular, we propose a non-autoregressive many-to-many voice conversion module to handle multi-speaker synthesis when data are insufficient. Both approaches use limited bilingual data and achieve strong performance in cross-lingual synthesis: both can generate fluent foreign-language speech, and even code-switching speech, for monolingual speakers. Moreover, experimental results on speaker similarity show that the proposed voice conversion module preserves voice characteristics well in data-limited cases.

Original Voices


Data-sufficient: DB1, LJS, DB4-Mandarin, DB4-English, DB4-Code-switching
Data-insufficient: VCTK-P232, VCTK-P268, AISHELL-SSB0011, AISHELL-SSB0407

English Sentence Synthesis


Text CLMS (data-sufficient) BLMS CLMS (data-insufficient) Voice Conversion
Their solution requires development
of the human capacity for social interest.
DB4
LJS
DB1
LJS
DB1
DB4
P232
P268
P232
P268
SSB0011
SSB0407
His most significant scientific publications
were studies of birds and animals.
DB4
LJS
DB1
LJS
DB1
DB4
P232
P268
P232
P268
SSB0011
SSB0407
They established royal commissions to
recover illegally held church lands.
DB4
LJS
DB1
LJS
DB1
DB4
P232
P268
P232
P268
SSB0011
SSB0407
Unfortunately, others separate
on the basis of accumulated hatred.
DB4
LJS
DB1
LJS
DB1
DB4
P232
P268
P232
P268
SSB0011
SSB0407

Mandarin Sentence Synthesis


Text CLMS (data-sufficient) BLMS CLMS (data-insufficient) Voice Conversion
建筑设计师莱伊恩受命
设计了英国温泽市政府大厅
DB4
LJS
DB1
LJS
DB1
DB4
SSB0011
SSB0407
P232
P268
SSB0011
SSB0407
现在呢有很多朋友
都喜欢打游戏
DB4
LJS
DB1
LJS
DB1
DB4
SSB0011
SSB0407
P232
P268
SSB0011
SSB0407
堪忧本来就是令人担忧的意思
DB4
LJS
DB1
LJS
DB1
DB4
SSB0011
SSB0407
P232
P268
SSB0011
SSB0407
这个时候有两个身影
就向着火场逆行
DB4
LJS
DB1
LJS
DB1
DB4
SSB0011
SSB0407
P232
P268
SSB0011
SSB0407

Code-switching Sentence Synthesis


Text CLMS (data-sufficient) BLMS CLMS (data-insufficient) Voice Conversion
用UC浏览器搜索
I believe I can fly
(Use the UC browser to search for
"I believe I can fly")
DB4
LJS
DB1
LJS
DB1
DB4
P232
P268
SSB0011
SSB0407
我那个 company culture 其实
就是 everyone help each other
(Our company culture is actually that
everyone helps each other)
DB4
LJS
DB1
LJS
DB1
DB4
P232
P268
SSB0011
SSB0407
most of the time
就是去玩一下
(Most of the time,
we just hang around)
DB4
LJS
DB1
LJS
DB1
DB4
P232
P268
SSB0011
SSB0407
其实我很难判断 in my heart
i think my chinese is better but people
tell me that my english 是比较好
(Actually, it's hard for me to tell. In my heart,
I think my Chinese is better, but people
tell me that my English is better)
DB4
LJS
DB1
LJS
DB1
DB4
P232
P268
SSB0011
SSB0407