Cross-lingual Multi-speaker Speech Synthesis Samples

Abstract

Modeling voices for multiple speakers and multiple languages with one speech synthesis system has long been a challenge, especially in low-resource cases. This paper presents two approaches to cross-lingual multi-speaker text-to-speech (TTS) and code-switching synthesis under two training scenarios: 1) cross-lingual synthesis with sufficient data, and 2) cross-lingual synthesis with limited data per speaker. Accordingly, a novel TTS synthesis model and a non-autoregressive multi-speaker voice conversion model are proposed. The TTS model designed for the data-sufficient case has a Tacotron-based structure that uses shared phonemic representations associated with language tokens. For the data-limited scenario, we adopt a framework that cascades several speech modules. In particular, we propose a non-autoregressive many-to-many voice conversion module to address multi-speaker synthesis when data are insufficient. Both approaches use limited bilingual data and perform well in cross-lingual synthesis: both can generate fluent foreign speech, and even code-switching speech, for monolingual speakers. Moreover, experimental results on speaker similarity show that the proposed voice conversion module maintains voice characteristics well in data-limited cases.
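As a rough illustration of what "language tokens" attached to the input could look like for a code-switching sentence, the sketch below splits mixed Mandarin/English text into same-language runs and pairs each run with a token. This is only a sketch under stated assumptions: the token strings `<zh>` and `<en>`, the helper names `char_lang` and `tag_segments`, and the character-range heuristic are all hypothetical and not taken from the paper, which operates on phonemic representations rather than raw characters.

```python
import re

# Hypothetical language tokens; the paper does not specify their form.
LANG_TOKENS = {"zh": "<zh>", "en": "<en>"}

# Basic CJK Unified Ideographs block, used here as a simple Mandarin heuristic.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def char_lang(ch):
    """Return 'zh' for CJK ideographs, 'en' for ASCII letters, else None."""
    if _CJK.match(ch):
        return "zh"
    if ch.isascii() and ch.isalpha():
        return "en"
    return None


def tag_segments(text):
    """Split text into same-language runs and pair each with a language token.

    Characters with no language of their own (spaces, digits, punctuation)
    stay attached to the run they appear in.
    """
    segments = []
    current_lang, buf = None, []
    for ch in text:
        lang = char_lang(ch)
        # A language change closes the current run and starts a new one.
        if lang is not None and current_lang is not None and lang != current_lang:
            segments.append((LANG_TOKENS[current_lang], "".join(buf).strip()))
            buf = []
        if lang is not None:
            current_lang = lang
        buf.append(ch)
    if buf and current_lang is not None:
        segments.append((LANG_TOKENS[current_lang], "".join(buf).strip()))
    return segments


if __name__ == "__main__":
    for token, segment in tag_segments("most of the time 就是去玩一下"):
        print(token, segment)
```

In a real system each tagged run would then pass through a language-specific grapheme-to-phoneme step before reaching the synthesizer; that step is omitted here.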

Original Voices

Data-sufficient

Data-insufficient

DB1
LJS
DB4-Mandarin
DB4-English
DB4-Code-switching

VCTK-P232
VCTK-P268
AISHELL-SSB0011
AISHELL-SSB0407

English Sentence Synthesis

Text

CLMS (data-sufficient)

BLMS

CLMS (data-insufficient)

Voice Conversion

Their solution requires development
of the human capacity for social interest.

DB4
LJS
DB1

LJS
DB1

DB4
P232
P268

P232
P268
SSB0011
SSB0407

His most significant scientific publications
were studies of birds and animals.

DB4
LJS
DB1

LJS
DB1

DB4
P232
P268

P232
P268
SSB0011
SSB0407

They established royal commissions to
recover illegally held church lands.

DB4
LJS
DB1

LJS
DB1

DB4
P232
P268

P232
P268
SSB0011
SSB0407

Unfortunately, others separate
on the basis of accumulated hatred.

DB4
LJS
DB1

LJS
DB1

DB4
P232
P268

P232
P268
SSB0011
SSB0407

Mandarin Sentence Synthesis

Text

CLMS (data-sufficient)

BLMS

CLMS (data-insufficient)

Voice Conversion

建筑设计师莱伊恩受命
设计了英国温泽市政府大厅

DB4
LJS
DB1

LJS
DB1

DB4
SSB0011
SSB0407

P232
P268
SSB0011
SSB0407

现在呢有很多朋友
都喜欢打游戏

DB4
LJS
DB1

LJS
DB1

DB4
SSB0011
SSB0407

P232
P268
SSB0011
SSB0407

堪忧本来就是令人担忧的意思

DB4
LJS
DB1

LJS
DB1

DB4
SSB0011
SSB0407

P232
P268
SSB0011
SSB0407

这个时候有两个身影
就向着火场逆行

DB4
LJS
DB1

LJS
DB1

DB4
SSB0011
SSB0407

P232
P268
SSB0011
SSB0407

Code-switching Sentence Synthesis

Text

CLMS (data-sufficient)

BLMS

CLMS (data-insufficient)

Voice Conversion

用UC浏览器搜索
I believe I can fly

(Use the UC browser to search
I believe I can fly)

DB4
LJS
DB1

LJS
DB1

DB4

P232
P268
SSB0011
SSB0407

我那个 company culture 其实
就是 everyone help each other

(Our company culture is that
everyone helps each other)

DB4
LJS
DB1

LJS
DB1

DB4

P232
P268
SSB0011
SSB0407

most of the time
就是去玩一下

(Most of the time,
we just hang around)

DB4
LJS
DB1

LJS
DB1

DB4

P232
P268
SSB0011
SSB0407

其实我很难判断 in my heart
i think my chinese is better but people
tell me that my english 是比较好

(Actually, it's hard for me to tell. In my heart,
I think my Chinese is better, but people
tell me that my English is better)

DB4
LJS
DB1

LJS
DB1

DB4

P232
P268
SSB0011
SSB0407