Zexin Cai (蔡泽鑫)

Zexin Cai is a postdoctoral research fellow at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University, advised by Matthew and Nicholas. He received his PhD in Electrical and Computer Engineering from Duke University in 2023, supervised by Prof. Ming Li and Prof. Xin Li. His research interests include text-to-speech synthesis, voice conversion, and audio deepfake detection. Prior to joining Duke, Zexin earned his Bachelor’s degree in Software Engineering from Sun Yat-sen University and served as a research assistant at Duke Kunshan University. During his PhD studies, he completed an internship as an Applied Research Scientist at Microsoft. Zexin has contributed to various publications, with papers presented at ICASSP, Interspeech, and in the journal Computer Speech & Language.

Selected Publications

GenVC: Self-Supervised Zero-Shot Voice Conversion

Zexin Cai, Henry Li Xinyuan, Ashi Grag, and 5 more authors

In IEEE Workshop on Automatic Speech Recognition and Understanding, 2025

Abs Bib Paper Website

Most current zero-shot voice conversion methods rely on externally supervised components, particularly speaker encoders, for training. To explore alternatives that eliminate this dependency, this paper introduces GenVC, a novel framework that disentangles speaker identity and linguistic content from speech signals in a self-supervised manner. GenVC leverages speech tokenizers and an autoregressive, Transformer-based language model as its backbone for speech generation. This design supports large-scale training while enhancing both source speaker privacy protection and target speaker cloning fidelity. Experimental results demonstrate that GenVC achieves notably higher speaker similarity, with naturalness on par with leading zero-shot approaches. Moreover, due to its autoregressive formulation, GenVC introduces flexibility in temporal alignment, reducing the preservation of source prosody and speaker-specific traits, and making it highly effective for voice anonymization.
@inproceedings{CAI2025genVC, author = {Cai, Zexin and Li Xinyuan, Henry and Grag, Ashi and Garc\'ia-Perera, Leibny Paola and Duh, Kevin and Khudanpur, Sanjeev and Wiesner, Matthew and Andrews, Nicholas}, booktitle = {IEEE Workshop on Automatic Speech Recognition and Understanding}, title = {GenVC: Self-Supervised Zero-Shot Voice Conversion}, year = {2025}, volume = {}, number = {}, pages = {}, }
Privacy versus Emotion Preservation Trade-offs in Emotion-Preserving Speaker Anonymization

Zexin Cai, Henry Li Xinyuan, Ashi Grag, and 5 more authors

In IEEE Spoken Language Technology Workshop, 2024

Abs Bib Paper Website

Advances in speech technology now allow unprecedented access to personally identifiable information through speech. To protect such information, the differential privacy field has explored ways to anonymize speech while preserving its utility, including linguistic and paralinguistic aspects. However, anonymizing speech while maintaining emotional state remains challenging. We explore this problem in the context of the VoicePrivacy 2024 challenge. Specifically, we developed various speaker anonymization pipelines and find that approaches either excel at anonymization or preserving emotion state, but not both simultaneously. Achieving both would require an in-domain emotion recognizer. Additionally, we found that it is feasible to train a semi-effective speaker verification system using only emotion representations, demonstrating the challenge of separating these two modalities.
@inproceedings{CAI2024privacy, author = {Cai, Zexin and Li Xinyuan, Henry and Grag, Ashi and Garc\'ia-Perera, Leibny Paola and Duh, Kevin and Khudanpur, Sanjeev and Andrews, Nicholas and Wiesner, Matthew}, booktitle = {IEEE Spoken Language Technology Workshop}, title = {Privacy versus Emotion Preservation Trade-offs in Emotion-Preserving Speaker Anonymization}, year = {2024}, volume = {}, number = {}, pages = {409-414}, }
Integrating Frame-Level Boundary Detection and Deepfake Detection for Locating Manipulated Regions in Partially Spoofed Audio Forgery Attacks

Zexin Cai, and Ming Li

Computer Speech & Language, 2024

Abs Bib Paper

Partially fake audio, a variant of deep fake that involves manipulating audio utterances through the incorporation of fake or externally-sourced bona fide audio clips, constitutes a growing threat as an audio forgery attack impacting both human and artificial intelligence applications. Researchers have recently developed valuable databases to aid in the development of effective countermeasures against such attacks. While existing countermeasures mainly focus on identifying partially fake audio at the level of entire utterances or segments, this paper introduces a paradigm shift by proposing frame-level systems. These systems are designed to detect manipulated utterances and pinpoint the specific regions within partially fake audio where the manipulation occurs. Our approach leverages acoustic features extracted from large-scale self-supervised pre-training models, delivering promising results evaluated on diverse, publicly accessible databases. Additionally, we study the integration of boundary and deepfake detection systems, exploring their potential synergies and shortcomings. Importantly, our techniques have yielded impressive results. We have achieved state-of-the-art performance on the test dataset of the Track 2 of ADD 2022 challenge with an equal error rate of 4.4%. Furthermore, our methods exhibit remarkable performance in locating manipulated regions in Track 2 of the ADD 2023 challenge, resulting in a final ADD score of 0.6713 and securing the top position.
@article{CAI2024integrating, title = {Integrating Frame-Level Boundary Detection and Deepfake Detection for Locating Manipulated Regions in Partially Spoofed Audio Forgery Attacks}, journal = {Computer Speech & Language}, volume = {85}, pages = {101597}, year = {2024}, issn = {0885-2308}, doi = {https://doi.org/10.1016/j.csl.2023.101597}, author = {Cai, Zexin and Li, Ming}, }
Waveform Boundary Detection for Partially Spoofed Audio

Zexin Cai, Weiqing Wang, and Ming Li

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

Abs Bib Paper Code

The present paper proposes a waveform boundary detection system for audio spoofing attacks containing partially manipulated segments. Partially spoofed/fake audio, where part of the utterance is replaced, either with synthetic or natural audio clips, has recently been reported as one scenario of audio deepfakes. As deepfakes can be a threat to social security, the detection of such spoofing audio is essential. Accordingly, we propose to address the problem with a deep learning-based frame-level detection system that can detect partially spoofed audio and locate the manipulated pieces. Our proposed method is trained and evaluated on data provided by the ADD2022 Challenge. We evaluate our detection model concerning various acoustic features and network configurations. As a result, our detection system achieves an equal error rate (EER) of 6.58% on the ADD2022 challenge test set, which is the best performance in partially spoofed audio detection systems that can locate manipulated clips.
@inproceedings{CAI2023waveform, author = {Cai, Zexin and Wang, Weiqing and Li, Ming}, booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, title = {Waveform Boundary Detection for Partially Spoofed Audio}, year = {2023}, volume = {}, number = {}, pages = {1-5}, doi = {10.1109/ICASSP49357.2023.10094774}, }
INVERTIBLE VOICE CONVERSION WITH PARALLEL DATA

Zexin Cai, and Ming Li

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

Abs Bib Paper Website

This paper introduces an innovative deep learning framework for parallel voice conversion to mitigate inherent risks associated with such systems. Our approach focuses on developing an invertible model capable of countering potential spoofing threats. Specifically, we present a conversion model that allows for the retrieval of source voices, thereby facilitating the identification of the source speaker. This framework is constructed using a series of invertible modules composed of affine coupling layers to ensure the reversibility of the conversion process. We conduct comprehensive training and evaluation of the proposed framework using parallel training data. Our experimental results reveal that this approach achieves comparable performance to non-invertible systems in voice conversion tasks. Notably, the converted outputs can be seamlessly reverted to the original source inputs using the same parameters employed during the forwarding process. This advancement holds considerable promise for elevating the security and reliability of voice conversion.
@inproceedings{CAI2024invertible, author = {Cai, Zexin and Li, Ming}, booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, title = {INVERTIBLE VOICE CONVERSION WITH PARALLEL DATA}, year = {2024}, volume = {}, number = {}, pages = {}, }
Cross-Lingual Multi-Speaker Speech Synthesis with Limited Bilingual Training Data

Zexin Cai, Yaogen Yang, and Ming Li

Computer Speech & Language, 2023

Abs Bib Paper Website

Modeling voices for multiple speakers and multiple languages with one speech synthesis system has been a challenge for a long time, especially in low-resource cases. This paper presents two approaches to achieve cross-lingual multi-speaker text-to-speech (TTS) and code-switching synthesis under two training scenarios: (1) cross-lingual synthesis with sufficient data, (2) cross-lingual synthesis with limited data per speaker. Accordingly, a novel TTS synthesis model and a non-autoregressive multi-speaker voice conversion model are proposed. The TTS model designed for sufficient-data cases has a Tacotron-based structure that uses shared phonemic representations associated with numeric language ID codes. As for the data-limited scenario, we adopt a framework cascading several speech modules to achieve our goal. In particular, we proposed a non-autoregressive many-to-many voice conversion module to address multi-speaker synthesis for data-insufficient cases. Experimental results on speaker similarity show that our proposed voice conversion module can maintain the voice characteristics well in data-limited cases. Both approaches use limited bilingual data and demonstrate impressive performance in cross-lingual synthesis, which can deliver fluent foreign speech and even code-switching speech for monolingual speakers.
@article{CAI2023cross, title = {Cross-Lingual Multi-Speaker Speech Synthesis with Limited Bilingual Training Data}, journal = {Computer Speech & Language}, volume = {77}, pages = {101427}, year = {2023}, issn = {0885-2308}, doi = {https://doi.org/10.1016/j.csl.2022.101427}, author = {Cai, Zexin and Yang, Yaogen and Li, Ming}, }
From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint

Zexin Cai, Chuxiong Zhang, and Ming Li

In Conference of the International Speech Communication Association (INTERSPEECH), 2020

Abs Bib Paper Code Website

High-fidelity speech can be synthesized by end-to-end text-to-speech models in recent years. However, accessing and controlling speech attributes such as speaker identity, prosody, and emotion in a text-to-speech system remains a challenge. This paper presents a system involving feedback constraints for multispeaker speech synthesis. We manage to enhance the knowledge transfer from the speaker verification to the speech synthesis by engaging the speaker verification network. The constraint is taken by an added loss related to the speaker identity, which is centralized to improve the speaker similarity between the synthesized speech and its natural reference audio. The model is trained and evaluated on publicly available datasets. Experimental results, including visualization on speaker embedding space, show significant improvement in terms of speaker identity cloning in the spectrogram level. In addition, synthesized samples are available online for listening.
@inproceedings{cai2020from, author = {Cai, Zexin and Zhang, Chuxiong and Li, Ming}, title = {{From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint}}, year = {2020}, booktitle = {Conference of the International Speech Communication Association (INTERSPEECH)}, pages = {3974--3978}, doi = {10.21437/Interspeech.2020-1032}, }
Polyphone Disambiguation for Mandarin Chinese Using Conditional Neural Network with Multi-level Embedding Features

Zexin Cai, Chuxiong Zhang, and Ming Li

In Conference of the International Speech Communication Association (INTERSPEECH), 2019

Abs Bib Paper

This paper describes a conditional neural network architecture for Mandarin Chinese polyphone disambiguation. The system is composed of a bidirectional recurrent neural network component acting as a sentence encoder to accumulate the context correlations, followed by a prediction network that maps the polyphonic character embeddings along with the conditions to corresponding pronunciations. We obtain the word-level condition from a pre-trained word-to-vector lookup table. One goal of polyphone disambiguation is to address the homograph problem existing in the front-end processing of Mandarin Chinese text-to-speech system. Our system achieves an accuracy of 94.69% on a publicly available polyphonic character dataset. To further validate our choices on the conditional feature, we investigate polyphone disambiguation systems with multi-level conditions respectively. The experimental results show that both the sentence-level and the word-level conditional embedding features are able to attain good performance for Mandarin Chinese polyphone disambiguation.
@inproceedings{cai2019polyphone, author = {Cai, Zexin and Zhang, Chuxiong and Li, Ming}, title = {{Polyphone Disambiguation for Mandarin Chinese Using Conditional Neural Network with Multi-level Embedding Features}}, year = {2019}, booktitle = {Conference of the International Speech Communication Association (INTERSPEECH)}, pages = {2110--2114}, doi = {10.21437/Interspeech.2019-1235}, }
F0 Contour Estimation Using Phonetic Feature in Electrolaryngeal Speech Enhancement

Zexin Cai, Zhicheng Xu, and Ming Li

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019

Abs Bib Paper

Pitch plays a significant role in understanding a tone based language like Mandarin. In this paper, we present a new method that estimates F0 contour for electrolaryngeal (EL) speech enhancement in Mandarin. Our system explores the usage of phonetic feature to improve the quality of EL speech. First, we train an acoustic model for EL speech and generate the phoneme posterior probabilities feature sequence for each input EL speech utterance. Then we employ the phonetic feature for F0 contour generation rather than the acoustic feature. The experimental results indicate that the EL speech is significantly enhanced under the adoption of the phonetic feature. Experimental results demonstrate that the proposed method achieves notable improvement regarding the intelligibility and the similarity with normal speech.
@inproceedings{cai2019f0contour, author = {Cai, Zexin and Xu, Zhicheng and Li, Ming}, booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, title = {F0 Contour Estimation Using Phonetic Feature in Electrolaryngeal Speech Enhancement}, year = {2019}, volume = {}, number = {}, pages = {6490-6494}, doi = {10.1109/ICASSP.2019.8683435}, }