In this paper, we adopt the end-to-end VITS framework for high-quality waveform reconstruction. We use HuBERT-Soft to extract clean speech content information and a pre-trained speaker encoder to extract speaker characteristics from the speech. Inspired by the structure of speech compression models, we propose a neural decoder that synthesizes converted speech in the target speaker's voice, adding preprocessing and conditioning networks to receive and interpret the speaker information. We also significantly improve the model's inference speed, achieving real-time voice conversion.
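The conditioning idea above can be sketched with plain NumPy: frame-level content features are combined with a single utterance-level speaker embedding broadcast across time, then projected by a toy "decoder" layer. All dimensions here are illustrative assumptions, not the model's actual sizes, and the random projection stands in for the real neural decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: per-frame content features, an utterance-level
# speaker embedding, and a decoder output (e.g. acoustic features).
T, D_CONTENT, D_SPK, D_OUT = 100, 256, 256, 80

content = rng.standard_normal((T, D_CONTENT))  # frame-level content features
spk_emb = rng.standard_normal(D_SPK)           # utterance-level speaker embedding

# Conditioning: broadcast the speaker embedding across time and concatenate
# it with the content features, so every frame "sees" the target voice.
conditioned = np.concatenate([content, np.tile(spk_emb, (T, 1))], axis=1)

# Toy linear stand-in for the neural decoder.
W = rng.standard_normal((D_CONTENT + D_SPK, D_OUT)) * 0.01
decoded = conditioned @ W  # (T, D_OUT)
```

Because the speaker embedding is time-independent, swapping it for a different speaker's embedding changes the voice while the content features stay fixed.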
Audio samples: https://jinyuanzhang999.github.io/NeuralVC_Demo.github.io/
We also provide the pretrained models.
Figure: Model Framework
Clone this repo: `git clone https://github.com/zzy1hjq/NeutralVC.git`

CD into this repo: `cd NeuralVC`

Install python requirements: `pip install -r requirements.txt`
Download the VCTK dataset (for training only)
Download the pretrained checkpoints and run:
```shell
# inference with NeuralVC
# replace the corresponding parameters in the notebook
convert.ipynb
```
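The overall flow inside the inference notebook follows the three stages described above: extract content from the source utterance, extract a speaker embedding from a target-speaker reference, and decode the two into a converted waveform. The sketch below uses hypothetical stub functions (`extract_content`, `extract_speaker`, `decode`) in place of the real HuBERT-Soft model, speaker encoder, and neural decoder, just to make the data flow concrete:

```python
import numpy as np

# Hypothetical stand-ins for the real models loaded in convert.ipynb.
def extract_content(wav):
    # HuBERT-Soft stand-in: roughly one 256-d frame per 320 samples.
    return np.zeros((len(wav) // 320, 256))

def extract_speaker(wav):
    # Speaker-encoder stand-in: one utterance-level 256-d embedding.
    return np.zeros(256)

def decode(content, spk_emb):
    # Decoder stand-in: upsample frames back to a waveform.
    return np.zeros(content.shape[0] * 320)

source_wav = np.zeros(16000)  # 1 s of source speech at 16 kHz
target_wav = np.zeros(16000)  # reference speech from the target speaker

content = extract_content(source_wav)   # what is said (from the source)
spk_emb = extract_speaker(target_wav)   # how it sounds (from the target)
converted = decode(content, spk_emb)    # converted waveform
```

Note that only the speaker embedding comes from the target speaker; the content path never sees the target audio.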
```shell
# run this if you want a different train-val-test split
python preprocess_flist.py

# run this if you want to use the pretrained speaker encoder
python preprocess_spk.py

# run this if you want to use a different content feature extractor
python preprocess_code.py

# train NeuralVC
python train.py
```
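The file-list step above produces the train-val-test split the trainer consumes. A minimal sketch of such a split, assuming a flat list of VCTK-style `.wav` paths and made-up split sizes and output filenames (the actual script's paths and arguments may differ):

```python
import random

# Hypothetical utterance list: 5 speakers x 20 utterances each.
wav_paths = [
    f"dataset/vctk-16k/p{225 + i}/{j:03d}.wav"
    for i in range(5)
    for j in range(20)
]

# Shuffle deterministically, then carve out small val/test sets.
random.Random(1234).shuffle(wav_paths)
n_val, n_test = 5, 5
val = wav_paths[:n_val]
test = wav_paths[n_val:n_val + n_test]
train = wav_paths[n_val + n_test:]

# Write one path per line, one file per split.
for name, split in [("train", train), ("val", val), ("test", test)]:
    with open(f"filelist_{name}.txt", "w") as f:
        f.write("\n".join(split))
```

Fixing the shuffle seed keeps the split reproducible across runs, which matters when comparing checkpoints trained at different times.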