In this paper, we adopt the end-to-end VITS framework for high-quality waveform reconstruction. We use HuBERT-Soft to extract clean speech content information and a pre-trained speaker encoder to extract speaker characteristics from the speech. Inspired by the structure of speech compression models, we propose a neural decoder that synthesizes converted speech in the target speaker's voice, adding preprocessing and conditioning networks to receive and interpret the speaker information. We also significantly improve the model's inference speed, achieving real-time voice conversion.
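The conditioning idea above can be sketched with plain NumPy: frame-level content features are combined with a single utterance-level speaker embedding broadcast across time, then projected by a toy "decoder" layer. All dimensions here are illustrative assumptions, not the model's actual sizes, and the random projection stands in for the real neural decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: per-frame content features, an utterance-level
# speaker embedding, and a decoder output (e.g. acoustic features).
T, D_CONTENT, D_SPK, D_OUT = 100, 256, 256, 80

content = rng.standard_normal((T, D_CONTENT))  # frame-level content features
spk_emb = rng.standard_normal(D_SPK)           # utterance-level speaker embedding

# Conditioning: broadcast the speaker embedding across time and concatenate
# it with the content features, so every frame "sees" the target voice.
conditioned = np.concatenate([content, np.tile(spk_emb, (T, 1))], axis=1)

# Toy linear stand-in for the neural decoder.
W = rng.standard_normal((D_CONTENT + D_SPK, D_OUT)) * 0.01
decoded = conditioned @ W  # (T, D_OUT)
```

Because the speaker embedding is time-independent, swapping it for a different speaker's embedding changes the voice while the content features stay fixed.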
Audio samples: https://jinyuanzhang999.github.io/NeuralVC_Demo.github.io/
We also provide the pretrained models.
Figure: Model Framework
Clone this repo: `git clone https://github.com/zzy1hjq/NeutralVC.git`

CD into this repo: `cd NeuralVC`

Install python requirements: `pip install -r requirements.txt`
Download the VCTK dataset (for training only)
Download the pretrained checkpoints and run:
```shell
# inference with NeuralVC
# replace the corresponding parameters in the notebook
convert.ipynb
```
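The overall flow inside the inference notebook follows the three stages described above: extract content from the source utterance, extract a speaker embedding from a target-speaker reference, and decode the two into a converted waveform. The sketch below uses hypothetical stub functions (`extract_content`, `extract_speaker`, `decode`) in place of the real HuBERT-Soft model, speaker encoder, and neural decoder, just to make the data flow concrete:

```python
import numpy as np

# Hypothetical stand-ins for the real models loaded in convert.ipynb.
def extract_content(wav):
    # HuBERT-Soft stand-in: roughly one 256-d frame per 320 samples.
    return np.zeros((len(wav) // 320, 256))

def extract_speaker(wav):
    # Speaker-encoder stand-in: one utterance-level 256-d embedding.
    return np.zeros(256)

def decode(content, spk_emb):
    # Decoder stand-in: upsample frames back to a waveform.
    return np.zeros(content.shape[0] * 320)

source_wav = np.zeros(16000)  # 1 s of source speech at 16 kHz
target_wav = np.zeros(16000)  # reference speech from the target speaker

content = extract_content(source_wav)   # what is said (from the source)
spk_emb = extract_speaker(target_wav)   # how it sounds (from the target)
converted = decode(content, spk_emb)    # converted waveform
```

Note that only the speaker embedding comes from the target speaker; the content path never sees the target audio.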
```shell
# run this if you want a different train-val-test split
python preprocess_flist.py

# run this if you want to use the pretrained speaker encoder
python preprocess_spk.py

# run this if you want to use a different content feature extractor
python preprocess_code.py

# train NeuralVC
python train.py
```
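The file-list step above produces the train-val-test split the trainer consumes. A minimal sketch of such a split, assuming a flat list of VCTK-style `.wav` paths and made-up split sizes and output filenames (the actual script's paths and arguments may differ):

```python
import random

# Hypothetical utterance list: 5 speakers x 20 utterances each.
wav_paths = [
    f"dataset/vctk-16k/p{225 + i}/{j:03d}.wav"
    for i in range(5)
    for j in range(20)
]

# Shuffle deterministically, then carve out small val/test sets.
random.Random(1234).shuffle(wav_paths)
n_val, n_test = 5, 5
val = wav_paths[:n_val]
test = wav_paths[n_val:n_val + n_test]
train = wav_paths[n_val + n_test:]

# Write one path per line, one file per split.
for name, split in [("train", train), ("val", val), ("test", test)]:
    with open(f"filelist_{name}.txt", "w") as f:
        f.write("\n".join(split))
```

Fixing the shuffle seed keeps the split reproducible across runs, which matters when comparing checkpoints trained at different times.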