NeuralVC

NeuralVC: Any-to-Any Voice Conversion Using a Neural Network Decoder for Real-Time Voice Conversion

In this paper, we adopt the end-to-end VITS framework for high-quality waveform reconstruction. HuBERT-Soft is introduced to extract clean speech content information, and a pre-trained speaker encoder extracts speaker characteristics from the speech. Inspired by the structure of speech compression models, we propose a neural decoder that synthesizes converted speech in the target speaker's voice, adding preprocessing and conditioning networks that receive and interpret the speaker information. We also significantly improve the model's inference speed, achieving real-time voice conversion.
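The pipeline above (content features from HuBERT-Soft, an utterance-level speaker embedding, and a decoder conditioned on that embedding) can be sketched with NumPy stand-ins. Every function, layer, and dimension below is illustrative only, not the actual NeuralVC model:

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_content(wav, dim=256, hop=320):
    """Stand-in for HuBERT-Soft: one content vector per hop of samples."""
    n_frames = len(wav) // hop
    return rng.standard_normal((n_frames, dim))

def speaker_embedding(wav, dim=256):
    """Stand-in for the pre-trained speaker encoder: one utterance-level vector."""
    return rng.standard_normal(dim)

class ConditionedDecoder:
    """Toy decoder: content frames are modulated (scaled and shifted) by a
    conditioning network fed the speaker embedding, then projected to samples."""
    def __init__(self, content_dim=256, spk_dim=256, hop=320):
        self.proj = rng.standard_normal((content_dim, hop)) * 0.01
        self.to_scale = rng.standard_normal((spk_dim, content_dim)) * 0.01
        self.to_shift = rng.standard_normal((spk_dim, content_dim)) * 0.01

    def __call__(self, content, spk):
        scale = 1.0 + spk @ self.to_scale   # conditioning: per-channel scale
        shift = spk @ self.to_shift         # conditioning: per-channel shift
        h = content * scale + shift         # speaker-conditioned content frames
        frames = np.tanh(h @ self.proj)     # (n_frames, hop) sample blocks
        return frames.reshape(-1)           # flatten to a waveform

source = rng.standard_normal(16000)   # 1 s of "source" audio at 16 kHz
target = rng.standard_normal(16000)   # reference audio from the target speaker

decoder = ConditionedDecoder()
converted = decoder(extract_content(source), speaker_embedding(target))
print(converted.shape)  # (16000,): one output sample per input sample
```

The point of the sketch is the conditioning path: the speaker embedding never mixes with the content directly; it only parameterizes how the decoder transforms the content frames.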

Audio samples: https://jinyuanzhang999.github.io/NeuralVC_Demo.github.io/

We also provide the pretrained models.

Model Framework

Pre-requisites

  1. Clone this repo: git clone https://github.com/zzy1hjq/NeutralVC.git

  2. CD into this repo: cd NeuralVC

  3. Install python requirements: pip install -r requirements.txt

  4. Download the VCTK dataset (for training only)

Inference Example

Download the pretrained checkpoints and run:

# inference with NeuralVC
# Replace the corresponding parameters
convert.ipynb
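The notebook can also be executed headlessly once its parameters (checkpoint path, source audio, target-speaker reference audio, whatever names the notebook defines) have been edited in place; this uses the standard Jupyter CLI, assuming jupyter is installed via the requirements:

```shell
# run every cell of the notebook and write results back into it
jupyter nbconvert --to notebook --execute --inplace convert.ipynb
```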

Training Example

  1. Preprocess

# run this if you want a different train-val-test split
python preprocess_flist.py

# run this if you want to use pretrained speaker encoder
python preprocess_spk.py

# run this if you want to use a different content feature extractor.
python preprocess_code.py
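A train/val/test file-list split of the kind preprocess_flist.py produces can be sketched as follows; the function name, split sizes, seed, and VCTK-style paths are all hypothetical, and the real script's flags may differ:

```python
import random

def split_filelist(paths, val_n=5, test_n=5, seed=1234):
    """Shuffle deterministically, then carve off test/val, keep the rest for train."""
    paths = sorted(paths)                 # stable order before shuffling
    random.Random(seed).shuffle(paths)
    return {
        "test": paths[:test_n],
        "val": paths[test_n:test_n + val_n],
        "train": paths[test_n + val_n:],
    }

# dummy VCTK-style wav paths for illustration
wavs = [f"DUMMY/p{225 + i}/utt_{i}.wav" for i in range(100)]
splits = split_filelist(wavs)
print(len(splits["train"]), len(splits["val"]), len(splits["test"]))  # 90 5 5
```

Fixing the seed keeps the split reproducible across runs, so regenerated file lists stay consistent with earlier checkpoints.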

  2. Train

# train NeuralVC
python train.py


References