Cobra Forum

Plesk Panel => Web Application => Topic started by: mahesh on Dec 23, 2023, 06:33 AM

Title: Voice Swap using NVIDIA NeMo on Vultr Cloud GPU
Post by: mahesh on Dec 23, 2023, 06:33 AM

Voice Swap using NVIDIA NeMo on Vultr Cloud GPU
(https://pix.cobrasoft.org/images/2023/12/23/TaYFWS2.png)
Introduction
Neural Modules (NeMo) is an open-source toolkit designed for users who work with conversational AI. It's part of the NVIDIA GPU Cloud (NGC) collection, which includes a library of tools and ready-to-use models designed to efficiently handle artificial intelligence and high-performance computing projects.

This article explains how to perform voice swap using the NVIDIA NeMo framework on a Vultr Cloud GPU server. You perform tasks such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) using a PyTorch GPU-accelerated container from the NGC Catalog. In addition, you convert an English male voice audio sample to an English female voice audio sample by running pre-trained NeMo models for speech and Natural Language Processing (NLP) tasks.

Prerequisites
Before you begin, be sure to:

Deploy a Vultr Cloud GPU server
Access the server using SSH as a non-root user with sudo privileges

Deploy the PyTorch GPU Container and Access Jupyter Notebook
In this section, you install and run the PyTorch GPU container with port binding, then access the Jupyter Notebook instance pre-installed in the container.

1.Install and run the PyTorch GPU container

$ sudo docker run --gpus all -p 9000:8888 -it nvcr.io/nvidia/pytorch:23.09-py3
The above command runs the PyTorch GPU-accelerated container with the following values:

--gpus all: Allocates all available host server GPU resources to the container
-p 9000:8888: Maps host port 9000 to the container port 8888, so you can access Jupyter Notebook on a port different from the host default
-it: Interactively starts a new shell session of the container terminal
When successful, verify that you can access the container shell

2.Start a new Jupyter Notebook instance

# jupyter notebook --ip=0.0.0.0
Your output should look like the one below:

     To access the notebook, open this file in a browser:
         file:///root/.local/share/jupyter/runtime/nbserver-369-open.html
     Or copy and paste this URL:
         http://hostname:8888/?token=c5b30aac114cd01d225975a9c57eafb630a5659dde4c65a8
Copy your generated access token to securely access the Jupyter Notebook instance in your web browser

3.In a web browser such as Chrome, access Jupyter Notebook using the generated access token

http://SERVER-IP:9000/?token=YOUR_TOKEN
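If you prefer to build the browser URL programmatically, the token can be extracted from the URL that Jupyter prints inside the container. Below is a minimal sketch using Python's standard library; the rebuild_notebook_url helper and the example IP address are hypothetical, not part of Jupyter:

```python
from urllib.parse import urlsplit, parse_qs

def rebuild_notebook_url(printed_url, server_ip, host_port):
    """Extract the access token from the URL Jupyter prints inside the
    container and rebuild the URL for the host-mapped port."""
    token = parse_qs(urlsplit(printed_url).query)["token"][0]
    return f"http://{server_ip}:{host_port}/?token={token}"

# Example with the sample URL from the Jupyter output above:
url = rebuild_notebook_url(
    "http://hostname:8888/?token=c5b30aac114cd01d225975a9c57eafb630a5659dde4c65a8",
    "192.0.2.10",   # replace with your actual server IP
    9000,           # the host port mapped with -p 9000:8888
)
print(url)
```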
Run the Pre-Trained Models
In this section, you install the required libraries, import the NeMo modules, initialize the pre-trained models, and perform the voice swap tasks described in the steps below.

1.Access your Jupyter Notebook web interface

2.On the middle right bar, click the New dropdown to reveal the options list

(https://pix.cobrasoft.org/images/2023/12/23/vksiEEV.jpg)
3.Click Notebook, and select Python 3 (ipykernel) to open a new file

4.In a new code cell, install dependency packages

!pip install Cython nemo_toolkit[all] hydra-core transformers sentencepiece webdataset youtokentome pyannote.metrics jiwer ijson sacremoses sacrebleu rouge_score einops unidic-lite mecab-python3 opencc pangu ipadic wandb nemo_text_processing pytorch-lightning
The above command installs the NeMo toolkit together with its dependencies for text processing, tokenization, model training, and evaluation.
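Because the dependency list is long, you can optionally confirm that the key packages installed correctly before proceeding. Below is a minimal standard-library sketch; the missing_packages helper is a hypothetical convenience, not part of NeMo:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of the given top-level package names that are
    not importable in the current environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Re-run pip for any package names this prints:
print(missing_packages(["nemo", "transformers", "sentencepiece", "jiwer"]))
```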

5.Import the necessary modules

 import nemo
 import nemo.collections.asr as nemo_asr
 import nemo.collections.nlp as nemo_nlp
 import nemo.collections.tts as nemo_tts
 import IPython
Below is what each of the imported modules represents:

nemo: The core NeMo framework
nemo.collections.asr: The Automatic Speech Recognition (ASR) model collection
nemo.collections.nlp: The Natural Language Processing (NLP) model collection
nemo.collections.tts: The Text-to-Speech (TTS) model collection
IPython: Used to display and play audio inside the notebook

6.Open the NGC NeMo catalog

 nemo_asr.models.EncDecCTCModel.list_available_models()
 nemo_tts.models.HifiGanModel.list_available_models()
 nemo_tts.models.FastPitchModel.list_available_models()
The above commands output the lists of available pre-trained models in each collection.

From the available lists, use the following models:

stt_en_quartznet15x5: A QuartzNet ASR model that transcribes English speech to text
tts_en_fastpitch: A FastPitch model that generates a spectrogram from text
tts_en_hifigan: A HiFi-GAN vocoder that converts a spectrogram to audio

7.Download and initialize the models

 quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name='stt_en_quartznet15x5').cuda()
 spec_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name='tts_en_fastpitch').cuda()
 vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan").cuda()
The download and initialization may take up to 15 minutes to complete.

Perform Voice Swapping
1.Import the audio sample. Replace the URL with your desired audio source

 Audio_sample = '2086-149220-0033.wav'
 !wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
 IPython.display.Audio(Audio_sample)
The above command downloads an English audio .wav file with the male voice from the provided URL. Then, it uses IPython.display.Audio to display and play the audio in your Jupyter Notebook file.
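Before transcribing, you can optionally inspect the downloaded sample's format with Python's built-in wave module. The wav_info helper below is a hypothetical convenience, not part of NeMo:

```python
import wave

def wav_info(path):
    """Return (channels, sample_rate, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels(), w.getframerate(),
                w.getnframes() / w.getframerate())

# In the notebook: print(wav_info(Audio_sample))
```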

2.Transcribe the audio sample

 files = [Audio_sample]
 raw_text = ''
 for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):
     raw_text = transcription

 print(raw_text)
The above command transcribes the provided audio sample to text using the QuartzNet model.

Output:

 well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait
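The jiwer package installed earlier computes the Word Error Rate (WER) used to evaluate transcriptions like the one above. As an illustration of what that metric measures, here is a minimal standard-library sketch (not the jiwer implementation):

```python
def word_error_rate(reference, hypothesis):
    """Minimal WER: word-level Levenshtein distance divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

# Identical strings give a WER of 0.0:
print(word_error_rate("it is certainly very like the old portrait",
                      "it is certainly very like the old portrait"))
```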
3.Generate the spectrogram

 def text_to_audio(text):
   parsed = spec_generator.parse(text)
   spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
   audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
   return audio.to('cpu').detach().numpy()
In the above command, the text_to_audio function takes a transcript, parses it, and generates a spectrogram using the FastPitch model tts_en_fastpitch. The HiFi-GAN vocoder tts_en_hifigan then converts the spectrogram to audio. The spectrogram is an intermediate representation in text-to-speech synthesis that encodes the spectral characteristics of the generated audio.

4.Generate the swapped audio

 IPython.display.Audio(text_to_audio(raw_text), rate=22050)
The above command displays the swapped audio sample converted from a male English voice to a female English voice.
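To keep the result outside the notebook, you can write the generated waveform to a .wav file with Python's built-in wave module. The sketch below assumes the model output is a mono float waveform with values in [-1.0, 1.0]; the save_wav helper is hypothetical, and a short sine tone stands in for the model output:

```python
import math
import struct
import wave

def save_wav(samples, path, rate=22050):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.
    In the notebook you would pass the waveform from text_to_audio here."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)      # 16-bit samples
        w.setframerate(rate)   # match the 22050 Hz rate used above
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        w.writeframes(frames)

# Demo with a 0.1-second 440 Hz tone instead of the model output:
tone = [math.sin(2 * math.pi * 440 * t / 22050) for t in range(2205)]
save_wav(tone, "swapped.wav")
```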

Conclusion
You have built an AI voice swap system using NeMo framework pre-trained models running in an NGC GPU-accelerated container, and converted an English male voice audio sample to an English female voice audio sample. Using NeMo modules and pre-trained models from the NGC catalog makes the speech processing pipeline more efficient and convenient to use.

More Information
For more information, visit the following resources: