Cobra Forum

Plesk Panel => Web Application => Topic started by: mahesh on Dec 23, 2023, 06:33 AM

Title: Voice Swap using NVIDIA NeMo on Vultr Cloud GPU
Post by: mahesh on Dec 23, 2023, 06:33 AM

Voice Swap using NVIDIA NeMo on Vultr Cloud GPU
(https://pix.cobrasoft.org/images/2023/12/23/TaYFWS2.png)
Introduction
Neural Modules (NeMo) is an open-source toolkit designed for users who work with conversational AI. It's part of the NVIDIA GPU Cloud (NGC) collection, which includes a library of tools and ready-to-use models designed to efficiently handle artificial intelligence and high-performance computing projects.

This article explains how to perform voice swap using the NVIDIA NeMo framework on a Vultr Cloud GPU server. You perform tasks such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) using a PyTorch GPU-accelerated container from the NGC Catalog. In addition, you convert an English male voice audio sample to an English female voice audio sample by running pre-trained NeMo models for speech and Natural Language Processing (NLP) tasks.

Prerequisites
Before you begin, be sure to:

Deploy a Vultr Cloud GPU server
Access the server using SSH as a non-root user with sudo privileges

Deploy the PyTorch GPU Container and Access Jupyter Notebook
In this section, you install and run the PyTorch GPU container with port binding, then access the Jupyter Notebook instance pre-installed in the container.

1.Install and run the PyTorch GPU container

$ sudo docker run --gpus all -p 9000:8888 -it nvcr.io/nvidia/pytorch:23.09-py3
The above command runs the PyTorch GPU-accelerated container with the following values:

--gpus all: Allocates all available host server GPU resources to the container
-p 9000:8888: Maps host port 9000 to the container port 8888, so you can access Jupyter Notebook on a port different from the host default
-it: Interactively starts a new shell session of the container terminal
When successful, verify that you can access the container shell

2.Start a new Jupyter Notebook instance

# jupyter notebook --ip=0.0.0.0
Your output should look like the one below:

     To access the notebook, open this file in a browser:
         file:///root/.local/share/jupyter/runtime/nbserver-369-open.html
     Or copy and paste this URL:
         http://hostname:8888/?token=c5b30aac114cd01d225975a9c57eafb630a5659dde4c65a8
Copy your generated access token to securely access the Jupyter Notebook instance in your web browser

3.In a web browser such as Chrome, access Jupyter Notebook using the generated access token

http://SERVER-IP:9000/?token=YOUR_TOKEN
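If you prefer to build the browser URL programmatically, the token can be extracted from the URL that Jupyter prints inside the container. Below is a minimal sketch using Python's standard library; the rebuild_notebook_url helper and the example IP address are hypothetical, not part of Jupyter:

```python
from urllib.parse import urlsplit, parse_qs

def rebuild_notebook_url(printed_url, server_ip, host_port):
    """Extract the access token from the URL Jupyter prints inside the
    container and rebuild the URL for the host-mapped port."""
    token = parse_qs(urlsplit(printed_url).query)["token"][0]
    return f"http://{server_ip}:{host_port}/?token={token}"

# Example with the sample URL from the Jupyter output above:
url = rebuild_notebook_url(
    "http://hostname:8888/?token=c5b30aac114cd01d225975a9c57eafb630a5659dde4c65a8",
    "192.0.2.10",   # replace with your actual server IP
    9000,           # the host port mapped with -p 9000:8888
)
print(url)
```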
Run the Pre-Trained Models
In this section, you install the required libraries, import the NeMo modules, initialize the pre-trained models, and perform the voice swap tasks described in the steps below.

1.Access your Jupyter Notebook web interface

2.On the middle right bar, click the New dropdown to reveal the options list

(https://pix.cobrasoft.org/images/2023/12/23/vksiEEV.jpg)
3.Click Notebook, and select Python 3 (ipykernel) to open a new file

4.In a new code cell, install dependency packages

!pip install Cython nemo_toolkit[all] hydra-core transformers sentencepiece webdataset youtokentome pyannote.metrics jiwer ijson sacremoses sacrebleu rouge_score einops unidic-lite mecab-python3 opencc pangu ipadic wandb nemo_text_processing pytorch-lightning
The above command installs the NeMo toolkit together with its dependencies for text processing, tokenization, model training, and evaluation.
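Because the dependency list is long, you can optionally confirm that the key packages installed correctly before proceeding. Below is a minimal standard-library sketch; the missing_packages helper is a hypothetical convenience, not part of NeMo:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of the given top-level package names that are
    not importable in the current environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Re-run pip for any package names this prints:
print(missing_packages(["nemo", "transformers", "sentencepiece", "jiwer"]))
```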

5.Import the necessary modules

 import nemo
 import nemo.collections.asr as nemo_asr
 import nemo.collections.nlp as nemo_nlp
 import nemo.collections.tts as nemo_tts
 import IPython
Below is what each of the imported modules represents:

nemo: The core NeMo framework
nemo.collections.asr: The Automatic Speech Recognition (ASR) model collection
nemo.collections.nlp: The Natural Language Processing (NLP) model collection
nemo.collections.tts: The Text-to-Speech (TTS) model collection
IPython: Used to display and play audio inside the notebook

6.Open the NGC NeMo catalog

 nemo_asr.models.EncDecCTCModel.list_available_models()
 nemo_tts.models.HifiGanModel.list_available_models()
 nemo_tts.models.FastPitchModel.list_available_models()
The above commands output the lists of available pre-trained models in each collection.

From the available lists, use the following models:

stt_en_quartznet15x5: A QuartzNet ASR model that transcribes English speech to text
tts_en_fastpitch: A FastPitch model that generates a spectrogram from text
tts_en_hifigan: A HiFi-GAN vocoder that converts a spectrogram to audio

7.Download and initialize the models

 quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name='stt_en_quartznet15x5').cuda()
 spec_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name='tts_en_fastpitch').cuda()
 vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan").cuda()
The download and initialization may take up to 15 minutes to complete.

Perform Voice Swapping
1.Import the audio sample. Replace the URL with your desired audio source

 Audio_sample = '2086-149220-0033.wav'
 !wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
 IPython.display.Audio(Audio_sample)
The above command downloads an English audio .wav file with the male voice from the provided URL. Then, it uses IPython.display.Audio to display and play the audio in your Jupyter Notebook file.
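Before transcribing, you can optionally inspect the downloaded sample's format with Python's built-in wave module. The wav_info helper below is a hypothetical convenience, not part of NeMo:

```python
import wave

def wav_info(path):
    """Return (channels, sample_rate, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels(), w.getframerate(),
                w.getnframes() / w.getframerate())

# In the notebook: print(wav_info(Audio_sample))
```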

2.Transcribe the audio sample

 files = [Audio_sample]
 raw_text = ''
 for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):
     raw_text = transcription

 print(raw_text)
The above command transcribes the provided audio sample to text using the QuartzNet model.

Output:

 well i don't wish to see it any more observed phoebe turning away her eyes it is certainly very like the old portrait
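The jiwer package installed earlier computes the Word Error Rate (WER) used to evaluate transcriptions like the one above. As an illustration of what that metric measures, here is a minimal standard-library sketch (not the jiwer implementation):

```python
def word_error_rate(reference, hypothesis):
    """Minimal WER: word-level Levenshtein distance divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

# Identical strings give a WER of 0.0:
print(word_error_rate("it is certainly very like the old portrait",
                      "it is certainly very like the old portrait"))
```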
3.Generate the spectrogram

 def text_to_audio(text):
   parsed = spec_generator.parse(text)
   spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
   audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
   return audio.to('cpu').detach().numpy()
In the above command, the text_to_audio function takes a transcript, parses it, and generates a spectrogram using the FastPitch model tts_en_fastpitch. The HiFi-GAN vocoder tts_en_hifigan then converts the spectrogram to audio. The spectrogram is an intermediate representation in text-to-speech synthesis that encodes the spectral characteristics of the generated audio.

4.Generate the swapped audio

 IPython.display.Audio(text_to_audio(raw_text), rate=22050)
The above command displays the swapped audio sample converted from a male English voice to a female English voice.
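To keep the result outside the notebook, you can write the generated waveform to a .wav file with Python's built-in wave module. The sketch below assumes the model output is a mono float waveform with values in [-1.0, 1.0]; the save_wav helper is hypothetical, and a short sine tone stands in for the model output:

```python
import math
import struct
import wave

def save_wav(samples, path, rate=22050):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.
    In the notebook you would pass the waveform from text_to_audio here."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)      # 16-bit samples
        w.setframerate(rate)   # match the 22050 Hz rate used above
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        w.writeframes(frames)

# Demo with a 0.1-second 440 Hz tone instead of the model output:
tone = [math.sin(2 * math.pi * 440 * t / 22050) for t in range(2205)]
save_wav(tone, "swapped.wav")
```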

Conclusion
You have built an AI voice swap system using NeMo framework pre-trained models running in an NGC GPU-accelerated container, and converted an English male voice audio sample to an English female voice audio sample. Using NeMo modules and pre-trained models from the NGC catalog makes the speech processing pipeline more efficient and convenient to use.

More Information
For more information, visit the following resources: