Cobra Forum

Plesk Panel => Web Application => Topic started by: mahesh on Dec 23, 2023, 07:16 AM

Title: How to Build Voice Translation Application Using NVIDIA NeMo
Post by: mahesh on Dec 23, 2023, 07:16 AM
How to Build Voice Translation Application Using NVIDIA NeMo
(https://pix.cobrasoft.org/image/tRMV)
Introduction
Neural Modules (NeMo) is an open-source toolkit designed to handle conversational AI tasks. It's part of NVIDIA's GPU Cloud (NGC) catalog which consists of a centralized repository of tools, frameworks, and pre-trained models. These models speed up the development, deployment, and management of Artificial Intelligence and high-performance computing workloads. NGC GPU accelerated containers are also an essential part of the NGC catalog pre-configured with optimized software and libraries to take advantage of GPU resources for accelerated performance.

This article explains how to use the NeMo framework in a GPU-accelerated PyTorch container to perform Automatic Speech Recognition (ASR), Neural Machine Translation (NMT), and Text-to-Speech (TTS) tasks. You will install and run the PyTorch container, then use NeMo pre-trained models to convert a French audio sample into English audio.

Prerequisites
Before you begin:

Deploy a cloud GPU server (this guide uses a Vultr Cloud GPU instance) with the NVIDIA GPU drivers and Docker installed.

Create a non-root user with sudo privileges and switch to the account:

# su sysadmin
Install PyTorch and Access Jupyter Notebook
To use the NeMo framework on a cloud GPU server, install and run the PyTorch GPU container with port binding using Docker. Then, access the Jupyter Notebook service pre-installed in the container as described in the steps below.

1. Using Docker, install and run the PyTorch GPU container

$ sudo docker run --gpus all -p 9000:8888 -it nvcr.io/nvidia/pytorch:23.09-py3
The above command runs the PyTorch GPU-accelerated Docker container with the following configurations:

--gpus all: Attaches all available server GPUs to the container.
-p 9000:8888: Maps the container's Jupyter Notebook port 8888 to port 9000 on the host server.
-it: Runs the container in interactive mode with an attached terminal.
nvcr.io/nvidia/pytorch:23.09-py3: Specifies the PyTorch container image from the NVIDIA NGC catalog.

When successful, verify that your server prompt changes to the root container shell

root@4a09da260af2:/workspace#
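
Optionally, verify that the container can access the server GPUs. nvidia-smi is the standard NVIDIA management utility, which the NVIDIA container runtime makes available inside containers started with --gpus all:

# nvidia-smi
The output should list each attached GPU together with the installed driver and CUDA versions.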
2. Start Jupyter Notebook as a background process

# jupyter notebook --ip=0.0.0.0 &
Your output should look like the one below:

     To access the notebook, open this file in a browser:
     file:///root/.local/share/jupyter/runtime/nbserver-369-open.html
 Or copy and paste this URL:
     http://hostname:8888/?token=c5b30aac114cd01d225975a9c57eafb630a5659dde4c65a8
As displayed in the above output, copy the generated token value that follows ?token= to securely access Jupyter Notebook in your web browser.
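
If the token scrolls out of view, you can display it again. The following standard Jupyter command lists the running Notebook servers together with their access URLs:

# jupyter notebook list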

3. Using a web browser such as Chrome, access Jupyter Notebook at your server's public IP address on port 9000 using the generated access token

http://SERVER-IP:9000/?token=YOUR_TOKEN
(https://pix.cobrasoft.org/images/2023/12/23/aZ1rfBy.png)
Run the Pre-Trained Models
To use pre-trained models and necessary NeMo functions, import the NeMo modules. Then, initialize the pre-trained models, and perform tasks like audio transcription and text-to-speech synthesis in a Jupyter Notebook session as described below.

1. Access the Jupyter Notebook web interface

2. On the top right, click the New dropdown to reveal a list of options

Create a new Jupyter Notebook

3. Select Python 3 (ipykernel) under the Notebook: category to open a new file

4. Within the new Jupyter Notebook file, add the following code in a new cell to install the necessary dependency packages

!pip install Cython nemo_toolkit[all] hydra-core transformers sentencepiece webdataset youtokentome pyannote.metrics jiwer ijson sacremoses sacrebleu rouge_score einops unidic-lite mecab-python3 opencc pangu ipadic wandb nemo_text_processing pytorch-lightning
The above command installs the NeMo toolkit along with the supporting packages it requires for text processing, tokenization, training, and evaluation.

5. Click Run on the menu bar or press CTRL + ENTER to run the cell and install the packages

6. In a new code cell, import the necessary modules

 import nemo
 import nemo.collections.asr as nemo_asr
 import nemo.collections.nlp as nemo_nlp
 import nemo.collections.tts as nemo_tts
 import IPython
The above commands import the necessary modules required to run the NeMo pre-trained models. Below is what each module represents:

nemo: The core NeMo package.
nemo.collections.asr: The collection of automatic speech recognition (ASR) models and functions.
nemo.collections.nlp: The collection of natural language processing models used for machine translation.
nemo.collections.tts: The collection of text-to-speech (TTS) models and functions.
IPython: Used to display and play audio files within the notebook session.
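
Optionally, confirm that the toolkit imported correctly by printing its version in a new cell. This is a minimal check using the package's standard __version__ attribute:

 print(nemo.__version__)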

7. List the available pre-trained models in the NGC NeMo catalog

 nemo_asr.models.EncDecCTCModel.list_available_models()
 nemo_nlp.models.MTEncDecModel.list_available_models()
 nemo_tts.models.HifiGanModel.list_available_models()
 nemo_tts.models.FastPitchModel.list_available_models()
The above commands list all available pre-trained models in the following categories:

Automatic speech recognition (EncDecCTCModel)
Neural machine translation (MTEncDecModel)
Text-to-speech vocoders (HifiGanModel)
Text-to-speech spectrogram generators (FastPitchModel)
8. Initialize the models

 # French speech-to-text (ASR) model
 asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name='stt_fr_quartznet15x5').cuda()
 # French-to-English translation model
 nmt_model = nemo_nlp.models.MTEncDecModel.from_pretrained(model_name='nmt_fr_en_transformer12x2').cuda()
 # English text-to-spectrogram model
 spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name='tts_en_fastpitch').cuda()
 # Spectrogram-to-audio vocoder model
 vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name='tts_en_lj_hifigan_ft_mixertts').cuda()
Wait for the initialization process to complete. Downloading and initializing all four models can take 15 minutes or longer depending on your network speed.
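
Optionally, switch the models to evaluation mode before running inference. This is a minimal sketch using the standard PyTorch eval() method:

 # Disable training-only behavior such as dropout
 asr_model.eval()
 nmt_model.eval()
 spectrogram_generator.eval()
 vocoder.eval()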

Perform Audio Transcription and Synthesis
1. Download a French audio sample. Replace the link with your desired audio source URL

 !wget 'https://lightbulblanguages.co.uk/resources/audio/bonjour.mp3'
 audio_sample = 'bonjour.mp3'
 IPython.display.Audio(audio_sample)
The above commands download the public French MP3 sample audio file bonjour.mp3 and save it in your Jupyter Notebook working directory. In addition, they use IPython's Audio widget to display and play the audio file in your Jupyter Notebook session
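
Speech recognition models such as stt_fr_quartznet15x5 expect 16 kHz mono audio, and MP3 decoding depends on the audio backends available in the container. If transcription of the MP3 file fails, the sketch below converts it to WAV first; it assumes the librosa and soundfile packages are available, which the nemo_toolkit[all] installation typically pulls in:

 import librosa
 import soundfile as sf

 # Load the MP3 file and resample it to 16 kHz mono
 audio_data, sample_rate = librosa.load(audio_sample, sr=16000, mono=True)

 # Save a WAV copy and transcribe that file instead
 sf.write('bonjour.wav', audio_data, 16000)
 audio_sample = 'bonjour.wav'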

2. Transcribe the audio sample to text

 transcribed_text = asr_model.transcribe([audio_sample])
 print(transcribed_text)
The above commands use the speech recognition model to transcribe the audio content and display the resulting text

Output:

['bonjour']
3. Translate the text to English

 english_text = nmt_model.translate(transcribed_text)
 print(english_text)
The above commands use the pre-trained translation model to convert the French text to English and display the converted text

Output:

['hello']
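The translate() method accepts any list of source-language strings, so you can also test the translation model on your own text, independently of the audio pipeline. For example:

 print(nmt_model.translate(['bonjour tout le monde']))
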
4. Generate a spectrogram

 parseText = spectrogram_generator.parse(english_text[0])
 spectrogram = spectrogram_generator.generate_spectrogram(tokens=parseText)
The above commands convert the English text into a spectrogram. This is a preprocessing step in text-to-speech synthesis; the spectrogram represents the spectral characteristics of the audio to generate
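
To inspect this intermediate result, you can optionally plot the spectrogram in a new cell. The sketch below assumes the matplotlib package is available; install it with pip if it is missing:

 import matplotlib.pyplot as plt

 # Move the spectrogram to the CPU and drop the batch dimension for plotting
 spec = spectrogram.to('cpu').detach().numpy()[0]
 plt.imshow(spec, origin='lower', aspect='auto')
 plt.xlabel('Frames')
 plt.ylabel('Mel bins')
 plt.show()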

5. Convert the spectrogram to audio

 audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
 audioOutput = audio.to('cpu').detach().numpy()
The above commands pass the spectrogram through the vocoder to generate an audio waveform, then move the result to the CPU and convert it to a NumPy array for playback
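
To keep a copy of the result, you can optionally write the waveform to disk using the soundfile package installed earlier with the NeMo dependencies; the file name hello_en.wav is illustrative:

 import soundfile as sf

 # FastPitch and HiFi-GAN generate audio at a 22050 Hz sample rate
 sf.write('hello_en.wav', audioOutput[0], samplerate=22050)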

6. Play the generated audio

IPython.display.Audio(audioOutput, rate=22050)
Verify that the synthesized audio matches your translated English text. The audio plays at a sample rate of 22050 Hz

Conclusion
You have built an AI translator using NeMo framework pre-trained models and an NGC GPU-accelerated container on a Vultr Cloud GPU server. You transcribed a French audio sample to French text, translated the French text to English, and synthesized the English text into an English audio sample. Using NeMo modules and pre-trained models from the NGC catalog makes the speech translation pipeline efficient and convenient to build.
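
As a recap, the individual steps in this guide combine into a single notebook cell. The following is a minimal sketch that assumes the four models above are already initialized; the translate_audio function name is illustrative, not part of NeMo:

 def translate_audio(audio_path):
     # Speech to French text
     french_text = asr_model.transcribe([audio_path])
     # French text to English text
     english_text = nmt_model.translate(french_text)
     # English text to spectrogram
     tokens = spectrogram_generator.parse(english_text[0])
     spec = spectrogram_generator.generate_spectrogram(tokens=tokens)
     # Spectrogram to audio waveform
     audio = vocoder.convert_spectrogram_to_audio(spec=spec)
     return audio.to('cpu').detach().numpy()

 IPython.display.Audio(translate_audio('bonjour.mp3'), rate=22050)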

More Information
For more information, visit the following documentation resources:

NVIDIA NeMo documentation: https://docs.nvidia.com/deeplearning/nemo/user-guide/
NVIDIA NGC catalog: https://catalog.ngc.nvidia.com/
NeMo on GitHub: https://github.com/NVIDIA/NeMo