Cobra Forum

Plesk Panel => Web Application => Topic started by: mahesh on Jan 05, 2024, 06:51 AM

Title: How to Use Meta Llama 2 Large Language Model on Vultr Cloud GPU
Post by: mahesh on Jan 05, 2024, 06:51 AM
Introduction
Llama 2 is a family of Large Language Models (LLMs) released by Meta as the successor to Llama 1. Llama 2 models are available in three sizes depending on their parameter scale, ranging from 7 billion to 70 billion parameters: Llama-2-7b, Llama-2-13b, and Llama-2-70b. Llama 2 models are released under a license that permits both research and commercial use.

This article explains how to use the Meta Llama 2 large language model (LLM) on a Vultr Cloud GPU server. You initialize the Llama-2-70b-hf and Llama-2-70b-chat-hf models with quantization, then compare the model weights available in the Llama 2 LLM family.

Prerequisites
Before you begin:

Deploy a new Ubuntu 22.04 A100 Vultr Cloud GPU Server with at least 80 GB of GPU RAM.


Access the Llama 2 LLM Model
In this section, configure your HuggingFace account to access and download the Llama 2 family of models.

1.Request access to Llama2 through the official Meta downloads page.

(https://pix.cobrasoft.org/images/2024/01/05/GW6V2RI.png)
When prompted, enter the same email address as your HuggingFace account, and wait for a Meta confirmation email.

2.Log in to your HuggingFace account.

3.In your account Settings, open the Access Tokens page.

(https://pix.cobrasoft.org/images/2024/01/05/qXyaaRa.png)
4.Click the New token button to set up a new access token.

5.Give the token a name, for example: meta-llama, set the role to read, and click the Generate a token button to save.

(https://pix.cobrasoft.org/images/2024/01/05/5jBvxYB.png)
6.Click the Show option to reveal your token in plain text. Copy the token to your clipboard.

7.In your Hugging Face interface, enter Llama-2-7b in the search bar to open the model page.

8.Click the checkbox to share your information with Meta, and click Submit to request access to the model repository.

(https://pix.cobrasoft.org/images/2024/01/05/dVgwu1B.png)
When successful, you should receive a confirmation email from HuggingFace accepting your request to access the model. This confirms that you can use the model files as permitted by the Meta terms and conditions.
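
Optionally, once you install the model dependencies described later in this guide, you can confirm that your access token works. The following is a minimal sketch that assumes the huggingface_hub package (installed as a dependency of transformers); replace the placeholder with the token you generated above.

 from huggingface_hub import login, whoami

 # Authenticate with your Hugging Face access token (placeholder value).
 login(token='YOUR_AUTHORIZATION_TOKEN')

 # Print the account name associated with the token to confirm it is valid.
 print(whoami()['name'])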

Install the CUDA Toolkit
Running Llama 2 models with lower precision settings requires the CUDA toolkit. Install the toolkit to provide the libraries needed to write and compile GPU-accelerated applications, as described in the steps below.

1.Download the latest CUDA toolkit version.

$ wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
2.Initialize the CUDA toolkit installation.

$ sudo sh cuda_11.8.0_520.61.05_linux.run
When prompted, read the CUDA terms and conditions, and enter accept to agree to the toolkit license. Then, in the installation prompt, press SPACE to deselect all provided options except the CUDA toolkit. Using the arrow keys, scroll to the Install option and press ENTER to start the installation process.

3.Using echo, append the following configurations at the end of the ~/.bashrc file.

$ echo " export PATH=$PATH:/usr/local/cuda-11.8/bin
          export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64 " >> ~/.bashrc
The above configuration lines declare the environment variable configurations that allow your system to use the CUDA toolkit and its libraries.

4.Using a text editor such as Vim, edit the /etc/ld.so.conf.d/cuda-11-8.conf file.

$ sudo vim /etc/ld.so.conf.d/cuda-11-8.conf
5.Add the following configuration at the beginning of the file.

/usr/local/cuda-11.8/lib64
Save and close the file.

6.To apply the configuration, end your SSH session.

$ exit
7.Start a new SSH session.

$ ssh example-user@SERVER-IP
8.Run the following ldconfig command to update the linker cache, and refresh information about shared libraries on your server.

$ sudo ldconfig
Install Model Dependencies
To use the model features and tools, install Jupyter Notebook to run commands, then install the required libraries as described in the steps below.

1.Install PyTorch.

$ pip3 install torch --index-url https://download.pytorch.org/whl/cu118
The above command installs the PyTorch library that offers efficient tensor computations and supports GPU acceleration for training operations.

To install a PyTorch version that matches your CUDA version, visit the PyTorch documentation page to set your preferences and run the generated install command.
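
Optionally, verify that PyTorch detects the GPU and the CUDA runtime. The following is a minimal check using standard PyTorch calls; run it in a Python shell, or later in a Jupyter Notebook cell.

 import torch

 # Report the installed PyTorch version and the CUDA version it was built against.
 print(torch.__version__)
 print(torch.version.cuda)

 # True when the A100 GPU is visible to PyTorch.
 print(torch.cuda.is_available())

 if torch.cuda.is_available():
     # Print the detected GPU model name.
     print(torch.cuda.get_device_name(0))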

2.Install dependency packages.

$ pip3 install bitsandbytes scipy transformers accelerate einops xformers
Below is what each package represents:

 bitsandbytes: Loads model weights in 4-bit and 8-bit quantized formats to reduce GPU memory usage.
 scipy: Provides scientific computing routines required by bitsandbytes.
 transformers: Supplies the Hugging Face model, tokenizer, and pipeline interfaces.
 accelerate: Handles device placement and efficient loading of large models.
 einops: Offers readable tensor reshaping and rearrangement operations.
 xformers: Provides memory-efficient attention implementations.

3.Install the Jupyter notebook package.

$ pip3 install notebook
4.Allow incoming connections to the Jupyter Notebook port 8888.

$ sudo ufw allow 8888
5.Start Jupyter Notebook.

$ jupyter notebook --ip=0.0.0.0
If you receive the following error:

 Command 'jupyter' not found, but can be installed with:
End your SSH connection and reconnect to the server so that your shell reloads its PATH and finds the newly installed jupyter command.

When successful, Jupyter Notebook should start with the following output:

 [I 2023-07-31 00:29:42.997 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
 [W 2023-07-31 00:29:42.999 ServerApp] No web browser found: Error('could not locate runnable browser').
 [C 2023-07-31 00:29:42.999 ServerApp]

     To access the server, open this file in a browser:
         file:///home/example-user/.local/share/jupyter/runtime/jpserver-69912-open.html
     Or copy and paste one of these URLs:
         http://HOSTNAME:8888/tree?token=e536707fcc573e0f19be40d90902825ec6e04181bed85be9
         http://127.0.0.1:8888/tree?token=e536707fcc573e0f19be40d90902825ec6e04181bed85be9
As displayed in the above output, copy the generated token URL to securely access Jupyter Notebook in your browser.

6.In a web browser such as Chrome, access Jupyter Notebook using your generated access token.

http://SERVER-IP:8888/tree?token=YOUR_TOKEN
Run Llama 2 70B Model
In this section, initialize the Llama-2-70b-hf model with 4-bit quantization and 16-bit compute precision, and add your Hugging Face authorization token to initialize the model pipeline and tokenizer as described in the steps below.

1.Access the Jupyter Notebook web interface.

2.On the top right bar, click New to reveal a dropdown list.

(https://pix.cobrasoft.org/images/2024/01/05/vksiEEV-1.jpg)
3.Click Notebook, and select Python 3 (ipykernel) to open a new file.

4.In the new Kernel file, click the filename. By default, it's set to Untitled.

5.Rename the file to Llama-2-70b, and press ENTER to save the new filename.

(https://pix.cobrasoft.org/images/2024/01/05/j9VPuZm.png)
6.In a new code cell, initialize the Llama-2-70b-hf model.

from torch import cuda, bfloat16
 import transformers

 model_id = 'meta-llama/Llama-2-70b-hf'

 device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

 quant_config = transformers.BitsAndBytesConfig(
     load_in_4bit=True,
     bnb_4bit_quant_type='nf4',
     bnb_4bit_use_double_quant=True,
     bnb_4bit_compute_dtype=bfloat16
 )

 auth_token = 'YOUR_AUTHORIZATION_TOKEN'

 model_config = transformers.AutoConfig.from_pretrained(
     model_id,
     use_auth_token=auth_token
 )

 model = transformers.AutoModelForCausalLM.from_pretrained(
     model_id,
     trust_remote_code=True,
     config=model_config,
     quantization_config=quant_config,
     use_auth_token=auth_token
 )

 model.eval()
 print(f"Model loaded on {device}")
Paste your Hugging Face access token next to the auth_token = directive to replace YOUR_AUTHORIZATION_TOKEN.

The above code sets the model_id and enables 4-bit quantization with bitsandbytes. The model weights load in the compact 4-bit NF4 format, while computations run in 16-bit bfloat16. This greatly reduces GPU memory usage while keeping the output quality close to that of the unquantized model.

7.Click the play button on the top menu bar, or press CTRL + ENTER to run the code and initialize the model.

When successful, the code prints the device it runs on and confirms that the model downloaded successfully. The download process may take about 30 minutes to complete.
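
Optionally, before moving on, check how much memory the quantized weights occupy. This is a short sketch using the get_memory_footprint() method that Hugging Face transformers models provide; the exact figure varies with the model and library version.

 # Report the approximate memory used by the loaded (quantized) model weights.
 footprint_gb = model.get_memory_footprint() / 1024 ** 3
 print(f"Model memory footprint: {footprint_gb:.1f} GB")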

8.In a new code cell, initialize the tokenizer.

 tokenizer = transformers.AutoTokenizer.from_pretrained(
     model_id,
     use_auth_token=auth_token
 )
The above code sets the tokenizer to model_id. Every LLM has a different tokenizer that converts text streams into smaller units (tokens) that the language model can understand and interpret, as the optional example below shows.
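
The following optional cell shows how the tokenizer splits a sample sentence into tokens and token IDs; the exact tokens depend on the Llama 2 vocabulary, and the sample sentence is only illustrative.

 # Inspect how the Llama 2 tokenizer breaks a sentence into tokens.
 sample = 'Large language models process text as tokens.'

 tokens = tokenizer.tokenize(sample)       # sub-word token strings
 token_ids = tokenizer.encode(sample)      # numeric IDs fed to the model

 print(tokens)
 print(token_ids)
 print(tokenizer.decode(token_ids))        # round-trip back to text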

9.Initialize the pipeline.

pipe = transformers.pipeline(
     model=model,
     tokenizer=tokenizer,
     task='text-generation',
     temperature=0.0,
     max_new_tokens=50, 
     repetition_penalty=1.1
 )
The above code initializes the text-generation pipeline, through which you can control the kind of response the model generates. To tune the output, the pipeline accepts additional parameters, as shown in the optional sketch below.
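
For instance, the optional sketch below enables sampling instead of greedy decoding; the parameter values are illustrative, not prescriptive.

 # An alternative pipeline with sampling enabled for more varied output.
 sampled_pipe = transformers.pipeline(
     model=model,
     tokenizer=tokenizer,
     task='text-generation',
     do_sample=True,          # sample tokens instead of greedy decoding
     temperature=0.7,         # higher values produce more varied text
     top_p=0.9,               # nucleus sampling threshold
     max_new_tokens=50,
     repetition_penalty=1.1
 )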

10.Run the following code to add a text prompt to the pipeline. Replace Hello World with your desired prompt.

 result = pipe('Hello World')[0]['generated_text']
 print(result)
The above code block generates output based on the input prompt. Generating a response can take up to 5 minutes.

11.Verify the GPU usage statistics.

!nvidia-smi
Output:

 +-----------------------------------------------------------------------------+
 |  Processes:                                                                 |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
 |        ID   ID                                                   Usage      |
 |=============================================================================|
 |    0    0    0      35554      C   /usr/bin/python3                37666MiB |
 +-----------------------------------------------------------------------------+
As displayed in the above output, the Llama-2-70b-hf model uses about 37.6 GB of GPU memory when executed with 4-bit quantization. In full precision, the model's VRAM consumption is much higher.
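
To put this figure in context, below is a rough back-of-the-envelope estimate of weight-only memory for a 70-billion-parameter model at different precisions. Real usage also includes activation buffers and quantization overhead, so treat these numbers as approximate.

 # Approximate weight-only memory for a 70B-parameter model.
 params = 70e9

 print(f"16-bit (2 bytes/param):   ~{params * 2 / 1024 ** 3:.0f} GB")
 print(f"4-bit  (0.5 bytes/param): ~{params * 0.5 / 1024 ** 3:.0f} GB")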

Run the Llama 2 70B Chat Model
In this section, initialize the fine-tuned Llama-2-70b-chat-hf model with 4-bit quantization and 16-bit compute precision as described in the following steps.

1.On the main menu bar, click Kernel, and select Restart and Clear Outputs of All Cells to free up the GPU memory.

(https://pix.cobrasoft.org/images/2024/01/05/lpj1FyM.png)
2.Click File, select the New dropdown, and create a new Notebook.

3.Rename the notebook to Llama-2-70b-chat-hf.

4.Initialize the Llama-2-70b-chat-hf model. Replace AUTHORIZATION_TOKEN with your Hugging Face access token on the auth_token = directive.

 from torch import cuda, bfloat16
 import transformers

 model_id = 'meta-llama/Llama-2-70b-chat-hf'

 device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

 quant_config = transformers.BitsAndBytesConfig(
     load_in_4bit=True,
     bnb_4bit_quant_type='nf4',
     bnb_4bit_use_double_quant=True,
     bnb_4bit_compute_dtype=bfloat16
 )

 auth_token = 'YOUR_AUTHORIZATION_TOKEN'

 model_config = transformers.AutoConfig.from_pretrained(
     model_id,
     use_auth_token=auth_token
 )

 model = transformers.AutoModelForCausalLM.from_pretrained(
     model_id,
     trust_remote_code=True,
     config=model_config,
     quantization_config=quant_config,
     use_auth_token=auth_token
 )

 model.eval()
 print(f"Model loaded on {device}")
The above code uses the fine-tuned chat model Llama-2-70b-chat-hf and your access token to access the model.

5.Click the play button, or press CTRL + ENTER to execute the code.

6.Initialize the tokenizer.

tokenizer = transformers.AutoTokenizer.from_pretrained(
     model_id,
     use_auth_token=auth_token
 )
7.Initialize the pipeline.

pipe = transformers.pipeline(
     model=model,
     tokenizer=tokenizer,
     task='text-generation',
     temperature=0.0,
     max_new_tokens=50, 
     repetition_penalty=1.1
 )
8.Add a text prompt to the pipeline. Replace Hello World with your desired prompt.

 result = pipe('Hello World')[0]['generated_text']
 print(result)
In the chat model, enter the prompt in a dialogue format to see how the fine-tuned version's responses differ from the base model's, as in the example below.
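
For example, the optional sketch below wraps a question in Meta's published [INST] and <<SYS>> chat template; adjust the system message and question to suit your use case.

 # A dialogue-style prompt using the Llama 2 chat template.
 prompt = (
     "<s>[INST] <<SYS>>\n"
     "You are a helpful assistant that answers concisely.\n"
     "<</SYS>>\n\n"
     "What is a large language model? [/INST]"
 )

 result = pipe(prompt)[0]['generated_text']
 print(result)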

9.Verify the GPU usage statistics.

!nvidia-smi
Output:

+-----------------------------------------------------------------------------+
 | Processes:                                                                  |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
 |        ID   ID                                                   Usage      |
 |=============================================================================|
 |    0    0    0      36099      C   /usr/bin/python3                37666MiB |
 +-----------------------------------------------------------------------------+
As displayed in the above output, the Llama-2-70b-chat-hf model uses up to 37.6 GB of VRAM when executed with 4-bit quantization. The VRAM consumption of the base and fine-tuned models is similar because both share the same 70 billion parameters.

Llama 2 Model Weights
Llama 2 parameters range from 7 billion to 70 billion, and each model size has a fine-tuned chat version. Models with a lower parameter count consume less GPU memory and are suitable for testing inference with fewer resources, at the cost of output quality.

The following model options are available for Llama 2:

 Llama-2-7b-hf and Llama-2-7b-chat-hf
 Llama-2-13b-hf and Llama-2-13b-chat-hf
 Llama-2-70b-hf and Llama-2-70b-chat-hf

The above models are open source and commercially licensed; you can use them for research and commercial purposes.
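
To try one of the smaller checkpoints, only the model_id changes; the quantization configuration, tokenizer, and pipeline code from the earlier sections stay the same. For example:

 # Swap in a smaller fine-tuned checkpoint; the rest of the code is unchanged.
 model_id = 'meta-llama/Llama-2-13b-chat-hf'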

Llama 2 improvements over Llama 1
Common Declarations
Conclusion
In this article, you used Meta Llama 2 models on a Vultr Cloud GPU Server, and ran the latest Llama 2 70b model together with its fine-tuned chat version in 4-bit mode. Below are the VRAM usage statistics for Llama 2 models with a 4-bit quantized configuration on an A100 Vultr Cloud GPU with 80 GB of GPU RAM.

GPU Stats