Cobra Forum


Title: AI Image Captioning With BLIP-2 on Vultr Cloud GPU
Post by: mahesh on Dec 28, 2023, 05:54 AM
Question:
AI Image Captioning With BLIP-2 on Vultr Cloud GPU
(https://pix.cobrasoft.org/images/2023/12/28/Gj7SJZo.png)
Introduction
Bootstrapping Language-Image Pre-training (BLIP-2) is a pre-training framework that leverages frozen pre-trained vision models and large language models (LLMs) for zero-shot image-to-text generation. It delivers strong results across a wide range of vision-language tasks. BLIP-2 combines three models, an image encoder, a Querying Transformer (Q-Former), and a large language model, which together allow it to perform tasks such as image captioning, visual question answering (VQA), and chat-like conversations based on previous context.

This article explains how to carry out AI image captioning with BLIP-2 on a Vultr Cloud GPU server. You use the BLIP-2 model to perform zero-shot image-to-text generation tasks on an imported image.

Prerequisites
Before you begin:

- Deploy a Vultr Cloud GPU server.
- Using SSH, access the server and switch to a non-root user account with sudo privileges:

  # su example_user
Set Up the Server
In this section, set up the server to run the BLIP-2 model with all necessary dependency packages as described in the steps below.

1. Install PyTorch

$ pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu118
The above command installs PyTorch with pre-built CUDA 11.8 libraries. To use the latest version, refer to the PyTorch Documentation.
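
Optionally, you can confirm that the CUDA-enabled build installed correctly with a short check in a python3 shell. This is a verification sketch; it assumes the server's NVIDIA GPU and driver are already in place.

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version the wheel was built against, for example 11.8
print(torch.cuda.is_available())  # True when a compatible NVIDIA GPU and driver are detected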

2. Install Jupyter Notebook

$ pip3 install notebook
3. By default, UFW is active on Vultr servers. Therefore, allow the Jupyter Notebook port 8888 through the firewall to accept connections

$ sudo ufw allow 8888
4. Restart the firewall to apply changes

$ sudo ufw reload
5. Start Jupyter Notebook

$ jupyter notebook --ip=0.0.0.0
The above command starts a Jupyter Notebook instance that listens for incoming connections on all server interfaces. If the command returns an error, exit your SSH session and reconnect so that the newly installed Jupyter binaries are available in your session.

6. Using a web browser such as Chrome, access Jupyter Notebook with the token generated in your command output

http://SERVER-IP:8888/tree?token=XXXXXX
7. Within the Jupyter Notebook interface, click New, select Notebook, and create a new file with the Python 3 kernel to start working on the model

(https://pix.cobrasoft.org/images/2023/12/28/Screenshot-110.png)
Set Up the Model
In this section, use Jupyter Notebook to import the required model libraries, load the pre-trained or fine-tuned BLIP-2 captioning model, and run it on the server as described in the steps below.

1. Install the salesforce-lavis package

!pip3 install salesforce-lavis
LAVIS is a Python deep learning library used for Language-and-Vision research and applications in tasks like retrieval, captioning, visual question answering, and multi-modal classification. It's used along with BLIP-2 for Visual Question Answering (VQA) related tasks.

2. Upgrade Jupyter Notebook and ipywidgets

!pip3 install --upgrade jupyter ipywidgets
3. Import the required libraries

 import torch
 from PIL import Image
 import requests
 from lavis.models import load_model_and_preprocess
Below is what the libraries do:

- torch: provides tensor operations and GPU (CUDA) support for running the model.
- Image (from PIL): opens and converts the input image.
- requests: downloads the image from its URL.
- load_model_and_preprocess: loads a BLIP-2 model together with its image preprocessors from the LAVIS library.

4. Import the base image. Replace https://example.com/image.jpg with your actual image URL

img_url = 'https://example.com/image.jpg'
5. Process the image

raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
Below is what the function does:

- requests.get(img_url, stream=True).raw streams the image from the URL and exposes the raw response content.
- Image.open() reads that content into a PIL image object.
- convert('RGB') converts the image to the RGB color mode expected by the model.

To view the generated RGB image, run the following command:

display(raw_image)
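
If your image is stored on the server rather than at a URL, you can load it from disk instead of using requests. This is an alternative sketch; the file path below is only a placeholder.

# Alternative: load an image stored on the server instead of fetching it from a URL
raw_image = Image.open("./my_image.jpg").convert('RGB')  # placeholder path, replace with your own file
display(raw_image)
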
6. Move the computations to the GPU memory

device = torch.device("cuda")
The above command creates a torch.device object that represents the CUDA device. CUDA allows you to use NVIDIA GPUs to speed up computations in machine learning and other tasks.
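
If you later reuse this notebook on a machine without a GPU, a common pattern is to select the device conditionally. This is an optional sketch; on a Vultr Cloud GPU server the command above is sufficient.

# Optional: fall back to the CPU when no CUDA device is detected
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)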

7. Load the pre-trained BLIP-2 model

model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5",
    model_type="caption_coco_flant5xl",
    is_eval=True,
    device=device
)
Below is what the function does:

- name="blip2_t5" selects the BLIP-2 architecture that uses a FlanT5 language model.
- model_type="caption_coco_flant5xl" loads the FlanT5-XL variant fine-tuned for captioning on the COCO dataset.
- is_eval=True puts the model in evaluation (inference) mode.
- device=device places the model on the GPU selected earlier.
- The call returns the model, its visual preprocessors (vis_processors), and text preprocessors (ignored here with _).

Several other models can replace the above-used code block. These include the following:

pretrain_opt2.7b

model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

pretrain_opt6.7b

model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt6.7b", is_eval=True, device=device
)

caption_coco_opt2.7b

model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="caption_coco_opt2.7b", is_eval=True, device=device
)

caption_coco_opt6.7b

model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="caption_coco_opt6.7b", is_eval=True, device=device
)

pretrain_flant5xl

model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)
You can load and set up the model using any of the above variants.

You only need one model to generate output. Make sure only one model is loaded at a time.
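
If you want to try a different variant in the same notebook session, free the GPU memory held by the current model first. The sketch below assumes the model variable from step 7; restarting the Jupyter kernel achieves the same result.

import gc

# Release the currently loaded model before loading a different variant
del model
gc.collect()
torch.cuda.empty_cache()  # return cached GPU memory to the driver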

8. View the keys of the loaded visual processors

vis_processors.keys()
9. Prepare the image as input using the associated processors

image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
Below is what the code does:

- vis_processors["eval"](raw_image) applies the model's evaluation-time image preprocessing (resizing and normalization) and returns a tensor.
- unsqueeze(0) adds a batch dimension so the tensor has the shape the model expects.
- .to(device) moves the tensor to the GPU.
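
To confirm the preprocessing step, you can inspect the resulting tensor. The spatial dimensions depend on the image processor of the chosen model, so the shape in the comment is only indicative.

print(image.shape)   # for example torch.Size([1, 3, H, W]): a batch of one RGB image
print(image.device)  # confirms the tensor is on the CUDA device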

When the above steps are complete, the model is ready to generate captions for any given image, answer visual questions, and hold chat-like conversations.

Caption Generation
BLIP-2 allows two types of caption generation: Single Caption generation and Multiple Caption generation. In this section, generate captions for any given image as described in the steps below.

1. Single Caption: Generates one caption for an image. To view the generated caption for the imported image, run the following code

model.generate({"image": image})
Below is what the code does:

- model.generate() takes a dictionary whose "image" key holds the preprocessed image tensor.
- With the default settings it decodes deterministically (beam search) and returns a list containing a single caption string.
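
Because model.generate() returns a list of caption strings, you can keep the single caption in a variable for later use. A minimal sketch:

captions = model.generate({"image": image})
caption = captions[0]  # the list holds one string when num_captions is not set
print(caption)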

2. Multiple Caption: Generates multiple independent captions for an image

model.generate({"image": image}, use_nucleus_sampling=True, num_captions=3)
Below is what the code does:

- use_nucleus_sampling=True enables nucleus (top-p) sampling, so each caption is sampled independently and can differ from the others.
- num_captions=3 requests three captions, returned as a list of strings.
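
The multiple-caption call also returns a list of strings, so you can print each caption on its own line, as in this short sketch:

# Print each independently sampled caption on its own line
captions = model.generate({"image": image}, use_nucleus_sampling=True, num_captions=3)
for i, caption in enumerate(captions, start=1):
    print(f"Caption {i}: {caption}")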

Visual Question Answering (VQA)
Zero-Shot Vision-to-Language Generation refers to the ability of a model to generate captions or descriptions for images it has never seen during training, meaning the model can understand the content of a new image. In this section, ask the model a question based on the image as described below.

To ask a specific question, run the following command with a question prompt

model.generate({
    "image": image,
    "prompt": "Question: YOUR_QUESTION_HERE? Answer:"})
In the above code, "image": image specifies the input image you want to generate text about, and the prompt produces an answer to the asked question. Some example questions include How many dogs are there in the picture?, Which city is this?, and Where is this monument located?, among others.

It's important to note that the model is fine-tuned on keywords like Question. This means the generated descriptions or captions are more precise when they follow a prompt template. For example, declare a question using the Question: keyword to get a more precise response.
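
For instance, to ask one of the sample questions above about the imported image, fill the prompt template as in the following sketch. The question text is only an illustration; replace it with one that matches your image.

# Illustrative question; replace it with one that matches your image
answer = model.generate({
    "image": image,
    "prompt": "Question: Which city is this? Answer:"
})
print(answer)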

Context-Based Visual Question Answering for Chat-Like Conversations
The BLIP-2 model is capable of answering more than one question about the same image by using the context of previous questions and answers. To generate answers based on a specific context, run the following code:

context = [
    ("PREVIOUS_QUESTION1?", "PREVIOUS_ANSWER1"),
    ("PREVIOUS_QUESTION2?", "PREVIOUS_ANSWER2"),
]

question = "NEW_QUESTION_HERE?"
template = "Question: {} Answer: {}."
prompt = " ".join([template.format(context[i][0], context[i][1]) for i in range(len(context))]) + " Question: " + question + " Answer:"
print(prompt)

model.generate(
    {
    "image": image,
    "prompt": prompt
    },
    use_nucleus_sampling=False,
)
Below is what the code does:

- context holds previous question-and-answer pairs about the image.
- template formats each pair as Question: ... Answer: ...
- prompt joins the formatted pairs and appends the new question, so the model sees the full conversation history.
- model.generate() answers the new question using both the image and the conversational context embedded in the prompt; use_nucleus_sampling=False keeps the decoding deterministic.
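
To continue the conversation, you can append each generated answer back into the context list and rebuild the prompt for the next question. This is an illustrative sketch that reuses the variables defined above; the follow-up question placeholder is hypothetical.

# Append the answer to the context, then ask a follow-up question
answer = model.generate({"image": image, "prompt": prompt}, use_nucleus_sampling=False)[0]
context.append((question, answer))

follow_up = "ANOTHER_QUESTION_HERE?"
prompt = " ".join(template.format(q, a) for q, a in context) + " Question: " + follow_up + " Answer:"
print(model.generate({"image": image, "prompt": prompt}, use_nucleus_sampling=False)[0])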

Conclusion
In this article, you implemented AI image captioning with the BLIP-2 model on a Vultr Cloud GPU server. You prepared the server, installed libraries, and executed the model functions to generate output based on the input image. Additionally, you explored various use cases such as image captioning, VQA, and chat-like conversations based on context. For more information about the model, visit the BLIP-2 Hugging Face Space.