Cobra Forum


Title: AI Image Captioning With BLIP-2 on Vultr Cloud GPU
Post by: mahesh on Dec 28, 2023, 05:54 AM
Question:
AI Image Captioning With BLIP-2 on Vultr Cloud GPU
(https://pix.cobrasoft.org/images/2023/12/28/Gj7SJZo.png)
Introduction
Bootstrapping Language-Image Pre-training (BLIP-2) is a pre-training framework that leverages frozen pre-trained vision models and large language models (LLMs) for zero-shot image-to-text generation. It delivers strong results across a wide range of vision-language tasks. BLIP-2 combines three models, an image encoder, a Querying Transformer (Q-Former), and a large language model, which together allow it to perform tasks such as image captioning, visual question answering (VQA), and chat-like conversations based on previous context.

This article explains how to carry out AI image captioning with BLIP-2 on a Vultr Cloud GPU server. You use the BLIP-2 model to perform zero-shot image-to-text generation tasks on an imported image.

Prerequisites
Before you begin:

- Deploy a Vultr Cloud GPU server.
- Using SSH, access the server and switch to a non-root user account with sudo privileges:

  # su example_user
Set Up the Server
In this section, set up the server to run the BLIP-2 model with all necessary dependency packages as described in the steps below.

1. Install PyTorch

$ pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu118
The above command installs PyTorch with pre-built CUDA 11.8 libraries. To use the latest version, refer to the PyTorch Documentation.
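
Optionally, you can confirm that the CUDA-enabled build installed correctly with a short check in a python3 shell. This is a verification sketch; it assumes the server's NVIDIA GPU and driver are already in place.

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version the wheel was built against, for example 11.8
print(torch.cuda.is_available())  # True when a compatible NVIDIA GPU and driver are detected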

2. Install Jupyter Notebook

$ pip3 install notebook
3. By default, UFW is active on Vultr servers. Therefore, allow the Jupyter Notebook port 8888 through the firewall to accept connections

$ sudo ufw allow 8888
4. Restart the firewall to apply changes

$ sudo ufw reload
5. Start Jupyter Notebook

$ jupyter notebook --ip=0.0.0.0
The above command starts a Jupyter Notebook instance that listens for incoming connections on all server interfaces. If the command returns an error, exit your SSH session and reconnect so that the newly installed Jupyter binaries are available in your session.

6. Using a web browser such as Chrome, access Jupyter Notebook with the token generated in your command output

http://SERVER-IP:8888/tree?token=XXXXXX
7. Within the Jupyter Notebook interface, click New, select Notebook, and create a new file with the Python 3 kernel to start working on the model

(https://pix.cobrasoft.org/images/2023/12/28/Screenshot-110.png)
Set Up the Model
In this section, use Jupyter Notebook to import the required model libraries, load the pre-trained or fine-tuned BLIP-2 captioning model, and run it on the server as described in the steps below.

1. Install the salesforce-lavis package

!pip3 install salesforce-lavis
LAVIS is a Python deep learning library used for Language-and-Vision research and applications in tasks like retrieval, captioning, visual question answering, and multi-modal classification. It's used along with BLIP-2 for Visual Question Answering (VQA) related tasks.

2. Upgrade Jupyter Notebook and ipywidgets

!pip3 install --upgrade jupyter ipywidgets
3. Import the required libraries

 import torch
 from PIL import Image
 import requests
 from lavis.models import load_model_and_preprocess
Below is what the libraries do:

- torch: provides tensor operations and GPU (CUDA) support for running the model.
- Image (from PIL): opens and converts the input image.
- requests: downloads the image from its URL.
- load_model_and_preprocess: loads a BLIP-2 model together with its image preprocessors from the LAVIS library.

4. Import the base image. Replace https://example.com/image.jpg with your actual image URL

img_url = 'https://example.com/image.jpg'
5. Process the image

raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
Below is what the function does:

- requests.get(img_url, stream=True).raw streams the image from the URL and exposes the raw response content.
- Image.open() reads that content into a PIL image object.
- convert('RGB') converts the image to the RGB color mode expected by the model.

To view the generated RGB image, run the following command:

display(raw_image)
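
If your image is stored on the server rather than at a URL, you can load it from disk instead of using requests. This is an alternative sketch; the file path below is only a placeholder.

# Alternative: load an image stored on the server instead of fetching it from a URL
raw_image = Image.open("./my_image.jpg").convert('RGB')  # placeholder path, replace with your own file
display(raw_image)
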
6. Move the computations to the GPU memory

device = torch.device("cuda")
The above command creates a torch.device object that represents the CUDA device. CUDA allows you to use NVIDIA GPUs to speed up computations in machine learning and other tasks.
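
If you later reuse this notebook on a machine without a GPU, a common pattern is to select the device conditionally. This is an optional sketch; on a Vultr Cloud GPU server the command above is sufficient.

# Optional: fall back to the CPU when no CUDA device is detected
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)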

7. Load the pre-trained BLIP-2 model

model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5",
    model_type="caption_coco_flant5xl",
    is_eval=True,
    device=device
)
Below is what the function does:

- name="blip2_t5" selects the BLIP-2 architecture that uses a FlanT5 language model.
- model_type="caption_coco_flant5xl" loads the FlanT5-XL variant fine-tuned for captioning on the COCO dataset.
- is_eval=True puts the model in evaluation (inference) mode.
- device=device places the model on the GPU selected earlier.
- The call returns the model, its visual preprocessors (vis_processors), and text preprocessors (ignored here with _).

Several other models can replace the above-used code block. These include the following:

pretrain_opt2.7b

model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

pretrain_opt6.7b

model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt6.7b", is_eval=True, device=device
)

caption_coco_opt2.7b

model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="caption_coco_opt2.7b", is_eval=True, device=device
)

caption_coco_opt6.7b

model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="caption_coco_opt6.7b", is_eval=True, device=device
)

pretrain_flant5xl

model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)
You can load and set up the model using any of the above variants.

You only need one model to generate output. Make sure only one model is loaded at a time.
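
If you want to try a different variant in the same notebook session, free the GPU memory held by the current model first. The sketch below assumes the model variable from step 7; restarting the Jupyter kernel achieves the same result.

import gc

# Release the currently loaded model before loading a different variant
del model
gc.collect()
torch.cuda.empty_cache()  # return cached GPU memory to the driver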

8. View the keys of the loaded visual processors

vis_processors.keys()
9. Prepare the image as input using the associated processors

image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
Below is what the code does:

- vis_processors["eval"](raw_image) applies the model's evaluation-time image preprocessing (resizing and normalization) and returns a tensor.
- unsqueeze(0) adds a batch dimension so the tensor has the shape the model expects.
- .to(device) moves the tensor to the GPU.
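
To confirm the preprocessing step, you can inspect the resulting tensor. The spatial dimensions depend on the image processor of the chosen model, so the shape in the comment is only indicative.

print(image.shape)   # for example torch.Size([1, 3, H, W]): a batch of one RGB image
print(image.device)  # confirms the tensor is on the CUDA device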

When the above steps are complete, the model is ready to generate captions for any given image, answer visual questions, and hold chat-like conversations.

Caption Generation
BLIP-2 allows two types of caption generation: Single Caption generation and Multiple Caption generation. In this section, generate captions for any given image as described in the steps below.

1. Single Caption: Generates one caption for an image. To view the generated caption for the imported image, run the following code

model.generate({"image": image})
Below is what the code does:

- model.generate() takes a dictionary whose "image" key holds the preprocessed image tensor.
- With the default settings it decodes deterministically (beam search) and returns a list containing a single caption string.
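
Because model.generate() returns a list of caption strings, you can keep the single caption in a variable for later use. A minimal sketch:

captions = model.generate({"image": image})
caption = captions[0]  # the list holds one string when num_captions is not set
print(caption)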

2. Multiple Caption: Generates multiple independent captions for an image

model.generate({"image": image}, use_nucleus_sampling=True, num_captions=3)
Below is what the code does:

- use_nucleus_sampling=True enables nucleus (top-p) sampling, so each caption is sampled independently and can differ from the others.
- num_captions=3 requests three captions, returned as a list of strings.
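
The multiple-caption call also returns a list of strings, so you can print each caption on its own line, as in this short sketch:

# Print each independently sampled caption on its own line
captions = model.generate({"image": image}, use_nucleus_sampling=True, num_captions=3)
for i, caption in enumerate(captions, start=1):
    print(f"Caption {i}: {caption}")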

Visual Question Answering (VQA)
Zero-Shot Vision-to-Language Generation refers to the ability of a model to generate captions or descriptions for images it has never seen during training, meaning the model can understand the content of a new image. In this section, ask the model a question based on the image as described below.

To ask a specific question, run the following command with a question prompt

model.generate({
    "image": image,
    "prompt": "Question: YOUR_QUESTION_HERE? Answer:"})
In the above code, "image": image specifies the input image you want to generate text about, and the prompt produces an answer to the asked question. Some example questions include How many dogs are there in the picture?, Which city is this?, and Where is this monument located?, among others.

It's important to note that the model is fine-tuned on keywords like Question. This means the generated descriptions or captions are more precise when they follow a prompt template. For example, declare a question using the Question: keyword to get a more precise response.
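
For instance, to ask one of the sample questions above about the imported image, fill the prompt template as in the following sketch. The question text is only an illustration; replace it with one that matches your image.

# Illustrative question; replace it with one that matches your image
answer = model.generate({
    "image": image,
    "prompt": "Question: Which city is this? Answer:"
})
print(answer)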

Context-Based Visual Question Answering for Chat-Like Conversations
The BLIP-2 model is capable of answering more than one question about the same image by using the context of previous questions and answers. To generate answers based on a specific context, run the following code:

context = [
    ("PREVIOUS_QUESTION1?", "PREVIOUS_ANSWER1"),
    ("PREVIOUS_QUESTION2?", "PREVIOUS_ANSWER2"),
]

question = "NEW_QUESTION_HERE?"
template = "Question: {} Answer: {}."
prompt = " ".join([template.format(context[i][0], context[i][1]) for i in range(len(context))]) + " Question: " + question + " Answer:"
print(prompt)

model.generate(
    {
    "image": image,
    "prompt": prompt
    },
    use_nucleus_sampling=False,
)
Below is what the code does:

- context holds previous question-and-answer pairs about the image.
- template formats each pair as Question: ... Answer: ...
- prompt joins the formatted pairs and appends the new question, so the model sees the full conversation history.
- model.generate() answers the new question using both the image and the conversational context embedded in the prompt; use_nucleus_sampling=False keeps the decoding deterministic.
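
To continue the conversation, you can append each generated answer back into the context list and rebuild the prompt for the next question. This is an illustrative sketch that reuses the variables defined above; the follow-up question placeholder is hypothetical.

# Append the answer to the context, then ask a follow-up question
answer = model.generate({"image": image, "prompt": prompt}, use_nucleus_sampling=False)[0]
context.append((question, answer))

follow_up = "ANOTHER_QUESTION_HERE?"
prompt = " ".join(template.format(q, a) for q, a in context) + " Question: " + follow_up + " Answer:"
print(model.generate({"image": image, "prompt": prompt}, use_nucleus_sampling=False)[0])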

Conclusion
In this article, you implemented AI image captioning with the BLIP-2 model on a Vultr Cloud GPU server. You prepared the server, installed libraries, and executed the model functions to generate output based on the input image. Additionally, you explored various use cases such as image captioning, VQA, and chat-like conversations based on context. For more information about the model, visit the BLIP-2 Hugging Face Space.