Personalized Multimodal Understanding with RC-MLLM

The RC-MLLM model is built on Qwen2-VL through a novel method called RCVIT (Region-level Context-aware Visual Instruction Tuning) and is trained on the specially constructed RCMU dataset. Its core capability is Region-level Context-aware Multimodal Understanding (RCMU): it simultaneously understands the visual content of specific regions/objects within an image and the textual information associated with them (referenced via bounding box coordinates), allowing it to respond to user instructions in a more context-aware manner. Simply put, RC-MLLM not only understands images but also integrates the textual information linked to specific objects within them. It achieves outstanding performance on RCMU tasks and is well suited to applications such as personalized conversation.
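
Since RC-MLLM is built on Qwen2-VL, a query can be sketched with the standard Hugging Face `transformers` Qwen2-VL API. This is a minimal sketch under stated assumptions: the checkpoint name `RC-MLLM/rc-mllm-7b` and the `<object> name (info) [x1, y1, x2, y2] </object>` convention for pairing bounding boxes with personalized text are illustrative, not the model's documented interface.

```python
# Minimal sketch of a region-level context-aware query, assuming a Qwen2-VL-style API.
# The checkpoint id and the <object> ... </object> prompt convention are hypothetical.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper from the Qwen2-VL examples

model_id = "RC-MLLM/rc-mllm-7b"  # hypothetical checkpoint name
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Region-level context: bounding box coordinates paired with personalized text.
region_context = (
    "<object> Alice (my colleague, likes hiking) [120, 80, 310, 420] </object>\n"
    "<object> Rex (Alice's golden retriever) [330, 200, 560, 430] </object>\n"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "photo.jpg"},
        {"type": "text", "text": region_context + "Who is standing next to Rex?"},
    ],
}]

# Standard Qwen2-VL preprocessing and generation.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```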

📌 First build a multimodal personalized knowledge base, then perform personalized multimodal understanding with RC-MLLM

1. Build Multimodal Personalized Knowledge Base
📖 Upload images, click on people or objects in the images and fill in their personalized information, then save the entries to build a multimodal personalized knowledge base (see the sketch after the notes below).

Object Image or Face Image: select the type of image to upload. Multiple images per instance are supported.

Click on people or objects in the image to get a mask

Examples for information upload (columns: Object Image or Face Image, Current Image, Upload Images, Input Personalized Information).
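
What gets saved in this step can be pictured as a list of entries, one per person or object, each holding its images, the bounding box derived from the clicked mask, and the personalized text. The `KBEntry` schema, the `mask_to_bbox` helper, and the `personal_kb.json` file below are hypothetical; the demo's actual storage format is not specified here.

```python
# Sketch of one knowledge-base entry; field names and the JSON file are assumptions.
import json
from dataclasses import asdict, dataclass, field

import numpy as np


@dataclass
class KBEntry:
    name: str                                              # e.g. "Alice" or "Rex"
    image_type: str                                        # "face" or "object"
    image_paths: list[str] = field(default_factory=list)   # multiple images per instance
    bbox: tuple[int, int, int, int] | None = None          # x1, y1, x2, y2 from the clicked mask
    info: str = ""                                         # personalized textual information


def mask_to_bbox(mask: np.ndarray) -> tuple[int, int, int, int]:
    """Turn a binary segmentation mask (H x W) into a tight bounding box."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


# Save the knowledge base so step 2 can look entries up by name.
kb = [
    KBEntry(
        name="Alice",
        image_type="face",
        image_paths=["alice_1.jpg", "alice_2.jpg"],
        bbox=(120, 80, 310, 420),
        info="My colleague; likes hiking.",
    ),
]
with open("personal_kb.json", "w", encoding="utf-8") as f:
    json.dump([asdict(e) for e in kb], f, ensure_ascii=False, indent=2)
```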

2. Personalized Multimodal Understanding with RC-MLLM
📖 Upload images and use the RC-MLLM model for personalized Q&A

Examples for visual question answering (columns: Input Image, Question).
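
Answering a personalized question then amounts to looking up the relevant knowledge-base entries, formatting the objects present in the query image as region-level context, and prepending that context to the question before sending it to RC-MLLM, as in the loading sketch above. `build_region_context` and the `detected` mapping are hypothetical helpers, and the `<object>` tag format is the same illustrative assumption as before.

```python
# Sketch of assembling region-level context for a personalized question.
# `detected` maps an entry name to its bounding box in the *current* image
# (e.g. from a click or a detector); this helper and the tag format are assumptions.
import json


def build_region_context(kb_path: str, detected: dict[str, tuple[int, int, int, int]]) -> str:
    with open(kb_path, encoding="utf-8") as f:
        kb = {entry["name"]: entry for entry in json.load(f)}
    lines = []
    for name, bbox in detected.items():
        info = kb[name]["info"]
        lines.append(f"<object> {name} ({info}) {list(bbox)} </object>")
    return "\n".join(lines)


question = "What does the person in this photo like to do on weekends?"
context = build_region_context("personal_kb.json", {"Alice": (120, 80, 310, 420)})
prompt = context + "\n" + question
# `prompt` plus the query image is then passed to RC-MLLM exactly as in the
# Qwen2-VL-style generation sketch above.
```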
