In this latest video, I describe how a (good) multi-modal image+text model can be used to describe videos and chat about them.
In brief:
1️⃣ Set up an IDEFICS 2 endpoint
2️⃣ Clip your video into images (~1 second apart)
3️⃣ Send the images + your text queries to the endpoint (see the sketch below).
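Here's a minimal sketch of steps 2 and 3 in Python. The endpoint URL and payload shape are assumptions (placeholders) - adapt them to however your IDEFICS 2 endpoint is deployed:

```python
# Sketch: sample frames ~1 second apart and send them, plus a text query,
# to a multi-modal endpoint. ENDPOINT_URL and the payload format are
# placeholders, not the exact IDEFICS 2 API.
import base64
import cv2  # pip install opencv-python
import requests

ENDPOINT_URL = "https://your-idefics2-endpoint.example.com/generate"  # placeholder

def sample_frames(video_path: str, seconds_between_frames: float = 1.0) -> list[bytes]:
    """Return JPEG-encoded frames sampled roughly one second apart."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0          # fall back if FPS is unknown
    step = max(1, int(round(fps * seconds_between_frames)))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                        # keep ~1 frame per second
            ok_jpg, jpg = cv2.imencode(".jpg", frame)
            if ok_jpg:
                frames.append(jpg.tobytes())
        index += 1
    cap.release()
    return frames

def query_endpoint(frames: list[bytes], question: str) -> str:
    """Send base64-encoded frames plus a text query; payload shape is illustrative."""
    payload = {
        "question": question,
        "images": [base64.b64encode(f).decode() for f in frames],
    }
    response = requests.post(ENDPOINT_URL, json=payload, timeout=120)
    response.raise_for_status()
    return response.json().get("answer", "")

if __name__ == "__main__":
    frames = sample_frames("my_video.mp4")
    print(query_endpoint(frames, "What happens in this video?"))
```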
You can take things further - and improve performance on your video use cases - by generating a synthetic video dataset. Here's how to create a row of data (a sketch follows the steps):
1️⃣ Clip a video into images (~1 second apart)
2️⃣ Get a description of each image (one at a time) from a multi-modal language model
3️⃣ Feed all image descriptions (in one prompt) into a strong language model and ask it to provide a "video description"
BONUS: Additionally feed in the video's captions - for added context.
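Here's a rough sketch of one dataset row being built. `describe_image` and `chat_llm` are placeholders for calls to your multi-modal endpoint and a strong text-only language model - swap in whatever clients you actually use:

```python
# Sketch: build one synthetic "video description" dataset row from frame
# descriptions. The two helpers below are placeholders (assumptions), not a
# specific library API.
import json

def describe_image(jpeg_bytes: bytes) -> str:
    """Placeholder: call your multi-modal endpoint for a single-image description."""
    raise NotImplementedError

def chat_llm(prompt: str) -> str:
    """Placeholder: call a strong text-only language model."""
    raise NotImplementedError

def make_dataset_row(frames: list[bytes], captions: str | None = None) -> dict:
    # Step 2: describe each frame, one at a time.
    frame_descriptions = [describe_image(f) for f in frames]

    # Step 3: feed all descriptions (and, optionally, the video's captions)
    # into one prompt and ask for an overall video description.
    prompt = "Here are descriptions of video frames, ~1 second apart:\n"
    prompt += "\n".join(f"{i}. {d}" for i, d in enumerate(frame_descriptions, 1))
    if captions:
        prompt += f"\n\nVideo captions for added context:\n{captions}"
    prompt += "\n\nWrite a single coherent description of the video."

    return {
        "frame_descriptions": frame_descriptions,
        "captions": captions,
        "video_description": chat_llm(prompt),
    }

# Example usage (assuming frames were sampled as in the earlier sketch):
# row = make_dataset_row(frames, captions="...")
# print(json.dumps(row, indent=2))
```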
With this type of dataset in hand, you can fine-tune a multi-modal model for your video applications.
That’s it for this week, Ronan
Ronan McGovern, Trelis Research