In this latest video, I describe how a (good) multi-modal image+text model can be used to describe videos and chat about them.
In brief:
1️⃣ Set up an IDEFICS 2 endpoint
2️⃣ Clip your video into images (~1 second apart)
3️⃣ Send the images + your text queries to the endpoint (see the sketch below).
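Here's a minimal sketch of steps 2 and 3 in Python. The endpoint URL and payload shape are assumptions (placeholders) - adapt them to however your IDEFICS 2 endpoint is deployed:

```python
# Sketch: sample frames ~1 second apart and send them, plus a text query,
# to a multi-modal endpoint. ENDPOINT_URL and the payload format are
# placeholders, not the exact IDEFICS 2 API.
import base64
import cv2  # pip install opencv-python
import requests

ENDPOINT_URL = "https://your-idefics2-endpoint.example.com/generate"  # placeholder

def sample_frames(video_path: str, seconds_between_frames: float = 1.0) -> list[bytes]:
    """Return JPEG-encoded frames sampled roughly one second apart."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0          # fall back if FPS is unknown
    step = max(1, int(round(fps * seconds_between_frames)))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                        # keep ~1 frame per second
            ok_jpg, jpg = cv2.imencode(".jpg", frame)
            if ok_jpg:
                frames.append(jpg.tobytes())
        index += 1
    cap.release()
    return frames

def query_endpoint(frames: list[bytes], question: str) -> str:
    """Send base64-encoded frames plus a text query; payload shape is illustrative."""
    payload = {
        "question": question,
        "images": [base64.b64encode(f).decode() for f in frames],
    }
    response = requests.post(ENDPOINT_URL, json=payload, timeout=120)
    response.raise_for_status()
    return response.json().get("answer", "")

if __name__ == "__main__":
    frames = sample_frames("my_video.mp4")
    print(query_endpoint(frames, "What happens in this video?"))
```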
You can take things further - and improve performance on your video use cases - by generating a synthetic video dataset. Here's how to create a row of data (a sketch follows the steps):
1️⃣ Clip a video into images (~1 second apart)
2️⃣ Get a description of each image (one at a time) from a multi-modal language model
3️⃣ Feed all image descriptions (in one prompt) into a strong language model and ask it to provide a "video description"
BONUS: Additionally feed in the video's captions - for added context.
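Here's a rough sketch of one dataset row being built. `describe_image` and `chat_llm` are placeholders for calls to your multi-modal endpoint and a strong text-only language model - swap in whatever clients you actually use:

```python
# Sketch: build one synthetic "video description" dataset row from frame
# descriptions. The two helpers below are placeholders (assumptions), not a
# specific library API.
import json

def describe_image(jpeg_bytes: bytes) -> str:
    """Placeholder: call your multi-modal endpoint for a single-image description."""
    raise NotImplementedError

def chat_llm(prompt: str) -> str:
    """Placeholder: call a strong text-only language model."""
    raise NotImplementedError

def make_dataset_row(frames: list[bytes], captions: str | None = None) -> dict:
    # Step 2: describe each frame, one at a time.
    frame_descriptions = [describe_image(f) for f in frames]

    # Step 3: feed all descriptions (and, optionally, the video's captions)
    # into one prompt and ask for an overall video description.
    prompt = "Here are descriptions of video frames, ~1 second apart:\n"
    prompt += "\n".join(f"{i}. {d}" for i, d in enumerate(frame_descriptions, 1))
    if captions:
        prompt += f"\n\nVideo captions for added context:\n{captions}"
    prompt += "\n\nWrite a single coherent description of the video."

    return {
        "frame_descriptions": frame_descriptions,
        "captions": captions,
        "video_description": chat_llm(prompt),
    }

# Example usage (assuming frames were sampled as in the earlier sketch):
# row = make_dataset_row(frames, captions="...")
# print(json.dumps(row, indent=2))
```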
With this type of dataset in hand, you can fine-tune a multi-modal model for your video applications.
That’s it for this week, Ronan
Ronan McGovern, Trelis Research