LIVESTREAM Note - This week’s livestream will be at 4 pm Irish time on Thursday 9th May 2024 (a change from the usual slot). Get notified or drop a question in advance, right here. I’ll be covering how to build a transformer from scratch.
IDEFICS 2 and LLaVA Llama 3
This week, I go through the full process of fine-tuning the IDEFICS 2 multimodal (text+image) model and deploying it for inference using TGI (Text Generation Inference) and RunPod.
Some key topics covered:
- Loading the model and a custom dataset
- Fine-tuning the model using LoRA (low-rank adaptation); see the sketch after this list
- Deploying an inference endpoint that can handle multiple image inputs
- Comparing performance to the LLaVA Llama 3 model
- VRAM requirements for training
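As a rough guide to the first two steps (loading the model and attaching a LoRA adapter), here is a minimal sketch using Transformers and PEFT. The checkpoint name, quantisation settings and LoRA target_modules are illustrative assumptions; adjust them to your own setup.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "HuggingFaceM4/idefics2-8b"  # assumed Hub checkpoint name

# Optional 4-bit quantisation to keep training VRAM down
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA: train small low-rank adapters rather than all of the base weights.
# The target_modules list is an illustrative assumption - adjust it to the
# attention/MLP projection names you actually want to adapt.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

From there, a custom dataset mainly needs a collator that runs each example’s images and conversation text through the processor before handing batches to the trainer.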
IDEFICS 2 combines…
1) A vision encoder (a SigLIP-type model)
2) A text model (a 7B-parameter Mistral)
3) A multi-layer (rather than single-layer) connector between the vision and text components
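If you load the model as above, you can see these three pieces directly in the module tree. This is just an inspection sketch; the exact submodule names depend on your Transformers version.

```python
# Walk the first two levels of the module tree to see the vision encoder,
# connector and text model that make up IDEFICS 2. (If the model has been
# wrapped with PEFT, unwrap it first.)
base = model.get_base_model() if hasattr(model, "get_base_model") else model
for name, child in base.named_children():
    print(name, type(child).__name__)
    for sub_name, sub_child in child.named_children():
        print(" ", sub_name, type(sub_child).__name__)
```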
The model performs similarly to models that use a 34B text model, despite using only a 7B Mistral base. It also incorporates OCR pre-training to read text within images.
Fine-tuning and deploying IDEFICS 2 is made easier by its tight integration with the HuggingFace Transformers library and its support for TGI serving. Another beautiful feature is the ability to run inference or fine-tune on inputs containing multiple images.
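Once a TGI endpoint is up (for example on a RunPod GPU pod), you can hit it through TGI’s OpenAI-compatible chat API and pass several images in one request. The endpoint URL and image URLs below are placeholders; this is a minimal sketch rather than the exact setup from the video.

```python
from openai import OpenAI

# Placeholder endpoint: point this at your own TGI deployment, e.g. a RunPod
# pod running text-generation-inference with an IDEFICS 2 checkpoint.
client = OpenAI(base_url="https://YOUR-POD-ID-8080.proxy.runpod.net/v1", api_key="-")

response = client.chat.completions.create(
    model="tgi",  # TGI serves a single model, so the name here is not used for routing
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/chart1.png"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart2.png"}},
                {"type": "text", "text": "Compare these two charts and summarise the key differences."},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

The two image_url entries in a single user turn are what exercise the multi-image support mentioned above.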
That’s it for this week, cheers, Ronan
Ronan McGovern, Trelis Research