LIVESTREAM Note - This week’s livestream will be at 4 pm Irish time on Thursday 9th May 2024 (a change from the usual slot). Get notified or drop a question in advance, right here. I’ll be covering how to build a transformer from scratch.
IDEFICS 2 and LLaVA Llama 3
This week, I go through the full process of fine-tuning the IDEFICS 2 multimodal (text+image) model and deploying it for inference using TGI (Text Generation Inference) and RunPod.
Some key topics covered:
- Loading the model and a custom dataset
- Fine-tuning the model using LoRA (low-rank adaptation); see the sketch after this list
- Deploying an inference endpoint that can handle multiple image inputs
- Comparing performance to the LLaVA Llama 3 model
- VRAM requirements for training
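As a rough guide to the first two steps (loading the model and attaching a LoRA adapter), here is a minimal sketch using Transformers and PEFT. The checkpoint name, quantisation settings and LoRA target_modules are illustrative assumptions; adjust them to your own setup.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "HuggingFaceM4/idefics2-8b"  # assumed Hub checkpoint name

# Optional 4-bit quantisation to keep training VRAM down
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA: train small low-rank adapters rather than all of the base weights.
# The target_modules list is an illustrative assumption - adjust it to the
# attention/MLP projection names you actually want to adapt.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

From there, a custom dataset mainly needs a collator that runs each example’s images and conversation text through the processor before handing batches to the trainer.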
IDEFICS 2 combines…
1) A vision encoder (a SigLIP-type model)
2) A text model (a 7B-parameter Mistral)
3) A multi-layer (rather than single-layer) connector between the vision and text components
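If you load the model as above, you can see these three pieces directly in the module tree. This is just an inspection sketch; the exact submodule names depend on your Transformers version.

```python
# Walk the first two levels of the module tree to see the vision encoder,
# connector and text model that make up IDEFICS 2. (If the model has been
# wrapped with PEFT, unwrap it first.)
base = model.get_base_model() if hasattr(model, "get_base_model") else model
for name, child in base.named_children():
    print(name, type(child).__name__)
    for sub_name, sub_child in child.named_children():
        print(" ", sub_name, type(sub_child).__name__)
```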
The model performs similarly to models that use a 34B text model, despite using only a 7B Mistral base. It also incorporates OCR pre-training to read text within images.
Fine-tuning and deploying IDEFICS 2 is made easier by its tight integration with the HuggingFace Transformers library and its support for TGI serving. Another beautiful feature is the ability to run inference or fine-tune on inputs containing multiple images.
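Once a TGI endpoint is up (for example on a RunPod GPU pod), you can hit it through TGI’s OpenAI-compatible chat API and pass several images in one request. The endpoint URL and image URLs below are placeholders; this is a minimal sketch rather than the exact setup from the video.

```python
from openai import OpenAI

# Placeholder endpoint: point this at your own TGI deployment, e.g. a RunPod
# pod running text-generation-inference with an IDEFICS 2 checkpoint.
client = OpenAI(base_url="https://YOUR-POD-ID-8080.proxy.runpod.net/v1", api_key="-")

response = client.chat.completions.create(
    model="tgi",  # TGI serves a single model, so the name here is not used for routing
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/chart1.png"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart2.png"}},
                {"type": "text", "text": "Compare these two charts and summarise the key differences."},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

The two image_url entries in a single user turn are what exercise the multi-image support mentioned above.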
That’s it for this week, cheers, Ronan
Ronan McGovern, Trelis Research