🎥 Fine-Tuning Mistral’s Pixtral Open Source Multimodal Model
Mistral AI recently released Pixtral, an open-source model that can process both text and images.
In this video, I take a look at:
- How Pixtral works
- A step-by-step guide to fine-tuning Pixtral on custom datasets
- Tips for getting the best performance when fine-tuning (learning rates, batch size)
- How to push your fine-tuned model to Hugging Face Hub
- Setting up inference using vLLM
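
For the inference step above, vLLM can expose the fine-tuned model behind an OpenAI-compatible server. A rough sketch of the launch command (the model repo name is a placeholder, and exact flags vary by vLLM version):

```shell
# Serve a fine-tuned model pulled from the Hugging Face Hub
# (replace the repo name below with your own upload)
vllm serve your-username/your-pixtral-finetune \
  --max-model-len 8192  # cap the context length to fit GPU memory
```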
A few key things to note:
- Pixtral combines Mistral Nemo with a custom vision encoder (trained from scratch!)
- It can handle images of any dimension thanks to a clever tokenization approach
To get fine-tuning to work, I had to:
1. Use a transformers version of the model
2. Set up a custom chat template
3. Set up a custom data collator with completions-only training
Cheers, Ronan
Resources at Trelis.com/About