🎥 Fine-Tuning Mistral’s Pixtral Open Source Multimodal Model
Mistral AI recently released Pixtral, an open-source model that can process both text and images.
In this video, I take a look at:
- How Pixtral works
- A step-by-step guide to fine-tuning Pixtral on custom datasets
- Tips for getting the best performance when fine-tuning (learning rates, batch size)
- How to push your fine-tuned model to Hugging Face Hub
- Setting up inference using vLLM
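
For the inference step above, vLLM can expose the fine-tuned model behind an OpenAI-compatible server. A rough sketch of the launch command (the model repo name is a placeholder, and exact flags vary by vLLM version):

```shell
# Serve a fine-tuned model pulled from the Hugging Face Hub
# (replace the repo name below with your own upload)
vllm serve your-username/your-pixtral-finetune \
  --max-model-len 8192  # cap the context length to fit GPU memory
```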
A few key things to note:
- Pixtral combines Mistral Nemo with a custom vision encoder (trained from scratch!)
- It can handle images of any dimension thanks to a clever tokenization approach
To get fine-tuning to work, I had to:
1. Use a transformers version of the model
2. Set up a custom chat template
3. Set up a custom data collator with completions-only training
Cheers, Ronan
Resources at Trelis.com/About