Much of what I explained about training and inference in the Pixtral video from Monday applies to Llama 3.2 Vision.
Performance vs. Pixtral (pure vibes)
OCR (reading text): Llama 3.2 seems better. This was a weakness for Pixtral.
Raw performance describing images: Similar (at least in recognizing my chess pieces).
Fine-tuning performance describing images: Similar to Pixtral (based on my chess piece fine-tuning).
Pixtral is very good at handling many images and comments on each one separately. Llama 3.2 Vision, by contrast, seems to struggle when asked to describe four images (it may describe just one, or sometimes three).
Architecture
Llama 3.2 Vision seems to use a simple linear layer to connect the vision and language transformers (surprising, to say the least).
Pixtral uses an MLP (linear - activation function - linear).
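To make the difference concrete, here is a minimal PyTorch sketch of the two connector styles. The dimensions and the GELU activation are illustrative assumptions, not the exact values either model uses:

import torch.nn as nn

# Hypothetical dimensions for illustration only
VISION_DIM, TEXT_DIM = 1024, 4096

# Llama 3.2 Vision style: a single linear projection from
# vision-encoder features into the language model's hidden size
llama_style_projector = nn.Linear(VISION_DIM, TEXT_DIM)

# Pixtral style: a small MLP (linear -> activation -> linear)
pixtral_style_projector = nn.Sequential(
    nn.Linear(VISION_DIM, TEXT_DIM),
    nn.GELU(),  # assumed activation for the sketch
    nn.Linear(TEXT_DIM, TEXT_DIM),
)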
Inference
vLLM works. I've added a one-click template to TrelisResearch/one-click-llms on GitHub.
To fit bf16 on an A40 (48 GB), you need to use --enforce-eager and limit the number of sequences (--max-num-seqs 8).

Right now, vLLM only supports passing in one image to Llama 3.2 Vision (while for Pixtral, you can pass in many).
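Here's a rough sketch of the same settings via vLLM's offline Python API. The model ID, context length, image file, and prompt format are illustrative assumptions; adjust them to your setup:

from vllm import LLM, SamplingParams
from PIL import Image

# Mirrors the CLI flags above: eager mode and a small sequence cap
# so bf16 fits on a 48 GB A40.
llm = LLM(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",  # example model ID
    enforce_eager=True,                 # --enforce-eager
    max_num_seqs=8,                     # --max-num-seqs 8
    max_model_len=8192,                 # assumption: trim context to save KV-cache memory
    limit_mm_per_prompt={"image": 1},   # only one image per prompt is supported
)

image = Image.open("chessboard.jpg")  # hypothetical local image
prompt = "<|image|><|begin_of_text|>Describe the chess pieces in this image."

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=256, temperature=0.0),
)
print(outputs[0].outputs[0].text)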
Fine-tuning
Check out the video from earlier this week on Pixtral. Llama 3.2 Vision follows the same process.
Flash attention is not yet supported, which pushes up memory usage during training, so you need to either use bitsandbytes quantization (even on an A40 GPU with 48 GB) or switch to an A100 for fine-tuning.
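As a starting point, here is a minimal sketch of loading the model in 4-bit with bitsandbytes via transformers. The quantization settings are assumptions for illustration; the full training setup follows the Pixtral process:

import torch
from transformers import AutoProcessor, BitsAndBytesConfig, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # example model ID

# 4-bit NF4 quantization so the model fits on a 48 GB A40 without flash attention
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# From here, attach LoRA adapters (e.g. with peft) and train as in the Pixtral video.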
Cheers,
Ronan
More resources at Trelis.com/About
P.S. For those who purchased lifetime access, I've uploaded a fine-tuning script specifically for Llama 3.2 Vision to Trelis.com/ADVANCED-vision.