Trelis Livestream
On Thursday 25th April 2024 - from 5-6 pm Irish time - there’ll be a livestream on X (post Qs here) and on YouTube (post Qs here).
🎥 Fine-Tuning a Fast Multi-Modal Language Model in Under 10 Minutes 🎥
Here's how to fine-tune Moondream, a small and fast multi-modal language model that accepts both text and image inputs:
A few things I touch on:
✅ Moondream architecture overview (sketched in code after this list)
- Vision encoder (SigLIP) to convert image patches into embeddings
- Adapter (MLP) to project the vision embeddings into the language model's embedding space
- Language model (Phi 1.5) to predict the next token from the combined text & image embeddings
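In code, that data flow looks roughly like this. This is a minimal sketch rather than Moondream's actual implementation: the class and method names and the embedding dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MultiModalLM(nn.Module):
    """Illustrative Moondream-style wiring: vision encoder -> MLP adapter -> LM."""

    def __init__(self, vision_encoder, language_model, vision_dim=1152, text_dim=2048):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a SigLIP image encoder
        # MLP adapter resizing vision embeddings to the LM's hidden size
        self.adapter = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        self.language_model = language_model  # e.g. Phi 1.5

    def forward(self, pixel_values, input_ids):
        patch_embeds = self.vision_encoder(pixel_values)    # (B, patches, vision_dim)
        image_embeds = self.adapter(patch_embeds)           # (B, patches, text_dim)
        text_embeds = self.language_model.embed(input_ids)  # hypothetical embed call
        # Image embeddings are prepended to the text sequence before decoding.
        inputs = torch.cat([image_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)    # next-token logits
```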
✅ Applying LoRA adapters for efficient fine-tuning (see the sketch after this list)
- Adds small trainable low-rank matrices to key linear layers
- Freezes the base model weights to reduce VRAM needs
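As a concrete illustration, here's how that typically looks with Hugging Face's PEFT library. Note the `target_modules` names are an assumption and depend on the base model's layer naming, and `base_model` stands in for the loaded Moondream model.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,           # rank of the trainable low-rank matrices
    lora_alpha=32,  # scaling factor applied to the LoRA update
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # names vary by model
    lora_dropout=0.05,
    bias="none",
)

# Wraps the frozen base model; only the injected adapter matrices train.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()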
✅ Fine-tuning demo on a chess piece dataset (training loop sketched after this list)
- Adapting the open-source code for LoRA support
- Training for 3 epochs with a scaled-up learning rate on the adapters
- Evaluating the fine-tuned model on test images
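To make the adapter-only setup concrete, here's a minimal sketch of such a training loop. This is not the repo's actual code: `chess_dataset` is a hypothetical dataset yielding pre-tokenized batches, and the model is assumed to return a loss (HF-style) when labels are included in the batch.

```python
import torch
from torch.utils.data import DataLoader

# `chess_dataset` is a hypothetical dataset yielding pre-tokenized batches
# (input_ids, pixel_values, labels) ready for the model's forward pass.
loader = DataLoader(chess_dataset, batch_size=8, shuffle=True)

# Only the LoRA adapter weights require grad, so a higher learning rate
# than a full fine-tune is typically safe here.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4
)

model.train()
for epoch in range(3):
    for batch in loader:
        outputs = model(**batch)  # assumed to return loss when labels are present
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch + 1} finished, last loss: {outputs.loss.item():.4f}")
```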
✅ Deploying your fine-tuned model with a custom inference server (see the sketch after this list)
- Supports individual queries & batching
- Run locally or on a hosted GPU with the RunPod 1-click template
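For flavour, here's a minimal sketch of what such a server could look like with FastAPI. This is not the actual Trelis server: `load_image`, `model.answer_question`, and `model.batch_answer` are hypothetical stand-ins for the fine-tuned model's generation calls.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    image_url: str
    question: str

@app.post("/generate")
def generate(query: Query):
    image = load_image(query.image_url)                     # hypothetical helper
    answer = model.answer_question(image, query.question)   # hypothetical API
    return {"answer": answer}

@app.post("/generate_batch")
def generate_batch(queries: list[Query]):
    # Batching amortizes GPU overhead across several requests.
    images = [load_image(q.image_url) for q in queries]
    answers = model.batch_answer(images, [q.question for q in queries])  # hypothetical
    return {"answers": answers}
```

Run it locally with `uvicorn server:app`, or on a hosted GPU via the RunPod template.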
Moondream is a compact model at under 2B parameters, but its performance is excellent, and it's pretty easy to fine-tune on a custom image dataset!
Cheers, Ronan
Ronan McGovern, Trelis Research
Have you had a look at https://github.com/DLYuanGod/TinyGPT-V?

I did more research on it. From my initial testing, Moondream is smaller in size and also gives better results than TinyGPT-V. I have not tried fine-tuning it, though.