Trelis Livestream
On Thursday 25th April 2024 - from 5-6 pm Irish time - there’ll be a livestream on X (post Qs here) and on YouTube (post Qs here).
🎥 Fine-Tuning a Fast Multi-Modal Language Model in Under 10 Minutes 🎥
Here's how to fine-tune Moondream, a small and fast multi-modal language model that accepts both text and image inputs:
A few things I touch on:
✅ Moondream architecture overview (sketched in code after this list)
- Vision encoder (SigLIP) to convert image patches into embeddings
- Adapter (MLP) to project the vision embeddings into the language model's embedding space
- Language model (Phi 1.5) to predict the next token from the combined text & image embeddings
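In code, that data flow looks roughly like this. This is a minimal sketch rather than Moondream's actual implementation: the class and method names and the embedding dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MultiModalLM(nn.Module):
    """Illustrative Moondream-style wiring: vision encoder -> MLP adapter -> LM."""

    def __init__(self, vision_encoder, language_model, vision_dim=1152, text_dim=2048):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g. a SigLIP image encoder
        # MLP adapter resizing vision embeddings to the LM's hidden size
        self.adapter = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        self.language_model = language_model  # e.g. Phi 1.5

    def forward(self, pixel_values, input_ids):
        patch_embeds = self.vision_encoder(pixel_values)    # (B, patches, vision_dim)
        image_embeds = self.adapter(patch_embeds)           # (B, patches, text_dim)
        text_embeds = self.language_model.embed(input_ids)  # hypothetical embed call
        # Image embeddings are prepended to the text sequence before decoding.
        inputs = torch.cat([image_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)    # next-token logits
```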
✅ Applying LoRA adapters for efficient fine-tuning (see the sketch after this list)
- Adds small trainable low-rank matrices to key linear layers
- Freezes the base model weights to reduce VRAM needs
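As a concrete illustration, here's how that typically looks with Hugging Face's PEFT library. Note the `target_modules` names are an assumption and depend on the base model's layer naming, and `base_model` stands in for the loaded Moondream model.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,           # rank of the trainable low-rank matrices
    lora_alpha=32,  # scaling factor applied to the LoRA update
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # names vary by model
    lora_dropout=0.05,
    bias="none",
)

# Wraps the frozen base model; only the injected adapter matrices train.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()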
✅ Fine-tuning demo on a chess piece dataset (training loop sketched after this list)
- Adapting the open-source code for LoRA support
- Training for 3 epochs with a scaled-up learning rate on the adapters
- Evaluating the fine-tuned model on test images
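To make the adapter-only setup concrete, here's a minimal sketch of such a training loop. This is not the repo's actual code: `chess_dataset` is a hypothetical dataset yielding pre-tokenized batches, and the model is assumed to return a loss (HF-style) when labels are included in the batch.

```python
import torch
from torch.utils.data import DataLoader

# `chess_dataset` is a hypothetical dataset yielding pre-tokenized batches
# (input_ids, pixel_values, labels) ready for the model's forward pass.
loader = DataLoader(chess_dataset, batch_size=8, shuffle=True)

# Only the LoRA adapter weights require grad, so a higher learning rate
# than a full fine-tune is typically safe here.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4
)

model.train()
for epoch in range(3):
    for batch in loader:
        outputs = model(**batch)  # assumed to return loss when labels are present
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch + 1} finished, last loss: {outputs.loss.item():.4f}")
```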
✅ Deploying your fine-tuned model with a custom inference server (see the sketch after this list)
- Supports individual queries & batching
- Run locally or on a hosted GPU with the RunPod 1-click template
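For flavour, here's a minimal sketch of what such a server could look like with FastAPI. This is not the actual Trelis server: `load_image`, `model.answer_question`, and `model.batch_answer` are hypothetical stand-ins for the fine-tuned model's generation calls.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    image_url: str
    question: str

@app.post("/generate")
def generate(query: Query):
    image = load_image(query.image_url)                     # hypothetical helper
    answer = model.answer_question(image, query.question)   # hypothetical API
    return {"answer": answer}

@app.post("/generate_batch")
def generate_batch(queries: list[Query]):
    # Batching amortizes GPU overhead across several requests.
    images = [load_image(q.image_url) for q in queries]
    answers = model.batch_answer(images, [q.question for q in queries])  # hypothetical
    return {"answers": answers}
```

Run it locally with `uvicorn server:app`, or on a hosted GPU via the RunPod template.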
Moondream is a compact model at under 2B parameters, but its performance is excellent, and it's pretty easy to fine-tune on a custom image dataset!
Cheers, Ronan
Ronan McGovern, Trelis Research
Have you had a look at https://github.com/DLYuanGod/TinyGPT-V?

I did more research on it. From my initial testing, Moondream is smaller in size and also gives better results than TinyGPT-V. I have not tried fine-tuning it, though.