🔬 Distilling Smaller, Faster AI Models 🔬
I've a new video out that looks at model distillation - a technique for creating smaller, faster AI models from larger ones.
Special thanks to Rohan Sharma, who helped to build the scripts on a Trelis Internship. Thanks also to Elie Bakouch of HuggingFace for help with data preparation.
➡️ Distillation allows you to create models that are:
1. Faster to run
2. Cheaper to deploy
3. Trainable on only ~2% of the original data
➡️ It's widely used for:
- Language models (e.g. GPT-4o to GPT-4o Mini... perhaps)
- Image diffusion
- Speech transcription
➡️ The distillation process involves:
- Pruning - Strategically removing layers/weights
- Knowledge distillation - Training on teacher model outputs (loss sketch below)
- Instruction fine-tuning - Polishing for downstream tasks
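To make the knowledge-distillation step concrete, here's a minimal PyTorch sketch (illustrative only, not the exact code from the video): the student is trained to match the teacher's temperature-softened output distribution via KL divergence, blended with ordinary cross-entropy on the labels. The temperature and alpha values here are placeholder hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    kl = kl * temperature ** 2  # standard rescaling so gradient magnitudes stay comparable

    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1 - alpha) * ce

# Dummy usage: batch of 4 examples, vocabulary of 10 tokens.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))

loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()

# Watching the total gradient norm (see the tips below) helps catch instabilities early.
grad_norm = torch.nn.utils.clip_grad_norm_([student_logits], max_norm=1.0)
print(f"loss={loss.item():.3f}, grad norm={grad_norm.item():.3f}")
```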
➡️ Tips:
- Layer pruning is simplest, but width pruning can give better results (see the pruning sketch after these tips)
- Monitor gradient norms closely during training
- Instruction fine-tuning is crucial for final performance
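And here's a rough sketch of the layer-pruning idea from the first tip, assuming a Llama-style decoder whose transformer blocks live in model.model.layers (the checkpoint name below is only an example, and the pruning strategy in the video may differ). Depth pruning simply drops whole blocks; width pruning instead shrinks the hidden and attention dimensions inside each block.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Example checkpoint only; most Llama-family models expose .model.layers.
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# Crude depth pruning: keep every second decoder block.
# Smarter schemes score layers by importance before deciding which to drop.
keep = range(0, model.config.num_hidden_layers, 2)
model.model.layers = nn.ModuleList(model.model.layers[i] for i in keep)
model.config.num_hidden_layers = len(model.model.layers)

n_params = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Pruned model: {len(model.model.layers)} layers, {n_params:.0f}M params")
```

A pruned model like this then goes through the knowledge-distillation and instruction fine-tuning steps above to recover quality.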
In the video, I walk through the full code and process for distilling a 99M-parameter model.
…and yeah, I need to better train my diffusion model lora to stop it aging me 🤣🤣
That’s it for this week, cheers, Ronan
More resources at Trelis.com/About