🔬 Distilling Smaller, Faster AI Models 🔬
I've a new video out that looks at model distillation - a technique for creating smaller, faster AI models from larger ones.
Special thanks to Rohan Sharma, who helped to build the scripts on a Trelis Internship. Thanks also to Elie Bakouch of HuggingFace for help with data preparation.
➡️ Distillation allows you to create models that are:
1. Faster to run
2. Cheaper to deploy
3. Trainable on only ~2% of the original data
➡️ It's widely used for:
- Language models (e.g. GPT-4o to GPT-4o Mini... perhaps)
- Image diffusion
- Speech transcription
➡️ The distillation process involves:
- Pruning - Strategically removing layers/weights
- Knowledge distillation - Training on teacher model outputs (loss sketch below)
- Instruction fine-tuning - Polishing for downstream tasks
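To make the knowledge-distillation step concrete, here's a minimal PyTorch sketch (illustrative only, not the exact code from the video): the student is trained to match the teacher's temperature-softened output distribution via KL divergence, blended with ordinary cross-entropy on the labels. The temperature and alpha values here are placeholder hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    kl = kl * temperature ** 2  # standard rescaling so gradient magnitudes stay comparable

    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1 - alpha) * ce

# Dummy usage: batch of 4 examples, vocabulary of 10 tokens.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))

loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()

# Watching the total gradient norm (see the tips below) helps catch instabilities early.
grad_norm = torch.nn.utils.clip_grad_norm_([student_logits], max_norm=1.0)
print(f"loss={loss.item():.3f}, grad norm={grad_norm.item():.3f}")
```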
➡️ Tips:
- Layer pruning is simplest, but width pruning can give better results (see the pruning sketch after these tips)
- Monitor gradient norms closely during training
- Instruction fine-tuning is crucial for final performance
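And here's a rough sketch of the layer-pruning idea from the first tip, assuming a Llama-style decoder whose transformer blocks live in model.model.layers (the checkpoint name below is only an example, and the pruning strategy in the video may differ). Depth pruning simply drops whole blocks; width pruning instead shrinks the hidden and attention dimensions inside each block.

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Example checkpoint only; most Llama-family models expose .model.layers.
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# Crude depth pruning: keep every second decoder block.
# Smarter schemes score layers by importance before deciding which to drop.
keep = range(0, model.config.num_hidden_layers, 2)
model.model.layers = nn.ModuleList(model.model.layers[i] for i in keep)
model.config.num_hidden_layers = len(model.model.layers)

n_params = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Pruned model: {len(model.model.layers)} layers, {n_params:.0f}M params")
```

A pruned model like this then goes through the knowledge-distillation and instruction fine-tuning steps above to recover quality.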
In the video, I walk through the full code and process for distilling a 99M-parameter model.
…and yeah, I need to better train my diffusion model lora to stop it aging me 🤣🤣
That’s it for this week, cheers, Ronan
More resources at Trelis.com/About