In this video, I cover techniques for training LLMs with fewer GPUs:
🔹 Optimizer tricks: AdamW, 8-bit AdamW, Adafactor (sketch below)
🔹 Gradient techniques: GaLore, subspace descent
🔹 Layer-wise updates for single-GPU training
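If you want to try the optimizer swaps yourself, here's a rough sketch (not the exact script from the video; the model and learning rate are placeholders, and it assumes transformers and bitsandbytes are installed):

```python
# Minimal sketch of the optimizer swaps above (illustrative model and lr).
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM
from transformers.optimization import Adafactor

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

# Baseline AdamW: two fp32 state tensors per parameter (most optimizer memory).
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-4)

# 8-bit AdamW: stores optimizer state in 8 bits, cutting optimizer memory sharply.
opt_adamw_8bit = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)

# Adafactor: factorizes the second-moment estimate, so state for an n x m weight
# scales roughly with n + m rather than n * m.
opt_adafactor = Adafactor(
    model.parameters(), lr=1e-4, scale_parameter=False, relative_step=False
)
```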
Rough findings (and they're setup-specific):
✅ AdamW 8-bit: maintains quality, ~30% memory savings
✅ Adafactor: solid performance, ~45% less memory
✅ GaLore + subspace descent: good balance of quality and memory use (sketch after this list)
✅ Layer-wise GaLore: lowest memory use, but slower
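And here's a minimal sketch of plain GaLore via the galore-torch package (the rank, scale, and choice of target modules are illustrative; the subspace-descent and layer-wise variants from the video aren't shown here):

```python
# Minimal GaLore sketch: project gradients of 2-D weights into a low-rank
# subspace so optimizer state is kept at rank r instead of full size.
from transformers import AutoModelForCausalLM
from galore_torch import GaLoreAdamW

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

# Apply GaLore only to the 2-D attention/MLP weight matrices (illustrative filter).
galore_params, regular_params = [], []
for name, p in model.named_parameters():
    if p.ndim == 2 and any(k in name for k in ("attn", "mlp")):
        galore_params.append(p)
    else:
        regular_params.append(p)

param_groups = [
    {"params": regular_params},
    {"params": galore_params, "rank": 128, "update_proj_gap": 200,
     "scale": 0.25, "proj_type": "std"},
]
optimizer = GaLoreAdamW(param_groups, lr=1e-4)
```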
Whether you need to squeeze training onto a single GPU, or you have multiple GPUs and still can't fit a large model, you should find some useful tips here.
Cheers, Ronan
More Resources at Trelis.com/About