In this video, I cover techniques for training LLMs with fewer GPUs:
🔹 Optimizer tricks: AdamW, 8-bit AdamW, Adafactor (sketch below)
🔹 Gradient techniques: GaLore, subspace descent
🔹 Layer-wise updates for single-GPU training
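If you want to try the optimizer swaps yourself, here's a rough sketch (not the exact script from the video; the model and learning rate are placeholders, and it assumes transformers and bitsandbytes are installed):

```python
# Minimal sketch of the optimizer swaps above (illustrative model and lr).
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM
from transformers.optimization import Adafactor

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

# Baseline AdamW: two fp32 state tensors per parameter (most optimizer memory).
opt_adamw = torch.optim.AdamW(model.parameters(), lr=1e-4)

# 8-bit AdamW: stores optimizer state in 8 bits, cutting optimizer memory sharply.
opt_adamw_8bit = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)

# Adafactor: factorizes the second-moment estimate, so state for an n x m weight
# scales roughly with n + m rather than n * m.
opt_adafactor = Adafactor(
    model.parameters(), lr=1e-4, scale_parameter=False, relative_step=False
)
```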
Rough findings (and they're setup-specific):
✅ AdamW 8-bit: maintains quality, ~30% memory savings
✅ Adafactor: solid performance, ~45% less memory
✅ GaLore + subspace descent: good balance of quality and memory use (sketch after this list)
✅ Layer-wise GaLore: lowest memory use, but slower
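And here's a minimal sketch of plain GaLore via the galore-torch package (the rank, scale, and choice of target modules are illustrative; the subspace-descent and layer-wise variants from the video aren't shown here):

```python
# Minimal GaLore sketch: project gradients of 2-D weights into a low-rank
# subspace so optimizer state is kept at rank r instead of full size.
from transformers import AutoModelForCausalLM
from galore_torch import GaLoreAdamW

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

# Apply GaLore only to the 2-D attention/MLP weight matrices (illustrative filter).
galore_params, regular_params = [], []
for name, p in model.named_parameters():
    if p.ndim == 2 and any(k in name for k in ("attn", "mlp")):
        galore_params.append(p)
    else:
        regular_params.append(p)

param_groups = [
    {"params": regular_params},
    {"params": galore_params, "rank": 128, "update_proj_gap": 200,
     "scale": 0.25, "proj_type": "std"},
]
optimizer = GaLoreAdamW(param_groups, lr=1e-4)
```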
Whether you need to squeeze training onto a single GPU, or you have multiple GPUs and still can't fit a large model, you should find some useful tips here.
Cheers, Ronan
More Resources at Trelis.com/About