Speculative Decoding:
New public repo with a list of one-click templates:
Don’t hesitate to create an issue if there’s a specific one-click Runpod or Vast.AI template you’d like me to make:
Pushing 4-bit QLoRA models to hub
For a long time, it wasn’t possible to push 4-bit models to the hub. A pull request has just been made that allows models to be merged, saved, and pushed to the hub. Let me know in the comments if you try it. I’ll cover it in my next fine-tuning video (whenever that is).
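For reference, here’s a minimal sketch of what that workflow could look like, assuming a recent transformers/peft/bitsandbytes stack; the model and adapter names are placeholders and the exact API may differ slightly depending on the versions you have installed.

```python
# Sketch: merge a 4-bit QLoRA adapter into its base model and push the result to the hub.
# Assumes recent transformers, peft, accelerate and bitsandbytes; repo names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"              # placeholder base model
adapter_id = "your-username/your-qlora-adapter"   # placeholder adapter repo

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)

# Merge the LoRA weights into the 4-bit base, then push the merged model.
merged = model.merge_and_unload()
merged.push_to_hub("your-username/your-merged-model")           # placeholder repo
AutoTokenizer.from_pretrained(base_id).push_to_hub("your-username/your-merged-model")
```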
Quantizing on the fly with quality and speed
If there isn’t an AWQ model available, one approach is to quantize on the fly with bitsandbytes NF4. The quality is good (probably better than AWQ), but speed degrades as batch size increases because a lot of computation is needed to dequantize the weights on the GPU.
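Here’s a minimal sketch of loading a model with on-the-fly NF4 quantization via transformers and bitsandbytes (the model name is just a placeholder):

```python
# Sketch: load a model with on-the-fly bitsandbytes NF4 quantization.
# Requires transformers, accelerate and bitsandbytes; model id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 4-bit data type
    bnb_4bit_use_double_quant=True,        # nested quantization for extra VRAM savings
    bnb_4bit_compute_dtype=torch.bfloat16, # weights are dequantized to bf16 for each matmul
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=nf4_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```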
A great alternative is --quantize eetq with Text Generation Inference. It’s 8-bit, so it doesn’t save as much VRAM as AWQ, but there are optimised compute kernels that fuse dequantization with the multiplication operations, so there is less slowdown at larger batch sizes. Furthermore, quality is much less affected at 8-bit than at 4-bit. Read more here.
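As a rough sketch (the model id and server address are placeholders, and I’m assuming a standard TGI docker launch), serving with EETQ and querying the server looks something like this:

```python
# Sketch: query a Text Generation Inference server launched with EETQ quantization.
# Example launch command (placeholder model id):
#   docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest \
#       --model-id mistralai/Mistral-7B-v0.1 --quantize eetq
import requests

resp = requests.post(
    "http://localhost:8080/generate",  # placeholder server address
    json={
        "inputs": "What is EETQ quantization?",
        "parameters": {"max_new_tokens": 64},
    },
)
print(resp.json()["generated_text"])
```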
Cheers, Ronan
Links: