UPDATE to Trelis/function-calling-v3
With the help of Zack Song, the function-calling-v3 dataset has been translated into Spanish and Chinese. All three languages (including English) are now available on the ‘multi-lingual’ branch to anyone who purchases access (or previously bought access).
This is a compact dataset that allows language models to be fine-tuned for more consistent performance on function calling than a zero-shot or one-shot approach. You can find out more on pre-trained function-calling models, as well as tutorials for function-calling fine-tuning, over on Trelis.com/function-calling.
Notes on Qwen1.5 Performance
Sagar Desai and I spent significant time working with the Qwen1.5 family of models, with the goal of fine-tuning for function calling, ideally across multiple languages. This proved difficult, and performance was inferior to the models I find perform best for function calling, notably the openchat_3.5, Yi/SUSChat, Mixtral and DeepSeek models from the Trelis collection here.
For those of you digging into Qwen, here are a few lessons learned:
AWQ tends to be unstable with Qwen1.5: both making AWQ quants and running inference on the original AWQ weights proved challenging. Testing with the bf16 weights (perhaps quantized on the fly with eetq or bitsandbytes-nf4) gave more consistent performance; there's a rough sketch of that approach below.
When quantizing to GGUF, be sure to install sentencepiece first (see the conversion snippet below).
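Here's a minimal sketch of the bitsandbytes-nf4 route; the model name and settings are illustrative, not the exact setup we used:

```python
# Sketch: loading a Qwen1.5 chat model with on-the-fly NF4 quantization
# via bitsandbytes, as an alternative to pre-made AWQ quants.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen1.5-7B-Chat"  # illustrative; other Qwen1.5 sizes work too

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```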
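And for the GGUF step, a hedged sketch of the conversion call — the script name and paths follow llama.cpp's layout at the time of writing and are assumptions:

```python
# Sketch: llama.cpp's HF-to-GGUF converter imports sentencepiece at startup,
# so conversion fails with an ImportError unless it is installed
# (pip install sentencepiece). Paths below are placeholders.
import importlib.util
import subprocess

assert importlib.util.find_spec("sentencepiece") is not None, \
    "Run `pip install sentencepiece` before converting."

subprocess.run(
    ["python", "convert-hf-to-gguf.py", "models/Qwen1.5-7B-Chat",
     "--outtype", "f16"],
    check=True,
)
```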
Tip: Check out Sagar’s blog on LLMs here
Zero-shot function calling
I spent some time trying out zero-shot function calling with the Qwen1.5 series of models, which claim to support it. The 4B and 7B models incorrectly called functions when none was required. The 14B model was able to call functions zero-shot, but not to chain multiple function calls (i.e. when the result of one call is needed as input to the next). Most likely the 72B model would perform even better, although it is quite unwieldy in size; perhaps the best way to run inference on it is with TGI and bitsandbytes-nf4, although generation will be slow in tokens per second.
I’ve made public a repo with a tokenizer containing a zero-shot function-calling chat template for Qwen here.
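As a minimal sketch of what zero-shot function calling looks like at the prompt level — the function schema and system-prompt wording below are illustrative assumptions, not the exact template in the repo:

```python
# Sketch: building a zero-shot function-calling prompt with a chat template.
import json
from transformers import AutoTokenizer

# Illustrative base model; the repo linked above ships a custom template.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-14B-Chat")

# Hypothetical function schema for illustration.
functions = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# One common pattern: put the function metadata in the system turn and ask
# the model to reply with a JSON function call when one is needed.
messages = [
    {"role": "system",
     "content": "You have access to these functions:\n"
                + json.dumps(functions, indent=2)
                + "\nRespond with a JSON function call when appropriate."},
    {"role": "user", "content": "What's the weather in Dublin right now?"},
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # feed this string to your inference endpoint
```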
Gemma 7B
There's a one-click TGI template now for Google's Gemma 7B (really 9B) model in https://github.com/TrelisResearch/one-click-llms:
Includes fast weight downloading with hf_transfer
bf16 for fastest inference (will run on an A4000 or A6000 or larger)
ngram speculation for a speed-up
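Once the pod is up, you can query it with huggingface_hub's InferenceClient; the endpoint URL below is a placeholder, and the prompt follows Gemma's instruct turn format:

```python
# Sketch: querying a running TGI endpoint serving Gemma 7B.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # placeholder URL

# Gemma's instruct models expect <start_of_turn>/<end_of_turn> markers.
prompt = (
    "<start_of_turn>user\n"
    "What is the capital of France?<end_of_turn>\n"
    "<start_of_turn>model\n"
)

print(client.text_generation(prompt, max_new_tokens=64))
```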
That’s it for this week. Many thanks to Zack and Sagar for their contributions. Cheers, Ronan
Links:
➡️ Trelis Function-calling Models
➡️ One-click Fine-tuning & Inference Templates
➡️ Tip Jar