Transformers from Scratch
In last week’s livestream, I covered part 1 of building a transformer model from scratch. Later today (noon New York, 5 pm Irish, 9.30 pm Delhi) I’ll present part 2, touching on loss calculation, pre-norm versus post-norm, dataset setup, and RoPE versus sinusoidal embeddings.
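As a quick preview of the pre-norm versus post-norm point: the difference is simply whether layer norm is applied to the input of each sub-layer (inside the residual branch) or after the residual addition. Here is a minimal PyTorch sketch, purely illustrative - the class names and shapes are mine, not the livestream code:

import torch.nn as nn

class PreNormBlock(nn.Module):
    # Pre-norm: normalize the sub-layer input, then add the residual
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]    # residual around attention
        x = x + self.mlp(self.norm2(x))  # residual around the MLP
        return x

class PostNormBlock(nn.Module):
    # Post-norm (original Transformer): add the residual first, then normalize
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])  # normalize after the residual add
        x = self.norm2(x + self.mlp(x))
        return x

Pre-norm is what most modern LLMs use, largely because it trains more stably at depth - one of the trade-offs I'll cover in the livestream.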
OpenELM Disappoints
Apple released a series of 270M, 450M, 1.1B and 3B models called OpenELM. Unfortunately, the models are poorly supported (no chat template, flash attention, GGUF or vLLM support) and performance is poor, even after preference fine-tuning with ORPO (a rough ORPO setup is sketched after the list below):
OpenELM 270M is uniquely small, but weak.
OpenELM 450M improves a little over the 270M model, but remains weak on accuracy and hallucinates heavily.
Qwen 1.5 0.5B is stronger than the OpenELM 450M model.
TinyLlama 1.1B is stronger than OpenELM 1.1B.
Models like Phi-3 are stronger than OpenELM 3B.
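The exact OpenELM fine-tuning setup lives in the ADVANCED-fine-tuning repo linked at the end; for orientation, an ORPO + LoRA run with TRL's ORPOTrainer looks roughly like the sketch below. The dataset name and hyperparameters are placeholders, and the OpenELM repo/tokenizer details are assumptions based on Apple's model card, so treat this as a sketch rather than the exact script used:

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

# OpenELM uses custom modelling code, so trust_remote_code is needed;
# Apple's model card points to the Llama-2 tokenizer since none is bundled
model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-270M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # the Llama-2 tokenizer has no pad token

# Preference dataset with "prompt", "chosen" and "rejected" columns (placeholder name)
dataset = load_dataset("your_org/your_preference_dataset", split="train")

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type="CAUSAL_LM",
    # For a custom architecture like OpenELM you will likely need to set
    # target_modules explicitly to the model's projection layer names
)

orpo_config = ORPOConfig(
    output_dir="openelm-orpo",
    beta=0.1,  # weight on the odds-ratio preference term
    max_length=1024,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    num_train_epochs=1,
)

trainer = ORPOTrainer(
    model=model,
    args=orpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()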
It reminds me of George Hotz's remarks on AMD's open sourcing of device drivers - something along the lines of: "It's not open source if you don't provide the support/documentation."
Here is a sample generation from the ORPO fine-tuned 270M model:
<s> [INST] What are the planets in the solar system? [/INST]The planets in the solar system are the moons of the Earth, Mars, Jupiter, and Saturn. The planets are the smallest bodies in the solar system, and they are the only bodies that are not orbiting the Sun. The planets are the only bodies that are not directly in contact with the Sun, and they are the only bodies that are not in the habitable zone of the Sun.
The moons of the Earth, Mars, and Jupiter are the largest bodies in the solar system, and they are the only bodies that are directly in contact with the Sun. The moons are the only bodies that are not directly in contact with the Sun, and they are the only bodies that are not in the habitable zone of the Sun.
The planets are the only bodies that are not directly in contact with the Sun...
The 450M and 1.1B models are not much better - whether in Apple’s instruct form or with ORPO fine-tuning. I have made a Colab notebook here if you’d like to try out the models some more.
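If you want to run a quick generation outside the notebook, a manual prompt format is needed because there is no chat template. The sketch below mirrors the [INST] wrapping from the sample above, with placeholder repo ids (the tokenizer choice follows Apple's model card - check the notebook for the exact setup):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder ids - OpenELM needs trust_remote_code and ships without its own tokenizer
model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-270M-Instruct", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# No chat template, so the prompt is wrapped by hand in the [INST] format
prompt = "[INST] What are the planets in the solar system? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))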
It is excellent that Apple has moved to open source their models. Great tiny models are strongly in Apple's interest, as they increase the value of owning a Mac (especially since Macs perform much better than typical Windows PCs at local inference). Even if the released models aren't the strongest, it would be worth Apple allocating a small amount of time to better support and documentation - adding a chat template, flash attention, GGUF and vLLM support is low cost for Apple and high benefit for developers.
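To illustrate how small the chat templating fix is: a chat template is just a Jinja string set on the tokenizer and pushed alongside it. The template below is a generic [INST]-style example I've made up for illustration, not anything Apple ships:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder tokenizer

# Minimal [INST]-style template (illustrative only)
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}[INST] {{ message['content'] }} [/INST]"
    "{% else %}{{ message['content'] }}{{ eos_token }}{% endif %}"
    "{% endfor %}"
)

messages = [{"role": "user", "content": "What are the planets in the solar system?"}]
print(tokenizer.apply_chat_template(messages, tokenize=False))
# -> [INST] What are the planets in the solar system? [/INST]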
De-quantizing LLMs
Until now, a major drawback of QLoRA was that the trained adapters could not easily be merged onto the base model without a significant loss of accuracy. The best options to date were:
a) Using Unsloth for fine-tuning (which does correctly allow for merging LoRA adapters with dequantized base weights).
b) Merging adapters onto the quantized base weights (and keeping the model in 4-bit). This works, but inference with nf4 weights is slow compared to AWQ or GPTQ - and you can't convert to those formats from the 4-bit model, since AWQ and GPTQ quantization starts from 16-bit weights.
This issue has now finally been resolved with this PR. This means correct QLoRA merging to 16 bits is possible with:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
from peft import PeftModel

# Assumes you have already used QLoRA to train and push a LoRA adapter to the hub; the base model is model_id
model_id = "facebook/opt-125m"
lora_adapter = "your_org/your_lora_adapter_name"

# Load the base model in 4-bit (nf4), as during QLoRA training
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=BitsAndBytesConfig(load_in_4bit=True))
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dequantize the base weights back to 16-bit (the new method added by the PR)
model.dequantize()

# Load the adapter onto the dequantized base model
peft_model = PeftModel.from_pretrained(model, lora_adapter)

# Merge the adapter into the base weights - you now have a 16-bit merged model
# that you can run inference on, or quantize with GPTQ or AWQ
model = peft_model.merge_and_unload()
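From there, the merged 16-bit model behaves like any ordinary checkpoint - you can save it locally or push it to the hub (the repo id below is a placeholder) and then run AWQ or GPTQ quantization on it:

# Save the merged 16-bit model and tokenizer
model.save_pretrained("merged-16bit-model")
tokenizer.save_pretrained("merged-16bit-model")
# model.push_to_hub("your_org/your_merged_model")      # optional
# tokenizer.push_to_hub("your_org/your_merged_model")  # optional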
For more on merging models, you can check this video.
Cheers, Ronan
Ronan McGovern, Trelis Research
➡️ ADVANCED-fine-tuning Repo (which now includes a branch for OpenELM fine-tuning with ORPO and LoRA).