🌟 Simultaneous Preference Optimisation + Supervised Fine-tuning 🌟
The traditional way to train a model from scratch is:
Unsupervised pre-training -> Supervised Fine-Tuning -> Reinforcement Learning
... and the reinforcement learning part has always been tricky to get right - sometimes even worsening model performance.
In my latest video, I dig into how ORPO lets you combine preference fine-tuning (a simpler alternative to full reinforcement learning) with supervised fine-tuning - all in one step. By considering not just the probability of the next token, but the odds ratio between chosen (good) and rejected (bad) responses, ORPO shifts your model towards producing higher quality outputs.
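If you want a feel for the objective itself, here's a minimal sketch of how the loss combines the two terms. The function and variable names, and the lambda weight, are my own illustrative choices rather than code from the repo:

```python
import torch
import torch.nn.functional as F

def orpo_loss(sft_nll, chosen_logps, rejected_logps, lam=0.1):
    """Illustrative ORPO objective: SFT loss plus a weighted odds-ratio term.

    sft_nll:        next-token negative log-likelihood on the chosen response
    chosen_logps:   average per-token log-probability of the chosen response, shape (batch,)
    rejected_logps: average per-token log-probability of the rejected response, shape (batch,)
    lam:            weight on the odds-ratio term (the value here is just an example)
    """
    # odds(y) = p(y) / (1 - p(y)); computed in log space for numerical stability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # Push the odds of the chosen response above the odds of the rejected one
    odds_ratio_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    # One combined objective: supervised fine-tuning and preference optimisation in a single step
    return (sft_nll + lam * odds_ratio_term).mean()
```

The key point is that there's no reward model and no separate second training stage - both terms come from the same forward passes over the chosen and rejected responses.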
⏱️ ORPO saves training steps and time because it handles SFT and preference optimisation at the same time.
📈 On benchmarks like MMLU, models I fine-tuned with ORPO outperform those trained with supervised fine-tuning alone.
🧑‍💻 To tune with ORPO you'll need a dataset with both a chosen and a rejected response for each prompt. I used a dataset of just ~7,000 rows (an example row is sketched below).
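For reference, a row in this kind of preference dataset typically looks something like the following. The text here is made up for illustration (it's not from the dataset I used), and exact field names vary by library - prompt / chosen / rejected is a common convention:

```python
from datasets import load_dataset

# Illustrative example of one preference row
example_row = {
    "prompt": "Explain what ORPO does in one sentence.",
    "chosen": "ORPO adds an odds-ratio penalty to the supervised fine-tuning loss so the "
              "model learns to prefer good responses over bad ones in a single training run.",
    "rejected": "ORPO is a new optimizer that replaces AdamW.",
}

# Hypothetical local file - substitute whichever preference dataset you actually use
dataset = load_dataset("json", data_files="preference_pairs.jsonl", split="train")
print(dataset[0])  # expects keys like "prompt", "chosen", "rejected"
```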
Check out the full video where I explain the maths behind ORPO, demo the fine-tuning process, and share detailed results.
Cheers, Ronan
Ronan McGovern, Trelis Research
➡️ ADVANCED-fine-tuning Repo (and individual ORPO scripts)
➡️ One-click Fine-tuning, Evaluation and API Templates
➡️ ADVANCED-transcription Repo