In my latest video, I train LLaMA 3 on the Irish language and show how to build a dataset from Wikipedia and use it for fine-tuning. Here’s what I did:
🎵 Preparing the Dataset
- Extracted articles from an Irish Wikipedia dump
- Processed the data into a clean JSON format
- Pushed the dataset to Hugging Face Hub
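If you want to reproduce that step, here's a rough sketch using the `datasets` library. The dump date, repo ids and length filter are placeholders, not the exact ones from the video:

```python
# Rough sketch: pull Irish ("ga") Wikipedia text, clean it, and push it to the Hub.
# Dataset/repo ids and the stub-length threshold are illustrative assumptions.
from datasets import load_dataset

# Pre-extracted Wikipedia articles hosted on the Hugging Face Hub
wiki = load_dataset("wikimedia/wikipedia", "20231101.ga", split="train")

# Drop very short stubs and keep a single clean "text" field per article
wiki = wiki.filter(lambda x: len(x["text"]) > 500)
wiki = wiki.map(lambda x: {"text": x["title"] + "\n\n" + x["text"].strip()})
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])

# Save a local JSON-lines copy and upload the cleaned dataset to the Hub
wiki.to_json("irish_wikipedia.jsonl")
wiki.push_to_hub("your-username/irish-wikipedia")  # placeholder repo id
```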
🎵 Fine-Tuning Setup
- Blended the Irish dataset with a small amount of chat data to avoid catastrophic forgetting
- Used LoRA with rank 128
- Made the embedding layer and LM head trainable
- Trained for 1 epoch with a constant learning rate, followed by a short annealing period
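Here's a minimal sketch of that setup with `peft`. Only the rank of 128 and the trainable embeddings/LM head come from the video; the alpha value, target modules, chat dataset and mixing ratio are assumptions:

```python
# Sketch of the LoRA setup: rank 128, with the embeddings and LM head fully trainable.
from datasets import load_dataset, interleave_datasets
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=128,                                        # LoRA rank used in the video
    lora_alpha=128,                               # assumption: alpha set equal to rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # train embeddings + LM head fully
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Blend the Irish corpus with a small fraction of chat data (both mapped to a
# shared {"text": ...} schema beforehand) to guard against catastrophic forgetting.
irish = load_dataset("your-username/irish-wikipedia", split="train")    # placeholder id
chat = load_dataset("your-username/chat-data-as-text", split="train")   # placeholder id
mixed = interleave_datasets([irish, chat], probabilities=[0.95, 0.05], seed=42)
```

Note that a constant-then-anneal learning rate isn't a stock scheduler in `transformers`; one simple approximation is to train most of the epoch with `lr_scheduler_type="constant"` and then run a short final pass with a linear decay.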
🎵 Evaluation
- Prepared a custom dataset of questions and answers for manual inspection
- Observed improvements in the model's ability to respond in Irish, although some inconsistencies remained
- Tracked quantitative improvements via the evaluation loss on a held-out validation dataset
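For the manual inspection, a simple generation loop is enough. This is a hedged sketch, the checkpoint id, question file and generation settings are illustrative, and it assumes the tokenizer has a chat template:

```python
# Sketch: generate answers to a custom set of Irish questions for eyeballing quality.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/llama-3-irish"  # placeholder fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

with open("irish_eval_questions.json") as f:  # placeholder Q&A file
    questions = json.load(f)

for item in questions:
    messages = [{"role": "user", "content": item["question"]}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=200, do_sample=False)
    answer = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    print(item["question"], "->", answer)
```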
🎵 Potential Improvements
- Increase the training sequence length to fully utilize the model's context
- Expand the dataset by incorporating additional sources (e.g., books, Common Crawl)
- Train for more epochs to further reduce the evaluation loss
- Apply supervised fine-tuning or ORPO on top of the unsupervised fine-tuning
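On that last point, ORPO can be layered on top using trl's ORPOTrainer. A hedged sketch, assuming a preference dataset with prompt/chosen/rejected columns (all ids and hyperparameters below are illustrative):

```python
# Sketch: ORPO preference tuning on top of the Irish fine-tune, using trl.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "your-username/llama-3-irish"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference data with "prompt", "chosen" and "rejected" columns
prefs = load_dataset("your-username/irish-preferences", split="train")  # placeholder id

config = ORPOConfig(
    output_dir="llama-3-irish-orpo",
    beta=0.1,  # weighting of the odds-ratio term
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

# In newer trl versions, pass processing_class=tokenizer instead of tokenizer=
trainer = ORPOTrainer(model=model, args=config, train_dataset=prefs, tokenizer=tokenizer)
trainer.train()
```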
Feel free to comment or write back with any questions, Ronan
Ronan McGovern, Trelis Research