In my latest video, I train LLaMA 3 on the Irish language and show how to build a dataset from Wikipedia and use it for fine-tuning. Here’s what I did:
🎵 Preparing the Dataset
- Extracted articles from an Irish Wikipedia dump
- Processed the data into a clean JSON format
- Pushed the dataset to Hugging Face Hub
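If you want to reproduce that step, here's a rough sketch using the `datasets` library. The dump date, repo ids and length filter are placeholders, not the exact ones from the video:

```python
# Rough sketch: pull Irish ("ga") Wikipedia text, clean it, and push it to the Hub.
# Dataset/repo ids and the stub-length threshold are illustrative assumptions.
from datasets import load_dataset

# Pre-extracted Wikipedia articles hosted on the Hugging Face Hub
wiki = load_dataset("wikimedia/wikipedia", "20231101.ga", split="train")

# Drop very short stubs and keep a single clean "text" field per article
wiki = wiki.filter(lambda x: len(x["text"]) > 500)
wiki = wiki.map(lambda x: {"text": x["title"] + "\n\n" + x["text"].strip()})
wiki = wiki.remove_columns([c for c in wiki.column_names if c != "text"])

# Save a local JSON-lines copy and upload the cleaned dataset to the Hub
wiki.to_json("irish_wikipedia.jsonl")
wiki.push_to_hub("your-username/irish-wikipedia")  # placeholder repo id
```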
🎵 Fine-Tuning Setup
- Blended the Irish dataset with a small amount of chat data to avoid catastrophic forgetting
- Used LoRA with rank 128
- Made the embedding layer and LM head trainable
- Trained for 1 epoch with a constant learning rate, followed by a short annealing period
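Here's a minimal sketch of that setup with `peft`. Only the rank of 128 and the trainable embeddings/LM head come from the video; the alpha value, target modules, chat dataset and mixing ratio are assumptions:

```python
# Sketch of the LoRA setup: rank 128, with the embeddings and LM head fully trainable.
from datasets import load_dataset, interleave_datasets
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=128,                                        # LoRA rank used in the video
    lora_alpha=128,                               # assumption: alpha set equal to rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # train embeddings + LM head fully
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Blend the Irish corpus with a small fraction of chat data (both mapped to a
# shared {"text": ...} schema beforehand) to guard against catastrophic forgetting.
irish = load_dataset("your-username/irish-wikipedia", split="train")    # placeholder id
chat = load_dataset("your-username/chat-data-as-text", split="train")   # placeholder id
mixed = interleave_datasets([irish, chat], probabilities=[0.95, 0.05], seed=42)
```

Note that a constant-then-anneal learning rate isn't a stock scheduler in `transformers`; one simple approximation is to train most of the epoch with `lr_scheduler_type="constant"` and then run a short final pass with a linear decay.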
🎵 Evaluation
- Prepared a custom dataset of questions and answers for manual inspection
- Observed improvements in the model's ability to respond in Irish, although some inconsistencies remained
- Tracked quantitative improvements via the evaluation loss on a held-out validation dataset
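For the manual inspection, a simple generation loop is enough. This is a hedged sketch, the checkpoint id, question file and generation settings are illustrative, and it assumes the tokenizer has a chat template:

```python
# Sketch: generate answers to a custom set of Irish questions for eyeballing quality.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/llama-3-irish"  # placeholder fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

with open("irish_eval_questions.json") as f:  # placeholder Q&A file
    questions = json.load(f)

for item in questions:
    messages = [{"role": "user", "content": item["question"]}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=200, do_sample=False)
    answer = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    print(item["question"], "->", answer)
```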
🎵 Potential Improvements
- Increase the training sequence length to fully utilize the model's context
- Expand the dataset by incorporating additional sources (e.g., books, Common Crawl)
- Train for more epochs to further reduce the evaluation loss
- Apply supervised fine-tuning or ORPO on top of the unsupervised fine-tuning
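On that last point, ORPO can be layered on top using trl's ORPOTrainer. A hedged sketch, assuming a preference dataset with prompt/chosen/rejected columns (all ids and hyperparameters below are illustrative):

```python
# Sketch: ORPO preference tuning on top of the Irish fine-tune, using trl.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "your-username/llama-3-irish"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference data with "prompt", "chosen" and "rejected" columns
prefs = load_dataset("your-username/irish-preferences", split="train")  # placeholder id

config = ORPOConfig(
    output_dir="llama-3-irish-orpo",
    beta=0.1,  # weighting of the odds-ratio term
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

# In newer trl versions, pass processing_class=tokenizer instead of tokenizer=
trainer = ORPOTrainer(model=model, args=config, train_dataset=prefs, tokenizer=tokenizer)
trainer.train()
```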
Feel free to comment or write back with any questions, Ronan
Ronan McGovern, Trelis Research