FineWeb - A Finely Cleaned Common Crawl Dataset
Hugging Face has released "FineWeb-Edu" - a Common Crawl-derived dataset that has been finely filtered, including with the help of Llama 3 70B.
It's likely the best open source dataset for pre-training!
I'll take the opportunity to review the full pipeline for cleaning web data, including:
➡️ Basic filtering, e.g. URL blacklists, language detection, and quality heuristics (see the first sketch below).
➡️ De-duplication, so the model doesn't overfit to or overweight repeated data. De-duplicating within each web crawl, rather than across all crawls, seems to work best (see the MinHash sketch below).
➡️ LLM-assisted filtering, i.e. using a language model to score the educational quality of each text sample on a 0-5 scale, and retaining only documents scoring 3 or higher. This step is what takes the base FineWeb dataset to the FineWeb-Edu version and boosts downstream performance (see the classifier sketch below).
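
To make the first step concrete, here is a minimal sketch of basic filtering: a URL blocklist check, fastText language identification, and a couple of simple quality heuristics. The blocklisted domains and the thresholds are illustrative placeholders, not FineWeb's actual values, and it assumes you've downloaded the fastText lid.176.bin language-ID model locally.

```python
# Sketch of basic web-document filtering (illustrative thresholds).
from urllib.parse import urlparse

import fasttext  # pip install fasttext; lid.176.bin downloaded separately

LANG_MODEL = fasttext.load_model("lid.176.bin")
BLOCKLISTED_DOMAINS = {"spam.example.com", "adult.example.net"}  # placeholder list

def keep_document(text: str, url: str,
                  min_lang_prob: float = 0.65,
                  min_words: int = 50,
                  max_mean_word_len: float = 10.0) -> bool:
    # 1. URL blocklist
    if urlparse(url).netloc in BLOCKLISTED_DOMAINS:
        return False
    # 2. Language detection: keep confidently-English documents only
    labels, probs = LANG_MODEL.predict(text.replace("\n", " "))
    if labels[0] != "__label__en" or probs[0] < min_lang_prob:
        return False
    # 3. Simple quality heuristics: enough words, sane average word length
    words = text.split()
    if len(words) < min_words:
        return False
    if sum(len(w) for w in words) / len(words) > max_mean_word_len:
        return False
    return True
```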
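
For the de-duplication step, here is a sketch of MinHash-based near-duplicate removal using the datasketch library, run separately for each crawl (i.e. de-duplicating within a crawl rather than across all crawls). The shingle size and similarity threshold are illustrative choices, not the pipeline's exact settings.

```python
# Sketch of per-crawl near-duplicate removal with MinHash LSH.
from datasketch import MinHash, MinHashLSH

def shingles(text: str, n: int = 5) -> set[str]:
    """Word n-gram shingles used as the document's fingerprint features."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def dedup_crawl(docs: dict[str, str], threshold: float = 0.8,
                num_perm: int = 128) -> list[str]:
    """Return the ids of documents to keep for a single crawl {doc_id: text}."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for doc_id, text in docs.items():
        mh = MinHash(num_perm=num_perm)
        for sh in shingles(text):
            mh.update(sh.encode("utf8"))
        if lsh.query(mh):   # near-duplicate of an already-kept document
            continue
        lsh.insert(doc_id, mh)
        kept.append(doc_id)
    return kept

# De-duplicate each crawl independently:
# kept_ids = {crawl: dedup_crawl(docs) for crawl, docs in crawls.items()}
```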
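
And for the LLM-assisted step: in practice, Llama 3 70B annotated a sample of documents, and a small classifier was trained on those annotations so scoring could scale to the full dataset. Below is a sketch of applying what I believe is the released classifier ("HuggingFaceFW/fineweb-edu-classifier") with transformers to score a document and apply the 3-or-higher cutoff.

```python
# Sketch of educational-quality scoring with the FineWeb-Edu classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "HuggingFaceFW/fineweb-edu-classifier"  # assumed Hub model id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

def edu_score(text: str) -> float:
    """Predict an educational-quality score on a roughly 0-5 scale."""
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze(-1).item()

def keep_for_fineweb_edu(text: str, threshold: float = 3.0) -> bool:
    # Retain only documents scoring 3 or higher, per the FineWeb-Edu cutoff.
    return edu_score(text) >= threshold
```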
Cheers, Ronan