FineWeb - A Finely Cleaned Common Crawl Dataset
Hugging Face has released "FineWeb-Edu" - a Common Crawl-derived dataset that has been finely filtered, including with the help of Llama 3 70B.
It's likely the best open source dataset for pre-training!
I'll take the opportunity to review the full pipeline for cleaning web data, including:
➡️ Basic filtering, e.g. URL blacklists, language detection, and quality heuristics (see the first sketch below).
➡️ De-duplication, so the model doesn't overfit to or overweight repeated data. De-duplicating within each web crawl, rather than across all crawls, seems to work best (see the MinHash sketch below).
➡️ LLM-assisted filtering, i.e. using a language model to score the educational quality of each text sample on a 0-5 scale, and retaining only documents scoring 3 or higher. This step is what takes the base FineWeb dataset to the FineWeb-Edu version and boosts downstream performance (see the classifier sketch below).
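
To make the first step concrete, here is a minimal sketch of basic filtering: a URL blocklist check, fastText language identification, and a couple of simple quality heuristics. The blocklisted domains and the thresholds are illustrative placeholders, not FineWeb's actual values, and it assumes you've downloaded the fastText lid.176.bin language-ID model locally.

```python
# Sketch of basic web-document filtering (illustrative thresholds).
from urllib.parse import urlparse

import fasttext  # pip install fasttext; lid.176.bin downloaded separately

LANG_MODEL = fasttext.load_model("lid.176.bin")
BLOCKLISTED_DOMAINS = {"spam.example.com", "adult.example.net"}  # placeholder list

def keep_document(text: str, url: str,
                  min_lang_prob: float = 0.65,
                  min_words: int = 50,
                  max_mean_word_len: float = 10.0) -> bool:
    # 1. URL blocklist
    if urlparse(url).netloc in BLOCKLISTED_DOMAINS:
        return False
    # 2. Language detection: keep confidently-English documents only
    labels, probs = LANG_MODEL.predict(text.replace("\n", " "))
    if labels[0] != "__label__en" or probs[0] < min_lang_prob:
        return False
    # 3. Simple quality heuristics: enough words, sane average word length
    words = text.split()
    if len(words) < min_words:
        return False
    if sum(len(w) for w in words) / len(words) > max_mean_word_len:
        return False
    return True
```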
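
For the de-duplication step, here is a sketch of MinHash-based near-duplicate removal using the datasketch library, run separately for each crawl (i.e. de-duplicating within a crawl rather than across all crawls). The shingle size and similarity threshold are illustrative choices, not the pipeline's exact settings.

```python
# Sketch of per-crawl near-duplicate removal with MinHash LSH.
from datasketch import MinHash, MinHashLSH

def shingles(text: str, n: int = 5) -> set[str]:
    """Word n-gram shingles used as the document's fingerprint features."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def dedup_crawl(docs: dict[str, str], threshold: float = 0.8,
                num_perm: int = 128) -> list[str]:
    """Return the ids of documents to keep for a single crawl {doc_id: text}."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for doc_id, text in docs.items():
        mh = MinHash(num_perm=num_perm)
        for sh in shingles(text):
            mh.update(sh.encode("utf8"))
        if lsh.query(mh):   # near-duplicate of an already-kept document
            continue
        lsh.insert(doc_id, mh)
        kept.append(doc_id)
    return kept

# De-duplicate each crawl independently:
# kept_ids = {crawl: dedup_crawl(docs) for crawl, docs in crawls.items()}
```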
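
And for the LLM-assisted step: in practice, Llama 3 70B annotated a sample of documents, and a small classifier was trained on those annotations so scoring could scale to the full dataset. Below is a sketch of applying what I believe is the released classifier ("HuggingFaceFW/fineweb-edu-classifier") with transformers to score a document and apply the 3-or-higher cutoff.

```python
# Sketch of educational-quality scoring with the FineWeb-Edu classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "HuggingFaceFW/fineweb-edu-classifier"  # assumed Hub model id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

def edu_score(text: str) -> float:
    """Predict an educational-quality score on a roughly 0-5 scale."""
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze(-1).item()

def keep_for_fineweb_edu(text: str, threshold: float = 3.0) -> bool:
    # Retain only documents scoring 3 or higher, per the FineWeb-Edu cutoff.
    return edu_score(text) >= threshold
```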
Cheers, Ronan