Common Crawl web data

29-01-2024

Chatted with an AI vendor last week and they mentioned this public dataset. I hadn't heard of it before, but apparently it underpins a lot of the public training data that goes into foundation models: https://commoncrawl.org/ and https://commoncrawl.org/get-started

© 2025 Luke Miloszewski
