
It is true. Datasets are somewhat cleaned, but only somewhat. When you have terabytes worth of text, there's only so much cleaning you can do economically.



