I keep hearing that LLMs are trained on "Internet crap", but is it true? For instance, we know from the Anthropic copyright case that they scanned millions of books to build a training set. They certainly use Internet content for training, but I'm sure it's curated to a large degree. They don't just scrape random pages and feed them into the LLM.
> I keep hearing that LLMs are trained on "Internet crap" but is it true?
Karpathy repeated this in a recent interview [0]: if you looked at random samples from the pretraining set, you'd mostly see garbage text. And it's very surprising that it works at all.
The labs have focused much more on finetuning (posttraining) and RL lately, and from my understanding that's where all the desirable properties of an LLM are trained into it. Pretraining just teaches the LLM the semantic relations it needs as the foundation for finetuning to work.
Pretraining teaches LLMs everything. SFT and RL are about putting that "everything" into useful configurations and gluing it together so that it works better.
It is true. Datasets are somewhat cleaned, but only somewhat. When you have terabytes' worth of text, there's only so much cleaning you can do economically.
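To give a sense of what "somewhat cleaned" means at terabyte scale: it's usually cheap per-document heuristics, not careful review. A minimal sketch of that kind of filter (all thresholds here are invented for illustration, loosely in the spirit of published quality-filter rules like Gopher's):

```python
def looks_like_garbage(doc: str) -> bool:
    """Cheap heuristic quality filter; thresholds are illustrative only."""
    words = doc.split()
    if len(words) < 50:  # too short to be useful prose
        return True
    mean_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_len <= 10):  # gibberish or markup soup
        return True
    alpha = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha < 0.6:  # mostly symbols, numbers, or whitespace
        return True
    # heavy line repetition, e.g. nav menus pasted over and over
    lines = [l for l in doc.splitlines() if l.strip()]
    if lines and len(set(lines)) / len(lines) < 0.5:
        return True
    return False

# keep only documents that pass every heuristic
corpus = ["the quick brown fox jumps over the lazy dog " * 20,
          "$$$ 123 !!! " * 200]
kept = [d for d in corpus if not looks_like_garbage(d)]
```

Each check is O(n) over the document, which is the point: at this scale you can afford a handful of linear passes, but not anything resembling human judgment, so plenty of junk slips through.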