How to train your data
The Vergecast
19 HOURS AGO
Shownote
Training data is the raw material of the AI industry. Claude, ChatGPT, Gemini, and the rest are built on top of oceans of stuff. What is that stuff? Books. Blog posts. YouTube videos. Reddit comments. All of it and more, in virtually incomprehensible quant...
Highlights
This podcast delves into the hidden world of AI training data, the vast and often controversial collection of text, video, and audio that powers models like ChatGPT and Claude. Staff writer Alex Reisner joins to reveal how companies acquire this material, why they are so secretive about it, and the ethical dilemmas surrounding the use of public content without permission.
Chapters
00:00 The Secret Fuel of AI: Why Training Data Matters More Than Code
02:30 How to Uncover a Hidden Dataset: Reverse-Engineering AI's Secret Sauce
05:17 From Academia to Profit: The Ethical Mess of 'Data Laundering'
11:11 YouTube's Gold Rush: Why Every AI Company is Breaking the Rules
17:20 The Synthetic Data Trap: Can AI Train Itself Without Collapsing?
Transcript
David Pierce: Hello and welcome to The Vergecast, the flagship podcast of music that sounds eerily, but not exactly, like other music. I'm your friend David Pierce, and today on the show we're talking about training data. It's the raw materials of everythin...