How to train your data

The Vergecast

19 HOURS AGO

Shownote

Training data is the raw material of the AI industry. Claude, ChatGPT, Gemini, and the rest are built on top of oceans of stuff. What is that stuff? Books. Blog posts. YouTube videos. Reddit comments. All of it and more, in virtually incomprehensible quant...

Highlights

This podcast delves into the hidden world of AI training data, the vast and often controversial collection of text, video, and audio that powers models like ChatGPT and Claude. Staff writer Alex Reisner joins to reveal how companies acquire this material, why they are so secretive about it, and the ethical dilemmas surrounding the use of public content without permission.
00:02 AI training data sources and impact discussed
02:30 Data determines a model's capabilities more than its architecture
05:17 Data secrecy is a competitive advantage
14:25 AI companies use a 'data laundering network' to scrape data
20:31 Data is the new oil, but nobody owns the well

Chapters

00:00 The Secret Fuel of AI: Why Training Data Matters More Than Code
02:30 How to Uncover a Hidden Dataset: Reverse-Engineering AI's Secret Sauce
05:17 From Academia to Profit: The Ethical Mess of 'Data Laundering'
11:11 YouTube's Gold Rush: Why Every AI Company is Breaking the Rules
17:20 The Synthetic Data Trap: Can AI Train Itself Without Collapsing?

Transcript

David Pierce: Hello and welcome to the Vergecast, the flagship podcast of music that sounds eerily, but not exactly like other music. I'm your friend David Pierce, and today on the show we're talking about training data. It's the raw materials of everythin...