scripod.com

How to train your data

The Vergecast

19 HOURS AGO

Shownote

Training data is the raw material of the AI industry. Claude, ChatGPT, Gemini, and the rest are built on top of oceans of stuff. What is that stuff? Books. Blog posts. YouTube videos. Reddit comments. All of it and more, in virtually incomprehensible quant...

Highlights

This podcast delves into the hidden world of AI training data, the vast and often controversial collection of text, video, and audio that powers models like ChatGPT and Claude. Staff writer Alex Reisner joins to reveal how companies acquire this material, why they are so secretive about it, and the ethical dilemmas surrounding the use of public content without permission.

00:02  AI training data sources and impact discussed

02:30  Data determines a model's capabilities more than its architecture

05:17  Data secrecy is a competitive advantage

14:25  AI companies use a 'data laundering network' to scrape data

20:31  Data is the new oil, but nobody owns the well

Chapters

00:00  The Secret Fuel of AI: Why Training Data Matters More Than Code

02:30  How to Uncover a Hidden Dataset: Reverse-Engineering AI's Secret Sauce

05:17  From Academia to Profit: The Ethical Mess of 'Data Laundering'

11:11  YouTube's Gold Rush: Why Every AI Company is Breaking the Rules

17:20  The Synthetic Data Trap: Can AI Train Itself Without Collapsing?

Transcript

David Pierce: Hello and welcome to the Vergecast, the flagship podcast of music that sounds eerily, but not exactly like other music. I'm your friend David Pierce, and today on the show we're talking about training data. It's the raw materials of everythin...