
A Technical History of Generative Media

In this episode, we hear from Gorkem and Batuhan of Fal.ai, a leading generative media inference platform that has rapidly scaled to serve 2 million developers, host 350 models, and achieve $100M ARR, recently backed by a $125M Series C. The conversation centers on their technical evolution, strategic pivots, and vision for the future of AI-generated images and video.
Fal.ai shifted from cloud Python runtimes to becoming a high-performance inference platform for generative media, a pivot catalyzed by Stable Diffusion 1.5's open release and community-driven model adoption. Leveraging deep CUDA and compiler expertise, they built a flexible inference engine with over 100 custom kernels, delivering roughly 10x speedups over self-hosting across diverse GPUs. Latency emerged as critical for user engagement, especially in image and video workflows, where responses cannot be streamed. Their infrastructure now spans six cloud providers and 24 data centers, managing 10,000+ H100s while optimizing for NVIDIA's Blackwell architecture.

Partnerships span open-source innovators like Black Forest Labs and closed-model developers such as PlayHT, alongside hosted models like Veo 3, with emphasis on real-time TTS, lip sync, and draft-mode video. LoRA fine-tuning, ComfyUI workflows, and post-training of video models are accelerating enterprise adoption, especially in advertising and startup marketing. The team prioritizes developer experience, open ecosystems, and targeted engineering over chasing foundational model size, hiring engineers deeply embedded in open-source communities and building tools like serverless ComfyUI and fal Workflows.
02:40
Model release days happen weekly and are the best part of the platform
04:58
Veo 3 created a usable text-to-video component
07:06
Chose not to compete in language models to avoid head-to-head rivalry with Google, OpenAI, and Anthropic
10:47
Optimizing Stable Diffusion 1.5 reduced inference time from 10 to 2 seconds
12:54
On average, a model on FAL runs 10x faster than self-hosting
15:01
Image responses can't be streamed like language model responses
15:52
Latency is critical for generative media user experience
17:57
They package the inference engine so clients can self-serve and get high performance without the engine's source code being exposed
18:47
Working with four major video companies and one undisclosed image company, a relationship that is commercially sensitive for them
19:02
FAL can scale up to thousands of GPUs instantly
20:07
FAL and PlayHT achieved deep collaboration to optimize inference and infrastructure for real-time text-to-speech
21:29
They built their own orchestration layer, distributed file system, and container runtimes to ensure fast cold starts and handle scale
22:30
A team is working with NVIDIA to write custom Blackwell kernels for diffusion transformers to make them cost-effective
23:53
Building ASICs as an alternative to NVIDIA GPUs doesn't make sense, given the diversity of diffusion workloads and the need for flexibility
25:02
Researchers prefer novel changes over iterative improvements like SDXL Lightning
26:10
A two-stage process—consistency models for drafting and real models for upscaling—improves image generation quality and control
27:40
Creators generate many videos at once and need to wait and iterate, so faster speeds are important
28:19
Anthropic's lack of an image generation model is due to its own priorities, not competitive disadvantage
29:50
Google used the term 'generative media' in its latest announcement, which is a win for the category
30:16
In the best case, controllable video models built on world models offer boundless possibilities in movies and games
33:59
Alibaba's updated video model runs draft mode in under five seconds and full 720p in 20 seconds
34:45
Training on single frames sampled from video data can yield a good text-to-image model
35:29
Training video models costs a couple of million dollars but can bring a lot of attention, especially compared to the highly competitive LLM space
36:44
Whether to make money from open-source models depends on a company's goals
38:04
The usage distribution of models follows a power-law but is not as extreme as expected and changes monthly
39:35
NSFW content is almost negligible, with moderation for illegal content and tracking of non-illegal NSFW content
40:48
Advertising, especially video advertising, is growing, while the claim of revolutionizing Hollywood filmmaking is considered less interesting
42:12
Generative technology is well suited to advertising: it allows unlimited ad creation, and more personalized ads carry greater economic value
42:49
In 6–12 months, 80–90% of viral video content could be AI-generated
44:07
Only open-source models have a rich LoRA ecosystem
45:41
Training a LoRA with 6–20 images for 1,000 steps can achieve 99% accuracy
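The reason a LoRA can be trained from a handful of images is that it updates only two small low-rank factors rather than the full weight matrices. A minimal NumPy sketch (hypothetical shapes; not fal's actual training code) of the low-rank update y = Wx + (alpha/r)·B(Ax):

```python
import numpy as np

# Hypothetical dimensions: a 768x768 frozen layer, adapter rank 16.
d_in, d_out, r, alpha = 768, 768, 16, 16

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, starts at zero

def lora_forward(x):
    # Base path plus scaled low-rank correction.
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params} of {full_params} "
      f"({lora_params / full_params:.1%})")
```

Because B starts at zero, the adapter initially leaves the base model's output unchanged, and only about 4% of the layer's parameters are ever trained, which is why a small dataset suffices.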
46:57
Many companies may focus on post-training open-source video models in the next six months to a year
47:18
As models improve, ComfyUI workflows for images are getting simpler, while those for video remain complex
49:28
Many startups are reinventing the wheel in AI data collection for image and video models
50:21
FAL could build an image dataset like Together AI did with RedPajama
52:31
State-of-the-art image models are cheap to train mainly due to data engineering, not algorithmic advances
53:34
Veo 3 can generalize and handle scenes, unlike post-trained models, which are good for conversations but lack generalization ability
53:47
Veo 3 has the most accurate lip sync of current models
55:11
Waiting for bigger models is a 'bitter lesson'
57:17
Those who can write a sparse attention kernel with BF16 on Blackwell should join FAL
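The hiring pitch above is about kernel engineering, but the underlying idea of sparse attention can be shown without GPU code. A toy NumPy illustration (not an optimized BF16 Blackwell kernel): attention scores are masked out block-by-block, so whole tiles of the score matrix never contribute.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(q, k, v, block, block_mask):
    # q, k, v: (T, d); block_mask: (T//block, T//block) boolean tile mask.
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    # Expand the tile mask to token resolution; masked tiles get -inf.
    tok_mask = np.kron(block_mask, np.ones((block, block), dtype=bool))
    scores = np.where(tok_mask, scores, -np.inf)
    return softmax(scores, axis=-1) @ v

T, d, block = 8, 4, 2
rng = np.random.default_rng(1)
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))

# With an all-True mask this reduces to ordinary dense attention.
dense_mask = np.ones((T // block, T // block), dtype=bool)
out = block_sparse_attention(q, k, v, block, dense_mask)
```

A real kernel gets its speedup by never loading or computing the masked tiles at all; this sketch only zeroes them after the fact, which is the correctness reference such kernels are tested against.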
58:23
The team has a high culture bar: its members love generative media and would work on it even if it weren't their job
58:44
Hired an applied ML engineer with a top Hugging Face Space, and another specializing in training LoRAs on FAL
59:30
A kernel benchmark is proposed to evaluate kernel stability and performance