
A Technical History of Generative Media

In this episode, we hear from Gorkem and Batuhan of Fal.ai, a leading generative media inference platform that has rapidly scaled to serve 2 million developers, host 350 models, and achieve $100M ARR, recently backed by a $125M Series C. The conversation centers on their technical evolution, strategic pivots, and vision for the future of AI-generated images and video.
Fal.ai shifted from cloud Python runtimes to becoming a high-performance inference platform for generative media, a pivot catalyzed by Stable Diffusion 1.5's open release and community-driven model adoption. Leveraging deep CUDA and compiler expertise, they built a flexible inference engine with over 100 custom kernels, delivering roughly 10x speedups over self-hosting across diverse GPUs. Latency emerged as critical for user engagement, especially in image and video workflows, where responses cannot be streamed. Their infrastructure now spans six cloud providers and 24 data centers, managing 10,000+ H100s while optimizing for NVIDIA's Blackwell architecture.

Partnerships span open-source innovators like Black Forest Labs and closed-model developers such as PlayHT, alongside hosted models like Veo 3, with emphasis on real-time TTS, lip sync, and draft-mode video. LoRA fine-tuning, ComfyUI workflows, and post-training of video models are accelerating enterprise adoption, especially in advertising and startup marketing. The team prioritizes developer experience, open ecosystems, and targeted engineering over chasing foundational model size, hiring engineers deeply embedded in open-source communities and building tools like serverless ComfyUI and fal Workflows.
02:40
Model release days happen weekly and are the best part of the platform
04:58
Veo 3 created a usable text-to-video component
07:06
Chose not to compete in language models to avoid head-to-head rivalry with Google, OpenAI, and Anthropic
10:47
Optimizing Stable Diffusion 1.5 reduced inference time from 10 to 2 seconds
12:54
On average, a model on FAL runs 10x faster than self-hosting
15:01
Image responses can't be streamed like language model responses
15:52
Latency is critical for generative media user experience
17:57
They package the inference engine so clients can self-serve and get high performance without the engine's source code being exposed
18:47
Working with four major video companies and one undisclosed image company, a relationship that is commercially sensitive for them
19:02
FAL can scale up to thousands of GPUs instantly
20:07
FAL and PlayHT achieved deep collaboration to optimize inference and infrastructure for real-time text-to-speech
21:29
They built their own orchestration layer, distributed file system, and container runtimes to ensure fast cold starts and handle scale
22:30
A team is working with NVIDIA to write custom Blackwell kernels for diffusion transformers to make them cost-effective
23:53
Building ASICs as an alternative to NVIDIA GPUs doesn't make sense, given the diversity of diffusion workloads and the need for flexibility
25:02
Researchers prefer novel changes over iterative improvements like SDXL Lightning
26:10
A two-stage process—consistency models for drafting and real models for upscaling—improves image generation quality and control
27:40
Creators generate many videos at once and need to wait and iterate, so faster speeds are important
28:19
Anthropic's lack of an image generation model is due to its own priorities, not competitive disadvantage
29:50
Google used the term 'generative media' in its latest announcement, which is a win for the category
30:16
In the best case, controllable video models built on world models offer boundless possibilities in movies and games
33:59
Alibaba's updated video model runs draft mode in under five seconds and full 720p in 20 seconds
34:45
Training on single frames sampled from video data can yield a good text-to-image model
35:29
Training video models costs a couple of million dollars but can bring a lot of attention, especially compared to the highly competitive LLM space
36:44
Whether to make money from open-source models depends on a company's goals
38:04
The usage distribution of models follows a power-law but is not as extreme as expected and changes monthly
39:35
NSFW content is almost negligible, with moderation for illegal content and tracking of non-illegal NSFW content
40:48
Advertising, especially video advertising, is growing, while the claim of revolutionizing Hollywood filmmaking is considered less interesting
42:12
Generative technology is well suited to advertising: it allows unlimited ad creation, and more personalized ads carry greater economic value
42:49
In 6–12 months, 80–90% of viral video content could be AI-generated
44:07
Only open-source models have a rich LoRA ecosystem
45:41
Training a LoRA with 6–20 images for 1,000 steps can achieve 99% accuracy
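The reason a LoRA can be trained from a handful of images is that it updates only two small low-rank factors rather than the full weight matrices. A minimal NumPy sketch (hypothetical shapes; not fal's actual training code) of the low-rank update y = Wx + (alpha/r)·B(Ax):

```python
import numpy as np

# Hypothetical dimensions: a 768x768 frozen layer, adapter rank 16.
d_in, d_out, r, alpha = 768, 768, 16, 16

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, starts at zero

def lora_forward(x):
    # Base path plus scaled low-rank correction.
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params} of {full_params} "
      f"({lora_params / full_params:.1%})")
```

Because B starts at zero, the adapter initially leaves the base model's output unchanged, and only about 4% of the layer's parameters are ever trained, which is why a small dataset suffices.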
46:57
Many companies may focus on post-training open-source video models in the next six months to a year
47:18
As models improve, ComfyUI workflows for images are getting simpler, while those for video remain complex
49:28
Many startups are reinventing the wheel in AI data collection for image and video models
50:21
FAL could build an image dataset like Together AI did with RedPajama
52:31
State-of-the-art image models are cheap to train mainly due to data engineering, not algorithmic advances
53:34
Veo 3 can generalize and handle scenes, unlike post-trained models, which are good for conversations but lack generalization ability
53:47
Veo 3 has the most accurate lip sync of current models
55:11
Waiting for bigger models is a 'bitter lesson'
57:17
Those who can write a sparse attention kernel with BF16 on Blackwell should join FAL
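The hiring pitch above is about kernel engineering, but the underlying idea of sparse attention can be shown without GPU code. A toy NumPy illustration (not an optimized BF16 Blackwell kernel): attention scores are masked out block-by-block, so whole tiles of the score matrix never contribute.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_attention(q, k, v, block, block_mask):
    # q, k, v: (T, d); block_mask: (T//block, T//block) boolean tile mask.
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    # Expand the tile mask to token resolution; masked tiles get -inf.
    tok_mask = np.kron(block_mask, np.ones((block, block), dtype=bool))
    scores = np.where(tok_mask, scores, -np.inf)
    return softmax(scores, axis=-1) @ v

T, d, block = 8, 4, 2
rng = np.random.default_rng(1)
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))

# With an all-True mask this reduces to ordinary dense attention.
dense_mask = np.ones((T // block, T // block), dtype=bool)
out = block_sparse_attention(q, k, v, block, dense_mask)
```

A real kernel gets its speedup by never loading or computing the masked tiles at all; this sketch only zeroes them after the fact, which is the correctness reference such kernels are tested against.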
58:23
The team has a high culture bar: its members love generative media and would work on it even if it weren't their job
58:44
Hired an applied ML engineer with a top Hugging Face Space, and another specializing in training LoRAs on FAL
59:30
A kernel benchmark is proposed to evaluate kernel stability and performance