
How Zyphra went all-in on AMD + Why Devs feel faster with AI but are slower — with Quentin Anthony

Show Notes

In this conversation with Quentin Anthony, Head of Model Training at Zyphra and advisor at EleutherAI, we explore the cutting edge of building foundation models on AMD hardware and the future of edge AI deployment. Quentin traces his journey from Oak Ridge National Lab's Frontier supercomputer to leading Zyphra's ambitious move to AMD MI300X GPUs, where the team achieves performance that beats NVIDIA H100s on certain workloads while dramatically reducing costs.

The discussion dives deep into the technical challenges of kernel development: Quentin explains why he often bypasses high-level frameworks like Triton to write directly in ROCm, or even GPU assembly when necessary. He also describes how Zyphra's hybrid transformer-Mamba models like Zamba2 can match Llama 3 8B performance at 7B parameters, optimized specifically for edge deployment across a spectrum from 1.2B models for phones to 7B for desktops.

Quentin candidly discusses his experience in the controversial METR software engineering productivity study, where he was one of the few developers who showed a measurable speedup from AI tools. He shares practical advice on avoiding the "slot machine effect" of endlessly re-prompting models, staying aware of context rot, and why he prefers direct API access over tools like Cursor so he keeps complete control over the model's context.

The conversation also covers the state of open-source AI research, with Quentin arguing that siloed, focused teams with guaranteed funding produce better results than grand collaborative efforts. He explains why kernel datasets alone won't solve the GPU programming problem, why evaluating kernel quality is hard, and why companies should invest more in ecosystem development than in traditional marketing.
https://www.linkedin.com/in/quentin-anthony/
https://www.zyphra.com/post/zamba2-7b

Key Topics:
• AMD MI300X advantages: 192GB VRAM, superior memory bandwidth
• Writing kernels from PTX/AMD GCN assembly up through CUDA/ROCm
• Hybrid attention-Mamba architectures and optimal sparsity ratios
• The METR productivity study: achieving a positive AI speedup
• Context rot and why shorter conversations beat long threads
• Why physicists make great ML engineers ("embryonic stem cells")
• Edge deployment strategies from phones to local clusters
• The future of on-device vs cloud inference routing
• EleutherAI's focus on interpretability with fully open pipelines
• Building velocity-focused teams over position-based hiring
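The MI300X advantage mentioned above (more VRAM and higher memory bandwidth than an H100) can be made concrete with a back-of-the-envelope roofline estimate for a memory-bound kernel. The datasheet figures below are public specs (MI300X: 192 GB HBM3, ~5.3 TB/s; H100 SXM: 80 GB HBM3, ~3.35 TB/s); the sketch itself is illustrative and not from the episode.

```python
# Back-of-the-envelope roofline estimate for a memory-bound op
# (e.g. an elementwise add in dense transformer training).
# Bandwidth/VRAM numbers are public datasheet figures; the rest is illustrative.

SPECS = {
    "MI300X": {"vram_gb": 192, "bandwidth_tbs": 5.3},
    "H100-SXM": {"vram_gb": 80, "bandwidth_tbs": 3.35},
}

def memory_bound_time_us(num_elements: int, bytes_per_element: int,
                         bandwidth_tbs: float, tensors_touched: int = 3) -> float:
    """Time to stream a kernel's memory traffic at peak bandwidth.

    tensors_touched=3 models out = a + b: read a, read b, write out.
    """
    traffic_bytes = num_elements * bytes_per_element * tensors_touched
    return traffic_bytes / (bandwidth_tbs * 1e12) * 1e6  # seconds -> microseconds

# A bf16 activation-sized add (batch 8, seq 4096, hidden 4096):
n = 8 * 4096 * 4096
for name, spec in SPECS.items():
    print(f"{name}: {memory_bound_time_us(n, 2, spec['bandwidth_tbs']):.1f} us")
```

For ops like this, where arithmetic intensity is too low to hide memory latency, the bandwidth ratio directly bounds the speedup, which is one way to read the "beats H100 on certain workloads" claim.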

Highlights

In this episode, we dive into the technical and strategic decisions behind building high-performance AI models on alternative hardware, as Quentin Anthony shares insights from his work at Zyphra and EleutherAI. From rethinking GPU ecosystems to pioneering edge-optimized architectures, the conversation reveals how low-level engineering choices are shaping the future of accessible, efficient AI deployment.
02:27
MI300X can beat H100 in non-FP8 dense transformers due to high VRAM and memory bandwidth
05:08
Instead of using Triton, we write kernels directly in ROCm and expose them through PyTorch.
12:29
Kernel datasets could improve model training but are not a silver bullet due to validation challenges.
19:38
Training inference-efficient models while considering future hardware compatibility.
26:51
High-quality models are preferred over faster local inference when there's a performance trade-off.
29:11
AI speeds up work only in specific cases, and only with good digital-hygiene practices.
45:23
Blindly trusting AI in development can lead to high-cost mistakes.
47:25
Physicists are the 'embryonic stem cells' of engineers due to their problem-solving adaptability
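The 29:11 point about digital hygiene ties into context rot: model output quality degrades as a conversation thread grows, which is why shorter conversations beat long threads. A minimal sketch of one mitigation, keeping the model's context trimmed to a token budget, might look like the following; the message format and the 4-characters-per-token heuristic are hypothetical illustrations, not any specific API.

```python
# Minimal sketch of context-rot mitigation: keep only the system prompt
# plus the most recent messages that fit within a token budget.
# Message dicts and the chars-per-token heuristic are illustrative.

def estimate_tokens(text: str) -> int:
    """Crude token estimate (~4 characters per token for English text)."""
    return max(1, len(text) // 4)

def trim_context(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Keep a leading system message, then as many of the most recent
    messages as fit within budget_tokens, preserving chronological order."""
    system = [m for m in messages[:1] if m["role"] == "system"]
    rest = messages[len(system):]
    kept, used = [], sum(estimate_tokens(m["content"]) for m in system)
    for msg in reversed(rest):  # walk newest-first
        cost = estimate_tokens(msg["content"])
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))  # restore chronological order

history = [
    {"role": "system", "content": "You are a terse kernel-tuning assistant."},
    {"role": "user", "content": "Why is my HIP kernel memory-bound?" * 10},
    {"role": "assistant", "content": "Profile it first." * 20},
    {"role": "user", "content": "How do I raise occupancy?"},
]
trimmed = trim_context(history, budget_tokens=40)
print([m["role"] for m in trimmed])
```

Starting a fresh thread is the blunt version of the same idea: drop stale context rather than letting it crowd out the current task.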

Chapters

How Zyphra beat H100s with AMD—and why nobody noticed
00:00
Why coding from scratch beats high-level tools in kernel development
05:08
What makes GPU programming so hard—and when you can’t avoid it
10:02
When model changes force custom kernels: the edge efficiency trade-off
17:12
From GPUs to ASICs: will specialized hardware dominate inference?
22:04
Did AI really make developers faster? Lessons from the METR study
29:11
How to use AI without losing control of your code or context
36:02
Hiring thinkers, not titles: building teams for speed and curiosity
47:25

Transcript

Alessio: Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, founder of Kernel Labs, and we have back in the studio Quentin Anthony from Zyphra and EleutherAI. Welcome back.

Quentin Anthony: Thanks for having me back, Alessio. It's grea...