How Zyphra went all-in on AMD + Why Devs feel faster with AI but are slower — with Quentin Anthony
Show Notes
In this conversation with Quentin Anthony, Head of Model Training at Zyphra and advisor at EleutherAI, we explore the cutting-edge world of building foundation models on AMD hardware and the future of edge AI deployment. Quentin shares his journey from working on Oak Ridge National Lab's Frontier supercomputer to leading Zyphra's ambitious move to AMD MI300X GPUs, where they're achieving performance that beats NVIDIA H100s on certain workloads while dramatically reducing costs.

The discussion dives deep into the technical challenges of kernel development, with Quentin explaining why he often bypasses high-level frameworks like Triton to write directly in ROCm or even GPU assembly when necessary. He reveals how Zyphra's hybrid transformer-Mamba models like Zamba2 can match Llama 3 8B performance at 7B parameters, optimized specifically for edge deployment across a spectrum from 1.2B models for phones to 7B for desktops.

Quentin candidly discusses his experience in the controversial Menlo software engineering productivity study, where he was one of the few developers who showed measurable speedup from AI tools. He shares practical insights on avoiding the "slot machine effect" of endlessly prompting models, the importance of context rot awareness, and why he prefers direct API access over tools like Cursor to maintain complete control over model context.

The conversation also covers the state of open source AI research, with Quentin arguing that siloed, focused teams with guaranteed funding produce better results than grand collaborative efforts. He explains why kernel datasets alone won't solve the GPU programming problem, the challenges of evaluating kernel quality, and why companies should invest more in ecosystem development rather than traditional marketing.
https://www.linkedin.com/in/quentin-anthony/
https://www.zyphra.com/post/zamba2-7b
Key Topics:

• AMD MI300X advantages: 192GB VRAM, superior memory bandwidth
• Writing kernels from PTX/AMD GCN assembly up through CUDA/ROCm
• Hybrid attention-Mamba architectures and optimal sparsity ratios
• The Menlo productivity study: achieving positive AI speedup
• Context rot and why shorter conversations beat long threads
• Why physicists make great ML engineers ("embryonic stem cells")
• Edge deployment strategies from phones to local clusters
• The future of on-device vs cloud inference routing
• EleutherAI's focus on interpretability with fully open pipelines
• Building velocity-focused teams over position-based hiring
Highlights
In this episode, we dive into the technical and strategic decisions behind building high-performance AI models on alternative hardware, as Quentin Anthony shares insights from his work at Zyphra and EleutherAI. From rethinking GPU ecosystems to pioneering edge-optimized architectures, the conversation reveals how low-level engineering choices are shaping the future of accessible, efficient AI deployment.
Chapters
00:00 How Zyphra beat H100s with AMD—and why nobody noticed
05:08 Why coding from scratch beats high-level tools in kernel development
10:02 What makes GPU programming so hard—and when you can't avoid it
17:12 When model changes force custom kernels: the edge efficiency trade-off
22:04 From GPUs to ASICs: will specialized hardware dominate inference?
29:11 Did AI really make developers faster? Lessons from the Menlo study
36:02 How to use AI without losing control of your code or context
47:25 Hiring thinkers, not titles: building teams for speed and curiosity

Transcript
Alessio: Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, founder of Kernel Labs, and we have back in the studio Quentin Anthony from Zyphra and EleutherAI. Welcome back.
Quentin Anthony: Thanks for having me back, Alessio. It's grea...
