
Information Theory for Language Models: Jack Morris

Shownote

Our last AI PhD grad student feature was Shunyu Yao, who happened to focus on Language Agents for his thesis and immediately went to work on them for OpenAI. Our pick this year is Jack Morris, who bucks the "hot" trends by -not- working on agents, benchmarks, or VS Code forks, and is instead known for his work on the information-theoretic understanding of LLMs, starting from embedding models and latent space representations (always close to our heart). Jack is an unusual combination: he does underrated research but is somehow still able to explain it well to a mass audience. So we felt this was a good opportunity to do a different kind of episode, going through the greatest hits of a high-profile AI PhD and relating them to questions from AI Engineering.

Papers and References

- made AI grad school: https://x.com/jxmnop/status/1933884519557353716
- A new type of information theory: https://x.com/jxmnop/status/1904238408899101014

Embeddings

- Text Embeddings Reveal (Almost) As Much As Text: https://arxiv.org/abs/2310.06816
- Contextual document embeddings: https://arxiv.org/abs/2410.02525
- Harnessing the Universal Geometry of Embeddings: https://arxiv.org/abs/2505.12540

Language models

- GPT-style language models memorize 3.6 bits per param: https://x.com/jxmnop/status/1929903028372459909

LLM Inversion

- Approximating Language Model Training Data from Weights: https://arxiv.org/abs/2506.15553
- https://x.com/jxmnop/status/1936044666371146076

"There Are No New Ideas In AI.... Only New Datasets"

- https://x.com/jxmnop/status/1910087098570338756
- https://blog.jxmo.io/p/there-are-no-new-ideas-in-ai-only

misc reference: https://junyanz.github.io/CycleGAN/

For others hiring AI PhDs, Jack also wanted to shout out Zach Nussbaum, his coauthor on Nomic Embed: Training a Reproducible Long Context Text Embedder.

Highlights

In this episode, we sit down with Jack Morris, a PhD student at Cornell Tech whose research focuses on the information-theoretic foundations of large language models. Unlike many of his peers who focus on trending topics like AI agents or benchmarking, Jack delves into the deeper mechanics of how models store and process information. His work spans embeddings, model inversion, and the surprising role of datasets in driving AI innovation. This conversation offers a unique window into some of the most underappreciated yet critical aspects of modern AI research.
10:54
Mojo is positioned as a faster alternative to CUDA, developed by Chris Lattner.
22:25
Training a model on text lets you measure how much information it can store.
27:49
Achieved 97% accuracy in recovering text from embeddings after iterative improvements.
47:34
Gemma 3n enables stackable and swappable capabilities in language models.
53:04
GPT-style models store around 3.6–3.9 bits of information per parameter.
1:06:49
In AI, paradigm shifts often come from training existing techniques on new datasets, not just from new methods.
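As a back-of-envelope illustration of the 3.6 bits-per-parameter figure above: if total memorization capacity scales roughly linearly with parameter count (our simplifying assumption for this sketch, not a claim from the episode), a model's capacity in bytes is easy to estimate.

```python
# Rough capacity estimate implied by the ~3.6 bits/parameter figure.
# Assumption (ours, for illustration only): total memorization capacity
# scales linearly with parameter count.

BITS_PER_PARAM = 3.6  # lower end of the 3.6-3.9 range quoted in the episode

def memorization_capacity_bytes(n_params: float,
                                bits_per_param: float = BITS_PER_PARAM) -> float:
    """Total information a model of this size could memorize, in bytes."""
    return n_params * bits_per_param / 8  # 8 bits per byte

# A hypothetical 1B-parameter model:
capacity = memorization_capacity_bytes(1e9)
print(f"{capacity / 1e6:.0f} MB of memorized information")  # 450 MB
```

At this rate, memorization capacity is a small fraction of what the raw weights occupy on disk (a 1B-parameter model in 16-bit precision is ~2 GB), which is part of why these models must generalize rather than store their training data verbatim.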

Chapters

How did Jack Morris begin exploring AI, and what challenges shape today’s researchers?
00:00
What does information theory reveal about how language models store knowledge?
13:32
Can text be recovered from embeddings? The science behind embedding inversion
25:13
Do different models learn similar representations? Exploring embedding universality
38:53
How much data can a language model really remember?
53:04
Are new ideas in AI really new — or do breakthroughs come from better data?
1:03:57

Transcript

swyx: Hello, this is Latent Space. It's swyx today with our special guest, Jack Morris. A guest from Columbia, that's your affiliation right now? Jack Morris: Cornell. It's actually confusing because I'm in the New York City outpost of Cornell. So you ha...