scripod.com

SAM 3: The Eyes for AI — Nikhila & Pengchuan (Meta Superintelligence), ft. Joseph Nelson (Roboflow)

SAM 3 is the latest step in computer vision: a model that lets machines find and follow visual content from natural-language prompts. In this discussion, researchers and practitioners explore how it identifies and tracks objects across images and video with high precision and speed, without relying on traditional annotation methods.
SAM 3 marks a major leap in visual AI by enabling concept-based segmentation: a simple phrase like 'yellow school bus' is enough to detect and track every matching instance across images and video. It runs in 30 ms per image on an H200 GPU and scales efficiently for real-time video.
A new benchmark, SACO, covers over 200,000 unique concepts, far surpassing prior limits, while an AI-powered data engine cuts annotation time from minutes to seconds using fine-tuned Llama verifiers.
The model introduces two key innovations: a presence token that separates object recognition (is the concept in the image at all?) from localization (where is it?), and a decoupled detector-tracker design that preserves object identity accurately in video.
Integrated with multimodal LLMs like Gemini, SAM 3 acts as a visual reasoning tool, solving tasks such as distinguishing gender or comparing sizes. Fine-tuning with minimal examples adapts it to domains like medical imaging or autonomous driving.
Through Roboflow, over 106 million smart polygons have been generated, saving more than 130 years of manual labeling across critical applications including cancer research and environmental cleanup.
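To make the presence-token idea concrete, here is a minimal sketch of how separating recognition from localization might look at scoring time. The summary only says the two are separated; the multiplicative combination, the `Candidate` type, the `gate_by_presence` function, and the 0.5 threshold are all assumptions for illustration, not SAM 3's actual API.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical detection candidate (not a SAM 3 type)."""
    label: str
    match_score: float  # localization: how well this region fits the prompt, in [0, 1]


def gate_by_presence(presence_prob, candidates, threshold=0.5):
    """Keep candidates whose presence-weighted score clears the threshold.

    presence_prob answers "is the concept in the image at all?" once per image;
    match_score answers "is it this region?" once per candidate.
    Multiplying the two is an assumed way to combine them.
    """
    kept = []
    for cand in candidates:
        final_score = presence_prob * cand.match_score
        if final_score >= threshold:
            kept.append((cand.label, round(final_score, 3)))
    return kept


candidates = [Candidate("bus_1", 0.8), Candidate("bus_2", 0.4)]

# Concept is likely present: only the strong local match survives.
likely_present = gate_by_presence(0.9, candidates)

# Concept is probably absent: every candidate is suppressed,
# even the one that looks like a good match locally.
likely_absent = gate_by_presence(0.1, candidates)
```

Under these assumptions, a low image-level presence score suppresses confident-looking regions, which is one way false positives on absent concepts could be reduced.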
08:55
SAM 3 runs in 30 ms per image on an H200 GPU, enabling real-time performance.
20:04
Just 3–5 negative examples significantly improve SAM 3's fine-tuning performance.
29:37
LLMs correct SAM 3's errors and supply the complex reasoning that vision-only models lack.
51:42
SAM 3 can correctly count fingers in an image where frontier multimodal models fail.
54:39
SAM 3 benefits from open-source community contributions and real-world testing.
1:00:30
SAM 3 enables scalable video analysis for robotics using open text prompts.
1:11:45
Roboflow aims to be the top platform for building with SAM 3 and advancing AI applications.