The Physical AI Data Bottleneck Has Shifted
- Nexus Data Strategy

- Jun 11

Spotted in the wild on the streets of New York during NY Tech Week. The robots are already here.
Spent last week across four physical AI events at NY Tech Week. A few things that stayed with me.
The pre-training data problem is largely being solved. Synthetic data and egocentric video collected at scale, predominantly out of India right now, are abundant and becoming commoditized fast. The barrier to entry is low and too many players are chasing the same layer.

Volumes, 3D and 4D Data: World Models, Robotics and Physical AI, NY Tech Week, June 2026
The post-training and mid-training data problem is not being solved. This is where the real bottleneck is shifting. Getting a robot from the lab to a live production environment requires a fundamentally different quality of data: high-precision, synchronized, multimodal, task-specific. NVIDIA's EgoScale research makes the mechanics precise. The mid-training layer is small in volume but extremely high in precision, and it does not yet exist at scale.
The From Prototype to Production panel made this tangible. Move a robot across the street and you are in a fundamentally different environment. The unknown unknowns multiply the moment you leave your training conditions. Getting to 90 percent in the lab is achievable. Getting to the 99 percent that production deployment actually requires is a different problem entirely. And until you solve the post-training layer, the deployment data flywheel, where robots in production generate their own training signal at scale, cannot be unlocked.
High-dexterity hand and manipulation data is the most acute expression of this gap. The corpus simply does not exist at the required precision.

From Prototype to Production: The Reality of Deploying Robotics and Embodied AI, hosted by New York Robotics, NY Tech Week, June 2026. Evan Beard (CEO, Standard Bots), Pim de Witte (CEO, General Intuition), Oliver Ortlieb (CTO, Ultra), and Josh Merel (CTO, Fauna Robotics). Curated and led by Ilir Aliu, founder of 22Astronauts.
The data gap is one of several structural problems slowing physical AI deployment in the US. The Humans, Robots and The Factory Floor panel named the others: CapEx requirements, cultural resistance, a shortage of trained integrators, and absent national standards. China is currently producing roughly ten times as many robots as the US. These problems compound each other.

Humans, Robots and The Factory Floor, hosted by New York Robotics and the American Manufacturing Futures Institute, NY Tech Week, June 2026. Panelists: Michael Perry, Persona AI; Zach Tomkinson, Standard Bots; Paul Lavoie, University of New Haven. Moderator: Stacey Weismiller, AMFI.
I have watched versions of this problem from both sides.
On the data supply side, in alternative data for hedge funds, the market went through a full cycle. Scramble, then oversupply, then a focus on proprietary and enriched datasets, then consolidation as prices compressed and the market could not support the sheer number of players. The ones who survived were the ones with genuinely unique, high-quality data.
On the buyer side at Opendoor, the problem looked different but the underlying dynamic was identical. Getting to 90 percent accuracy on home pricing was achievable with large-volume, cheap MLS data. The long tail was not. A shower installed in the kitchen. A neighbor nobody wants to live next to. Noise pollution from adjacent infrastructure. A smell. Proximity to something that quietly kills a home's value. These edge cases are things you cannot simulate and cannot anticipate until you encounter them in the real world. With Opendoor, the stakes were high: we were assuming ownership of a home. With robotics the stakes are much higher, because robots interact autonomously with real people.
Physical AI faces exactly the same problem. General egocentric data handles the known distribution reasonably well. What it cannot capture is physical correctness. Visual plausibility is not the same thing as physical reality. A model trained on video knows what a grip looks like. It does not know the mass of the object, the friction of the surface, the pressure required, or how those properties change across the long tail of real-world situations. Those properties only reveal themselves in deployment, in the specific domain, with the specific objects, under real operating conditions. That is the post-training data problem in one sentence.

Tyler Raciti, Co-Founder, Volumes. 3D and 4D Data: World Models, Robotics and Physical AI, NY Tech Week, June 2026.
The bottleneck is shifting from pre-training volume toward post-training precision. The players who build supply infrastructure for that layer now will have a structural advantage as the market matures.
That is where Nexus Data Strategy focuses.
A special thanks to Jacob Hennessey-Rubin and New York Robotics for consistently bringing together the people actually building in this space for real, substantive conversations.
#PhysicalAI #Robotics #AIData #TrainingData #EmbodiedAI #DataStrategy #RoboticsSummit #FoundationModels #ManufacturingAI
If you're working on data acquisition for physical AI, or sitting on operational or sensor data you haven't yet commercialized, I'd like to talk. hugh@nexusdatastrategy.com | nexusdatastrategy.com
Nexus helps AI teams source real enterprise data for specific AI use cases, faster and without legal or sourcing dead ends. We work with physical AI companies, robotics teams, and foundation model labs on the buy side, and with enterprises sitting on proprietary operational, sensor, and behavioral data on the supply side. Start with a free feasibility screen.


