
Robot Futures #1 – why your dirty socks are preventing the future


Chris Middleton, February 10, 2026
Summary:
Super-intelligent, general-purpose humanoid robots are within our grasp, says the hype. Not so fast, says reality. Allow me to explain.



NVIDIA CEO Jensen Huang recently claimed that the challenges of designing and training humanoid robots can all be solved by simulation. He’s wrong, as I will explain. But first, let’s zoom out and look at the big picture, before we zoom back in and start picking apart the hype.

It used to be the case that there were six adults of working age for every retired person. Then that ratio became four to one. And now there are just two people working for every one who has left the workforce. So said Ken Goldberg last year; Goldberg is Professor and Chair of the Industrial Engineering and Operations Research Department at the University of California, Berkeley, among several prestigious posts.

Other assessments differ, but this is the demographic timebomb that is ticking beneath many countries, such as the UK, Japan, and parts of Europe. All saw a postwar population spike: the Baby Boomers and Generation X.

Today, birth rates are declining in many nations. So, as people born in the Nineteen Sixties, Seventies, and Eighties hit retirement age – with access to better healthcare, medicines, and diets than their forebears – it stands to reason that there will be escalating crises in social care, healthcare, and manual labour. Not only will there be insufficient workers to look after ageing populations, load delivery trucks, and perform all manner of service functions, but there may also be no desire to do that work among those who are physically able.

This is one reason why postwar technology optimists have been working towards a long-held dream: the creation of a safe, reliable, intelligent, dextrous, general-purpose human equivalent – a humanoid robot, as opposed to an industrial device that is designed to perform the same tasks repeatedly, in situ, on production lines.

But logical though the creation of an intelligent general-purpose robot may be, there is a problem with achieving that vision. Speaking at the AI for Good (AI4G) Global Summit in Geneva in 2025, Goldberg explained:

We all have to move things, we have to make things, and we have to maintain things. In particular, we want to maintain ourselves and our parents and grandparents. So, we are going to need robots.

And we're seeing these videos: they're here and they're doing things, right? Well, sort of. If you watch their hands, the hands are not really doing that much. In fact, and I know this is going to disappoint some people, robots are extremely clumsy. We’re trying to fix this, but we are not doing that well.

Paradoxical thinking

One reason for this is Moravec’s Paradox – the challenge expressed by computer scientist Hans Moravec, adjunct faculty member at the Robotics Institute of Carnegie Mellon University in the US. In 1988, Moravec wrote:

It is comparatively easy to make computers exhibit adult-level performance on intelligence tests or playing checkers, [but] difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.

Moravec’s thinking was criticised at the time by others in the cognitive and computer sciences. Some believed it was a myth, and others that the challenge would be overcome in the decades ahead by Moore’s Law – in effect, by the agglomeration of greater and greater processing power.

However, as it applies to intelligent robots and physical AIs, the Paradox turns out to be true. As Goldberg put it in 2025:

We have incredible dexterity, we take it for granted. You can hand us any kind of object, and we can pick it up, no problem. But robots? They're not very capable.

The robot data gap

The problem is not that artificial hands are not good enough – though developing sophisticated, dextrous hands is a big challenge in humanoid robot design. Nor is it that we lack fast processors – the El Capitan supercomputer, for example, has a performance of 1.7 exaflops (1.7 quintillion calculations a second). It is that we lack the relevant data.

Boiling Moravec’s Paradox down to an aphorism, it can be expressed thus: what a human finds easy to do, a robot finds difficult – and vice versa. More poetically, it is easier for a robot to explain the Renaissance, the history of the Bauhaus, or Einstein’s theories of gravity than for it to sort your dirty socks.

The reason for this counterintuitive idea is simple: there are reams of textual data – books, research reports, academic papers, scholarly articles, and essays – about the Renaissance, the Bauhaus design school, and relativity, but there is very little data about how to sort your socks. At least, no data that is useful to an AI-enabled robot, as those actions must exist in the physical world, and not in the theoretical.

An AI/robot could tell you what a sock is, because somewhere in the billions of webpages that its programmers scraped to train a Large Language Model would be that concept. But recognising your socks – in all their different colours, designs, orientations, and physical states – and picking them out from a pile of laundry demands a huge mass of data.

But where is that data? And where is all the data needed to train robots to carry out thousands of other tasks, too, involving countless different objects, scenarios, locations, contexts, and other variables? The answer is: not on the internet.

For years the running joke about Britain’s Doctor Who TV series was that fascistic mutants the Daleks could be stopped by a simple staircase. But who knew that, in the real world, robots might one day be defeated by your socks? Joking aside, the problem is this: we need robots to do our dirty work. But if they can’t, then AIs will have to do all our creative work instead, while we empty our elders’ bedpans.

Believing that an intelligent robot would, in some way, be able to work these problems out for itself is to misunderstand the nature of machine intelligence. Goldberg explained:

You hear a lot of people saying, large data solved the computer vision problem, based on the billions of images available online. And large data solved language, with ChatGPT and all those variations that you can talk to, based on trillions of words, sentences, and documents online. And therefore, large data will also solve robotics.

But my question is, when? With the internet, you could argue that one of its great purposes was to collect vast amounts of data. But we don't have that data for training robots: it's not on the internet. So, where is it going to come from?

A good question. Indeed, it raises another: is the mass automation of creative tasks by AI vendors really an acknowledgement that, however well designed and engineered they may be, robots just aren’t good enough yet?

But I digress. Goldberg set out the scale of the data-collecting task necessary to fill the training void:

We have collected approximately a year of robot data [data about physical tasks that would take a human one year to read]. But the amount of other data, i.e. text and images, is 100,000 years.

This problem is called the Robot Data Gap. We have a mass of data in those other two categories, 2D images and text, but we don't have an equivalent amount of data to train robots to understand the physical world.

So, how best to close that 100,000-year gap?

Is simulation the answer?

One option is simulation: training virtual robots in simulated environments first, before unleashing their physical counterparts in the complex human world of hot, cold, safe, danger, cause, effect, and consequence.

This is where NVIDIA’s much-feted CEO comes into this discussion. Speaking on 3 February, Huang said:

This is the future. This is where the era we're in now, using digital design, virtual twins of the products. They're going to be built by virtual twins of robots operating in virtual twins of factories. And all of this will be designed digitally in tools. And they will be designed, validated, operated, and all done [sic] inside these virtual twins running on top of Nvidia sim [simulations]. This is just the way it's going to be done going forward.

Yet as Goldberg explained last year, simulation alone is just not up to the task of training robots to perform complex manual tasks in unpredictable environments. However, it may be enough for simpler, cruder applications in controlled environments – for example, gross manipulation and box shifting.

So, simulation is adequate for some tasks, explained Goldberg:

It works well for flying robots, for UAVs [Unmanned Aerial Vehicles] and drones, because you can simulate them easily. You can take that simulated data and run it on real robots, and it works.

It also turns out that you can simulate robots walking very well. When you take that simulation, you can run it on real robots, which is why you're seeing all these amazing tricks: the acrobatics, the backflips, the robots boxing. Those are major results, but that's the body moving, not the hands.

So, it turns out that simulation does not work well for manipulation. We've been working on it, but we can't get it to work via simulation. It's just too approximate. There are too many errors, so simulation is not going to work.

Goldberg is right. A robot that can approximately take a blood test, approximately administer a drug, approximately cut up nuclear waste, approximately extinguish a fire, approximately clean a piece of delicate glass, and approximately lift a disabled patient out of bed would not be a viable machine. It would be a dangerous one.

What about video?

There are billions of videos on YouTube alone, with up to 500 hours of content uploaded every minute (by some estimates). So, can’t a robot be trained on that?

Again, this is to misunderstand the nature of the challenge when it comes to building robots’ model of the physical world, and then letting them move around in it, interact with people, and manipulate complex objects. As Goldberg explained, “The problem with videos is they're two-dimensional. So, I have images, but I don't have the three-dimensional structure. And that's the robot data I need.”

Being able to extract or accurately infer 3D data from 2D sources would go a long way to filling the Robot Data Gap. But Moravec’s Paradox kicks in yet again: things that humans find easy and instinctive – for example, telling the difference between a large object that is far away and a small object up close (the ‘Father Ted Problem’, perhaps) – are things a computer struggles with. Similarly, is that a photo of a car on a billboard, or a real car hurtling towards you? A human knows, but a robot does not.

In the physical world, such barriers can be overcome with sensors and technologies such as LiDAR (Light Detection and Ranging, which sends a pulse of laser light to accurately measure distances), but at the massive-data training stage, it is difficult.
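
The time-of-flight principle behind LiDAR is simple enough to sketch: a laser pulse travels to the target and back, so the one-way distance is half the round trip multiplied by the speed of light. A minimal illustration (the function name and the 100-nanosecond example are mine, purely for demonstration):

```python
# Time-of-flight ranging, the principle behind LiDAR: a laser
# pulse travels to the target and back, so the one-way distance
# is half the round-trip time multiplied by the speed of light.

C = 299_792_458.0  # speed of light in a vacuum, m/s

def lidar_range_m(round_trip_seconds: float) -> float:
    """Distance to the target, from a pulse's round-trip time."""
    return C * round_trip_seconds / 2.0

# A pulse that returns after 100 nanoseconds indicates a target
# roughly 15 metres away.
print(round(lidar_range_m(100e-9), 2))  # → 14.99
```

Because light covers about 30 centimetres per nanosecond, sub-metre accuracy demands timing electronics that resolve fractions of a nanosecond – one reason LiDAR units remain costly sensors rather than commodity parts.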

That said, GPU and AI giant NVIDIA is one of the many companies working towards solutions. Its Cosmos product links World Foundation Models (WFMs) with behavioural guardrails and data processing libraries, with a view to accelerating the development of autonomous robots and driverless vehicles (plus AI agents that can analyse video).

What is a World Foundation Model?

A World Foundation Model is trained on data about the physical world. The purpose is to build an internal representation of that world for robots, one that incorporates physics (matter, energy, spacetime, and how each behaves), and predicts the possible outcomes of actions. In this way, it helps a robot learn about the physical environment in which it will work and plan its next move within it.
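
The core idea – predicting the outcome of an action before taking it – can be illustrated with a toy sketch. Note the hedge: a real World Foundation Model is learned from masses of data, not hand-coded like this; the state, actions, and numbers below are invented purely to show the predict-then-plan pattern:

```python
# Toy illustration of the world-model idea: given the current
# state of a held object and a candidate action, predict the next
# state using simple physics, so a planner can compare the likely
# outcomes of actions before committing to one.
# (Illustrative only - real World Foundation Models are trained
# on real-world data, not hand-coded rules like these.)

from dataclasses import dataclass

G = 9.81  # gravitational acceleration, m/s^2

@dataclass
class State:
    height_m: float     # height of the object above the floor
    velocity_ms: float  # vertical velocity (negative = falling)

def predict(state: State, action: str, dt: float = 0.1) -> State:
    """Predict the state dt seconds after taking `action`."""
    if action == "hold":
        # The gripper keeps the object still.
        return State(state.height_m, 0.0)
    if action == "release":
        # The object free-falls under gravity.
        v = state.velocity_ms - G * dt
        h = max(0.0, state.height_m + v * dt)
        return State(h, v if h > 0 else 0.0)
    raise ValueError(f"unknown action: {action}")

glass = State(height_m=1.0, velocity_ms=0.0)
print(predict(glass, "hold").height_m)     # unchanged: still 1.0
print(predict(glass, "release").height_m)  # lower: it is falling
```

A planner comparing those two predicted futures can rule out ‘release’ before any glass hits the floor – which is the whole point of giving a robot an internal model of the world.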

In robotics and physical AI, there are several related disciplines, which include Visual Language Models (VLMs), Visual Language Action Models (VLAs) and Large Behaviour Models, all of which are visual reasoning systems. A VLM attempts to link words (i.e. verbal instructions) with objects, a VLA adds actions to that mix, while a Large Behaviour Model, like a WFM, enables a robot to predict the likely outcome of an action from a mass of real-world data.

In NVIDIA’s case, the Cosmos Predict element of its product aims to predict dynamic environments’ future states, which it does by being able to generate up to thirty seconds of video from multimodal prompts; the Cosmos Transfer component helps the model convert 3D simulations into high-fidelity video; and the Cosmos Reason element leverages all that knowledge within the real world – or tries to.

Despite this, the core challenge remains: is there enough data about the real world to train these models in the first place? And where might it come from? Goldberg explained:

Cosmos can predict the next frames of a video or generate a new video, and that's very impressive. But three-dimensional structure – understanding the motions of objects in space, the dynamics – that is still not solved. We don't know how to do that from videos yet.

Of course, he said this in 2025, and the science is fast improving.

What about teleoperation?

In the meantime, logically the most accurate way to gather data from the 3D physical world – plus the fourth dimension, time – is to take data directly from it. That brings us to teleoperation: directly controlling the robot at a distance (assuming the distance is not so large it creates a communications timelag). In many cases, the human operator is nearby, using a wired connection.

We generally take teleoperation to mean a human controlling a robot via a haptic interface (smart gloves, for example), and via virtual, augmented, or extended reality (VR, AR, or XR) headsets. In this way, a robot becomes a human’s physical avatar, mimicking the operator’s movements exactly, as though it were a mechatronic puppet. In return, the operator sees what the robot sees, via the headset or smart glasses.

However, were all the data from those teleoperated actions added to the robot’s world model over time, then it would be a viable means of training it to carry out certain tasks and to handle some objects.

But the flaws of teleoperation as a means of training a robot should be obvious: first, it is slow: a single task being taught in real time in one location; second, it is boring for the operator (a significant downside for any company that offers teleoperation as a service); and third, generalising that data is risky and could lead to dangerous approximations.

In the latter case, let’s say you have used teleoperation to train a robot, in one lab, to handle a beaker containing a volatile chemical. Would you now trust it to, autonomously, pick up any container of any substance in any location? Hopefully, the answer is no!

Goldberg acknowledged all this when he said:

The human is basically a puppeteer getting the robot to do these things over and over again. It's collecting data, but it’s very tedious. My students do this and, after a couple of hours, they're like ‘I'm done, I'm not doing this anymore.’

It’s slow and it’s painful, but it is being done. Companies are doing it – Physical Intelligence has been set up – and maybe hundreds of people are now training a robot to, say, fold laundry by standing there and operating it like a puppet.

The data flywheel approach

So, if simulation, video, and teleoperation – singly or combined – are insufficient to overcome Moravec’s Paradox and fill the Robot Data Gap quickly enough to build a viable, intelligent, and (above all) trustworthy robot, then what is the answer?

Goldberg proposed what he called the data flywheel approach, which (to paraphrase his lengthy explanation) is creating a data feedback loop to reduce the 100,000-year data gap to something more manageable – to ten or twenty years, perhaps.

But what does that mean? As he explained it, it means first coming up with a relatively simple, commercial, cost-effective robot, something that people both want and will use for simple, repeatable tasks. The data gathered from that product – in every location where it is used – would then be recycled to develop a more sophisticated machine, which could perform more complex, but still repeatable tasks. And so on.

This perpetual feedback process would continue until, somewhere down the line, the mass of accumulated data feeds back, via the flywheel effect, into something like a general-purpose robot.

When reality dawns

But that brings us to an entirely different set of problems. First, that process of constant, circular innovation – that flywheel of iterative improvement over many years – demands one thing: patient capital, rather than a quick return on sunk costs.

Unfortunately for the world’s Big Techs, AI behemoths, and well-capitalised start-ups, social media pressures have excited investors with the promise of near-instant success for humanoid machines. So not only is there a colossal data gap in robotics – one largely hidden from the public – but there is also an expectation gap. One that a certain type of CEO has created for no other reason than it increases their share prices and guarantees social shares.

In short, people want the sophisticated, superintelligent robots they have been promised. But here’s the problem: disappointment can exhibit a flywheel effect too.

All of which brings us to problem number two. At this point, the binary, on/off, ‘if this, then that’ world of computer science butts up against things that are much harder to model than picking up a box in a warehouse: human beings’ emotions. Add to that the century of sci-fi and the millennia of storytelling that technologists are drawing on to excite consumers – a public that is already exhausted by relentless disruption, change, and frequent disappointment.

My take

Technologists are right that humans have a dual nature: the instinctive versus the reasoned, as Moravec understood. Over the centuries, that duality has been expressed in many ways: the id and the ego, the emotional and the rational, the Apollonian and the Dionysian, and so on.

As businesspeople, technologists have long understood that, despite our tools, cathedrals, suspension bridges, spaceships, symphonies, poems, paintings, and theories about quantum gravity, on a species level humans are still pleasure-seeking primates. Press button, get banana.

So, offer us ‘free’ stuff – instant music and movies, effort-free art, and cognition with a click, all trained on a mass of stolen data – and of course many people take it. And many will choose filling their waistline over reading The Waste Land, and TikTok over Tarkovsky, too.

But you can only sell us a $20k mechatronic puppet that serves no useful purpose once.

Goldberg said it himself: 

This is a path to get us to those general-purpose robots that we've been waiting for. But please be patient, okay? We're going to get you your robots, but it's not going to happen soon. The robot that can do everything, that you can have in your house to be your butler, that's not coming very soon, okay?
 

Image credit - Pixabay
