Robot futures - from data centers to shopping centers as retail analytics helps robots understand human behavior (without crushing us!)
Summary:
- Data about shoppers’ movements in stores and malls are an unexpected goldmine for the training of humanoid robots, reveals an ex-NASA engineer.
So, you’ve built your robot’s World Foundation Model and taught it about physics, cause, and effect. You’ve trained your humanoid in simulated environments and used teleoperation and video to complement that process. And you’ve built your Visual Language Action Model, so it can link words with actions, and perhaps added a Large Behaviour Model so it can infer probable outcomes. And finally, you’ve geo-fenced your robot and enabled it to warn other robots of its presence and intentions.
So, now you can unleash it on the human world, and everything will be fine, right?
Wrong. And that’s because you now need to do something subtler but, in many ways, just as important: teach your robot how people move around in real physical spaces, so you can figure out how to “keep it from smushing humans”, in the words of one company’s spokesman.
That company is Standard AI, a data analytics specialist that helps clients in the retail business work out how shoppers use physical spaces. For example, where do they stop in a large department store, and why? What do they look at, what attracts them, and how do they behave when they see it? As the company says on its website, “The data you’ve been missing is walking through your store.”
But what has all this got to do with robots?
The answer is that Standard AI has realised that its massive, labelled data set of real-world environments – showing how people move around in 21,000 of them – is a goldmine for developers of physical AI. At present, robots are generally trained in unrealistic environments: in virtual spaces or large, static warehouses which lack vital information about the human behaviours that machines will encounter in the real world.
The company’s Standard Labs division publishes open datasets from its long-running research studies in stores and shopping malls, capturing anonymised human actions from overhead video cameras: interactions in spaces that are shared by hundreds of milling shoppers. Those datasets form behavioural world models, in effect, which can help robots anticipate human goals and actions at scale.
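To make that idea a little more concrete, here is a minimal, hypothetical sketch of how a dataset of anonymised overhead tracks could act as a motion prior: given a short history of a shopper’s positions, predict the next step by looking up what people who moved similarly did next. The function names, the eight-frame history, and the nearest-neighbour approach are illustrative assumptions, not a description of Standard AI’s actual system.

```python
# Illustrative only: a toy motion prior built from anonymised overhead tracks.
# This is NOT Standard AI's pipeline; names, shapes, and parameters are assumptions.
import numpy as np

HISTORY = 8  # frames of observed history used as the query


def build_prior(tracks):
    """Slice recorded tracks into (history, next-step) training pairs.

    tracks: list of arrays of shape (T, 2) -- anonymised (x, y) positions
            sampled at a fixed frame rate from overhead cameras.
    """
    histories, next_steps = [], []
    for track in tracks:
        for t in range(len(track) - HISTORY):
            window = track[t : t + HISTORY]
            # Work in displacements so the prior is invariant to where in the
            # store the behaviour happened.
            histories.append(np.diff(window, axis=0).ravel())
            next_steps.append(track[t + HISTORY] - track[t + HISTORY - 1])
    return np.array(histories), np.array(next_steps)


def predict_next(prior, observed, k=20):
    """Predict the next displacement for a live track by averaging the
    continuations of its k most similar recorded histories -- in effect,
    'people who moved like this usually went there next'."""
    histories, next_steps = prior
    query = np.diff(np.asarray(observed)[-HISTORY:], axis=0).ravel()
    dists = np.linalg.norm(histories - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return next_steps[nearest].mean(axis=0)
```

A real behavioural world model would be learned, multi-agent, and goal-conditioned, but even this toy version shows why a large, labelled corpus of real shopper trajectories matters: the prediction is only as good as the genuine human behaviour the prior has seen.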
As its Chief Technology Officer, ex-NASA engineer Dave Woollard has swapped outer space for retail space, exploring people’s orbits within it. He says:
I started my career at NASA’s Jet Propulsion Lab while finishing my PhD in Computer Science. And the JPL is absolutely all the things that you can imagine: all your childhood notions of what space, robots, and the rest are like.
But at the same time, it's also just the real practice of engineering, and trying to coordinate a lot of people to build things that are incredibly complex. At the time, I was primarily focused on high-performance computing and how to structure large-scale software systems. But I was working with some brilliant scientists who were domain experts in Earth sensing systems: the data processing pipelines for instruments that would be put onto satellites. We had to build pipelines that were processing millions, or hundreds of millions, of data points, terabytes of data at a time when terabytes were scary!
Checking out a big idea
So, how did Woollard journey from data centers to shopping centers? What brought him back down to Earth?
Standard AI started in an area that was much hyped a few years ago but has since largely been abandoned: autonomous checkout. Not the self-scan systems with which most of us are familiar, but that short-lived experiment in which stores scanned shoppers’ baskets as they walked out. Woollard laughs at the memory:
There is certainly a reason we're not doing it anymore.
He doesn’t say it, but the experiment was (predictably) abused by thieves, shoplifters, and opportunists for whom the technology offered plausible deniability of their crimes. But building it in the first place demanded some “really accurate models of how people moved through a physical space”, and how they interacted with and manipulated objects. He explains:
Ultimately every one of those autonomous stores was more of a technology showcase than it was a practical next step for retail – it was never about eliminating humans from shopping spaces. I’m a big fan of bricks-and-mortar stores and I always have been. But the byproduct of that was an interesting data set about human movement trajectories in physical spaces. So, we now understand those spaces very well, and the structure and goals associated with shopping. It’s all about human motion ‘priors’ and making robots more ‘legible’ to humans in those spaces, and humans more legible to robots.
But what does that mean? Woollard explains:
To make an inaccurate autonomous driving analogy, that car’s behaviour must be legible – understandable and predictable – to human drivers, as well as to other autonomous cars. When you have robots moving away from legibility and into controlled safety regimes, you get all sorts of problems: unpredictable behaviours that limit people’s acceptance of robots. In other words, they're not operating within a broader, shared understanding of the humans who are also using that space.
And we’re working towards better predictions of human movement, which is the inverse of that problem. So, how do we teach robots to anticipate human movement better, and vice versa? All that is necessary for us to reach a point where robots can actually assist us in the real world. We build 3D digital twins of human movements in physical spaces. That term's a little overused, but it’s essentially what we do. Most World Foundation Models have, arguably, done relatively little to help robots understand human actors. And visual foundation models, while they can be very impressive, also suffer from context limitations. They make for a great thirty-second demo, but an atrocious computational problem.
He concludes:
If today's reality is that we must teach robots not to crush humans, then tomorrow's will eventually be: how do we teach them just to interact with us?
On that subject, how near to reality is the long-promised future of intelligent, dextrous, general-purpose robots? Like most professionals working at the cutting edge of robots’ software, data, engineering, and training, Woollard offers a sober and philosophical view – a welcome antidote to the futurist claptrap and impatient capital that are so counterproductive in this sector:
A lot of it is hype. I have a six-year-old daughter. I'm pretty sure that I'll see some decently amazing capabilities in humanoid robotics in my lifetime. But I'm a lot surer that, in her lifetime, it's going to be a lot more achievable. Of course, a lot of folks in the market are incentivised to say that advanced humanoid robots are only a couple of years away, but the Robot Data Gap is real.
And I would point to a lot of different problems that are more on a systems level. Accurate manipulation is absolutely still a problem, and I think that to make a real shared space possible, where humans and robots can collaborate, navigation really isn't solved either. At least, not if your goal is legibility – where humans can understand and anticipate a robot’s movements, and vice versa.
Then he adds:
But will we have Rosie from The Jetsons in everyone's home? That’s doubtful to me, even in that timescale.
My take
More fascinating insights from a little-understood industry at a time when it risks being overwhelmed by hype. And here is someone within the industry talking, without irony, about The Jetsons TV show from his youth: a cartoon satire about runaway consumerism.