Mark Collier, Executive Director of the PyTorch Foundation, has spent his career finding the infrastructure beneath major technology shifts – and then making sure it stays open. In a one-to-one conversation at KubeCon Europe 2026 in Amsterdam, he made the case for why PyTorch is the most consequential open source project in AI right now, and why the governance of that dependency deserves far more enterprise attention than it gets.
He co-founded OpenStack in 2010, when cloud computing needed an open alternative to proprietary infrastructure, and spent over a decade building its governance and ecosystem. The pattern at PyTorch is the same: foundational software the entire industry depends on, originating inside a single company and needing a neutral home before the commercial stakes make neutrality impossible. His move to lead the PyTorch Foundation, which sits under the Linux Foundation umbrella, reflects a judgment that AI has reached that point.
Every time a new GPU comes out and Jensen gets up on stage and says here's the new Nvidia hardware – the path to market to make those expensive, scarce GPUs do anything useful for AI is PyTorch. You have to. They cannot ship a chip that doesn't work well with PyTorch.
That is not an exaggeration. Open-weight models are trained with PyTorch. Proprietary labs that do not open source their models still use PyTorch. The billions of dollars flowing into specialized AI accelerators have a single open source dependency in common. His light-bulb moment came when he looked closely at the foundation's testing infrastructure – new versions of PyTorch being validated daily, at massive scale, across clouds full of high-end GPUs.
That's the layer of abstraction. So if you want to have alternative architectures – like these new accelerators that are coming out of different startups – PyTorch is that layer that unlocks that for more options.
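That abstraction is visible in everyday PyTorch code: the model and the training step stay the same, and only the device handle changes with the hardware underneath. A minimal sketch – the backend checks shown are just the common built-in ones; new accelerators typically register their own device backends in the same way:

```python
import torch

# PyTorch hides the accelerator behind a device handle; the same model
# and training step run unchanged on whichever backend is present.
device = (
    "cuda" if torch.cuda.is_available()              # Nvidia GPUs
    else "mps" if torch.backends.mps.is_available()  # Apple silicon
    else "cpu"
)

model = torch.nn.Linear(128, 10).to(device)
x = torch.randn(32, 128, device=device)
loss = model(x).sum()
loss.backward()  # autograd behaves identically on every backend
```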
Three pillars, one foundation
Collier describes the AI landscape as built on three pillars: training, inference, and agents. Training creates the model. Inference serves it. Agents call it. The third pillar is, in his words, "less defined and more hyped up right now," but the first two are where the PyTorch Foundation's scope is firmly grounded – and where it has recently expanded.
When the Foundation became an umbrella organization in May 2025, vLLM and DeepSpeed were the first two projects to join, with Ray following in October 2025. The four projects – PyTorch, vLLM, DeepSpeed, and Ray – now span the critical layers of training, inference, and distributed compute. vLLM in particular has become the dominant open source inference serving engine: the software that runs a model so it can respond to prompts from users, applications, or, increasingly, agents. Inference is more complex than it might appear, and it is growing faster than training as deployed AI workloads scale.
Worth noting separately: llm-d, a Kubernetes-native inference orchestration layer built on top of vLLM, entered the Cloud Native Computing Foundation (CNCF) sandbox at this same KubeCon – a sign of how quickly the inference stack is maturing and expanding into distinct layers, each with its own governance home.
vLLM is similar to PyTorch in that it's adopted more widely than any other piece of software if you're trying to serve the models for inference.
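For a concrete sense of what an inference serving engine does, here is a minimal sketch of vLLM's offline Python API – the model name is illustrative, and any Hugging Face-format checkpoint would do:

```python
from vllm import LLM, SamplingParams

# Load a model into the engine (the name shown is just an example).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What does an inference engine do?"], params)

for out in outputs:
    print(out.outputs[0].text)
```

In production the same engine typically runs behind `vllm serve`, which exposes an OpenAI-compatible HTTP API – the interface that layers like llm-d then orchestrate across a Kubernetes cluster.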
Why enterprises should care about who controls the stack
Vendor lock-in came up repeatedly across KubeCon Europe 2026 – in platform engineering sessions, in the hyperscaler keynotes, and in the hallways. For AI specifically, Collier's take has two dimensions worth separating out.
One form of lock-in that rarely gets discussed is the risk of a world where all AI runs through a small number of proprietary model APIs. He does not dismiss those providers – they will continue to push the frontier and people will use them – but he is clear that they should not be the whole story.
One of the forms of lock-in that I think people should be concerned about is if we end up in a world where all AI comes through a small number of proprietary APIs. That would not be a great world.
The practical alternative is the ability to train your own models. During his keynote fireside chat at KubeCon, Collier used Uber – whose engineering team also presented a keynote earlier in the week – as a worked example of what that looks like in practice. Uber trains thousands of its own models, and the stack it relies on illustrates the cross-foundation dependency he is describing: PyTorch for training, Ray for distributed computing, vLLM for inference – all PyTorch Foundation projects – running on Kubernetes, which lives in CNCF. The newest CNCF project, llm-d, is tightly integrated with vLLM. On stage, Collier advocated for deliberate cross-community design:
We have to get out of our silos and tribes that deliver just pieces for users to figure out, and actually do co-design and co-evolve the full stack to deliver what the world needs.
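To make that composition concrete, here is a minimal sketch of the Ray piece of such a stack – the `train_shard` function is hypothetical, standing in for a PyTorch training step over one shard of the data:

```python
import ray

ray.init()  # starts a local cluster; in production this attaches to a shared one

@ray.remote  # a real training task would also request resources, e.g. num_gpus=1
def train_shard(shard_id: int) -> float:
    # Stand-in for a PyTorch training step over one data shard.
    return float(shard_id)  # e.g. that shard's training loss

# Fan the shards out across the cluster, then gather the results.
losses = ray.get([train_shard.remote(i) for i in range(8)])
print(losses)
```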
Nvidia's dominance in AI compute is real and well-earned. But the competitive landscape is moving – AWS Trainium, Google TPU, Cerebras, and a string of startups are creating more options. PyTorch and vLLM are what make those options viable in practice rather than just on a spec sheet:
The competition is coming to give people more choices, and both PyTorch as a layer of enablement and vLLM for inference are going to be super important to make that happen in real life, and not just in the lab.
What it means that Meta built it
PyTorch originated at Meta, which still accounts for a substantial portion of its development. The governance challenge that creates is real, and Collier addresses it directly.
The trademark is held by the foundation – not Meta, not any single company. That creates a contractual guarantee that no one can unilaterally change the license, declare an enterprise version, or otherwise pull the rug out from under the community. Technical leadership roles are earned through contribution, not purchased through membership fees.
You can't buy your way into being a contributor on the project, which is a very important line that is always drawn.
Diversification of contributors is real but gradual, as it always is with projects that originate inside a single company. You cannot switch overnight from a project dominated by one company's engineers to a broadly distributed contributor base – the subject matter expertise sits with the people who built it. What you can do is create the conditions – transparent governance, open processes, neutral trademark – that allow trust to accumulate and contributions from other organizations to follow.
It just wouldn't be as adopted as it is if people weren't comfortable that it was neutral and that it was not something that could be ripped away from them.
The evidence that it is working is in the adoption numbers. When every new frontier model – open weight or closed – ships with PyTorch support on day one, that is not loyalty to Meta. It is a judgment by the entire industry that PyTorch is the standard.
Where agents fit in
These days, conversations about agentic AI are inescapable. Collier's view is grounding rather than expansive, and he starts with a definition that cuts through a lot of the hype:
If you look at an agentic system at a simplistic level, it's just kind of like a loop that's running, and it's essentially like calling a model that was probably trained with PyTorch, and it's calling it through an inference infrastructure that's probably powered by vLLM.
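Taken literally, that description is only a few lines of code. A minimal sketch, assuming a vLLM server started with `vllm serve` and reachable at a hypothetical local endpoint through its OpenAI-compatible API:

```python
import requests

URL = "http://localhost:8000/v1/chat/completions"  # assumed local vLLM server

messages = [{"role": "user", "content": "Plan the next step toward the goal."}]

for _ in range(5):  # the "loop that's running", capped for this sketch
    resp = requests.post(URL, json={"model": "my-model", "messages": messages})
    reply = resp.json()["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    # A real agent would parse the reply for a tool call, execute it, and
    # append the observation before calling the model again.
    messages.append({"role": "user", "content": "Observation: (tool result)"})
```

Every pass through that loop is another inference request, which is exactly why agent traffic changes the load profile Collier describes below.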
In that light, agents are not something outside the foundation's scope – they are the foundation's scope, exercised in a new pattern. What changes as agents proliferate is the pressure on inference. Cloud infrastructure has historically been built around human-paced requests – one user, one action, predictable load. Agents break that assumption entirely:
Historically, when you think about cloud native, for the most part, you're thinking the application is being driven by a person, and now you have this idea of the end user might be an agent, and it might not call the app or the API once. It may call it 100,000 times. And so you get all kinds of weird behaviors, and it changes how you monitor the infrastructure and things like that.
My take
PyTorch is probably the most important piece of open source software most enterprise technology leaders have never had a governance conversation about. Every GPU launch, every major model release – somewhere in the stack, there it is. The industry knows this implicitly. What is less common is seeing that dependency named, scrutinized, and taken seriously as an infrastructure decision in its own right.
Understanding who governs PyTorch, and under what principles, is due diligence for any organization making serious AI infrastructure decisions. It belongs in the same conversation as cloud vendor selection or Kubernetes distribution choice, and it almost never gets there. Hardware dependency on a single chip vendor is the risk everyone talks about – model dependency on a small number of proprietary APIs is the one fewer people are thinking through carefully, and PyTorch is the open source answer to both.
The governance question around Meta's contribution is legitimate and worth watching. The structural answers – a neutral trademark, transparent governance, technical leadership earned through contribution rather than membership fees – are in place. But structures are only as good as the pressure they are tested under, and the commercial stakes around AI infrastructure are not getting smaller. The adoption numbers are the most honest signal available: when every major model ships with PyTorch support on day one, that is the industry voting with its dependencies.