
The real bottleneck in AI? Your data estate

Alyx MacQueen, May 28, 2025
Summary:
What if the real AI problem isn’t the model – but the data? Databricks EMEA CTO Dael Williamson explains why scientific discipline, not scale, will separate the winners from the noise in organizations saturated with data.

AI technology icon over the network connection © LaymanZoom - Canva.com

If Artificial Intelligence is the shiny new engine powering digital transformation, enterprise data is the fuel – and it turns out most companies are still running on fumes. That was the grounded message from Dael Williamson, EMEA CTO of Databricks, when we spoke last week over Zoom.

Williamson is no stranger to complexity. With a background in drug discovery and decades spent navigating dense, unstructured datasets, he sees the current wave of AI innovation not as a technological leap, but as a systems-level stress test – one that many enterprises are failing badly.

Our conversation ranged from agentic AI to data governance, from synthetic data to intellectual debt. But the sharpest warning he issued was this – most enterprises are sprinting into AI with data infrastructure built for a different era.

We started the conversation by discussing whether we've reached "peak data" in the enterprise. Not because the world has no more data, but because high-quality, well-classified, and discoverable data is now scarcer than the compute power needed to process it. Williamson explains:

Most enterprises have no idea what data they have. If I ask them what data assets they have under management – like financial or human assets – they often can’t tell me. That’s the scale of the challenge.

The peak data paradox

At a macro level, Williamson likens today’s AI landscape to a car industry running on puddle water. Referencing the compute capabilities of modern AI systems, he said: 

We’ve built Ferraris, but we’re trying to run them on whatever liquid we can find.

This moment of "peak data" isn’t about volume. It’s about value. Enterprises still sit on troves of first-party data that remain untapped, unclassified, and, in many cases, even undiscovered. From warehouses of paper records to digital archives of unlabelled video footage, the disconnect between ambition and data maturity is widening.

Williamson draws a useful analogy – imagine trying to manage physical or financial assets without knowing what you own. He elaborates: 

Most enterprises don’t know what data assets they have under management. 

It’s like decluttering your attic after 35 years – except your attic is a server farm. The only way to get through it is to use AI to help automate the labeling and indexing.

Databricks positions itself not just as a data platform, but as a partner in what Williamson calls "corporate archaeology" – helping companies unearth, classify, and ultimately activate their long-buried assets. This is the foundation, he argues, for data intelligence – not just managing data, but understanding and enriching it with metadata, lineage, and context.

Crucially, AI is now part of the solution as well as the challenge. Enterprises can and should use AI to accelerate their own data readiness. Techniques like batch inference allow businesses to pipe vast volumes of unstructured data through models that label, organize, and help surface high-value subsets. Williamson notes:

One of the best ways to build an AI capability is to use AI to get your data house in order.
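To make that concrete, here is a minimal Python sketch of the batch-inference pattern Williamson describes – piping a backlog of unstructured documents through a model that labels them so high-value subsets become discoverable. The call_model stub, the taxonomy, and the file layout are illustrative assumptions, not Databricks APIs.

```python
# Batch-inference sketch: label a backlog of unstructured documents
# so high-value subsets become discoverable.
# `call_model`, the taxonomy, and the paths are illustrative stubs.
import json
from pathlib import Path

TAXONOMY = ["contract", "financial_report", "hr_record", "marketing", "other"]

def call_model(prompt: str) -> str:
    # Stand-in for a real LLM endpoint so the sketch runs end to end;
    # replace with your provider's client call.
    return json.dumps({"label": "other", "keywords": []})

def label_document(text: str) -> dict:
    prompt = (
        f"Classify this document as one of {TAXONOMY} and list up to "
        '5 keywords. Answer as JSON {"label": ..., "keywords": [...]}.\n\n'
        + text[:4000]  # truncate to stay within a model's context window
    )
    return json.loads(call_model(prompt))

def run_batch(source_dir: str, catalog_path: str) -> None:
    catalog = []
    for path in Path(source_dir).glob("**/*.txt"):
        record = label_document(path.read_text(errors="ignore"))
        record["path"] = str(path)
        catalog.append(record)
    Path(catalog_path).write_text(json.dumps(catalog, indent=2))

# run_batch("raw_archive/", "data_catalog.json")
```

At enterprise scale the same shape would run as a parallel job over governed tables rather than a local file walk, but the principle holds: the model sits inside the pipeline, and a searchable catalog comes out.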

The hidden value of AI exhaust

Williamson’s most compelling insight may be his call to rethink AI byproducts as fuel. He describes the massive volume of trace data, correction logs, and feedback loops that AI systems generate in operation – what he terms "AI exhaust."

Rather than discarding it, Williamson advocates capturing, refining, and transforming this exhaust into synthetic data sets. These can help fine-tune models, detect bias, and even fill in historical gaps. He observes:

There are companies with 20 years of sustainability reports, but no underlying datasets. With batch inference, you can generate structured data from these narratives and use that to recreate or supplement lost historical detail.
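A hedged sketch of what that recovery step might look like in Python – the metrics schema and the call_model stub are assumptions made for illustration, not a Databricks API:

```python
# Sketch: recover structured sustainability metrics from narrative
# reports via batch inference. Schema and `call_model` are stubs.
import json

SCHEMA = {
    "scope1_emissions_tonnes": "float or null",
    "scope2_emissions_tonnes": "float or null",
    "renewable_energy_pct": "float or null",
}

def call_model(prompt: str) -> str:
    # Stand-in for a real LLM call; returns an empty record so the
    # sketch runs without external services.
    return json.dumps({key: None for key in SCHEMA})

def extract_metrics(report_text: str, year: int) -> dict:
    prompt = (
        f"Extract JSON with fields {SCHEMA} from this sustainability "
        "report. Use null where a value is not stated.\n\n"
        + report_text[:6000]
    )
    record = json.loads(call_model(prompt))
    record["year"] = year
    return record

# rows = [extract_metrics(text, year) for year, text in archived_reports]
```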

Synthetic data, in this framing, becomes a bridge over incomplete records and a mechanism for model governance. It’s also an emerging form of IP creation, especially when models are fine-tuned on proprietary enterprise data. Williamson adds:

A general model is like a pub quiz champion – it’s great at trivia, but not trustworthy for expert advice. You want a model that’s a master of one thing, not a jack-of-all-answers.

Agentic AI – clarity through constraint

The market has been flooded with claims about AI agents, multi-agent systems, and autonomous orchestration. For Williamson, precision begins with purpose – and he offers the most straightforward definition of agentic AI I've heard:

The best way to describe an agent is – an autonomous compound system with agency over a task.

Databricks is focusing on narrowly scoped agents with specific functions and tight guardrails. Rather than unleashing generalized multi-agent chaos, the emphasis is on building trustworthy, explainable systems with end-to-end traceability. These compound systems often include both generative and deterministic elements – one to act, and another to inspect.
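As an illustration of that "one to act, another to inspect" pattern, here is a minimal Python sketch of a narrowly scoped compound agent. The task, tool whitelist, and generate stub are assumptions for the example – this is not Databricks' agent tooling.

```python
# Compound-agent sketch: a generative component proposes an action,
# a deterministic inspector validates it before anything executes.
# `generate` stands in for any LLM planner (an assumption here).
from dataclasses import dataclass

@dataclass
class Action:
    tool: str
    argument: str

ALLOWED_TOOLS = {"lookup_invoice", "summarize_contract"}  # tight scope

def generate(task: str) -> Action:
    # Stand-in for a generative model planning its next step.
    return Action(tool="lookup_invoice", argument=task)

def inspect(action: Action) -> tuple[bool, str]:
    # Deterministic guardrail: whitelisted tools, bounded input only.
    if action.tool not in ALLOWED_TOOLS:
        return False, f"tool {action.tool!r} is outside the agent's scope"
    if len(action.argument) > 200:
        return False, "argument exceeds allowed length"
    return True, "ok"

def run_agent(task: str) -> str:
    action = generate(task)
    approved, reason = inspect(action)
    if not approved:
        return f"blocked: {reason}"  # traceable refusal, not silent failure
    return f"executing {action.tool}({action.argument!r})"

print(run_agent("invoice INV-1042"))
```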

This design philosophy draws from scientific method as much as software engineering: build models for action, models for evaluation, and frameworks for continuous monitoring. With open-source projects like MLflow and partnerships with labs like Anthropic, Databricks is investing in "mechanistic interpretability" – efforts to scan the “neural pathways” of a model much like an MRI reads the human brain.
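For the traceability piece, MLflow's tracking API can give each agent run an auditable record. A minimal sketch – the groundedness score here is a hypothetical metric you would compute with your own evaluator model:

```python
# Sketch: use MLflow tracking to give each agent run an audit trail.
# The `groundedness` score is a hypothetical metric computed by your
# own evaluator model.
import mlflow

def log_agent_run(task: str, output: str, groundedness: float) -> None:
    with mlflow.start_run(run_name="agent-task"):
        mlflow.log_param("task", task)
        mlflow.log_param("output_preview", output[:250])
        mlflow.log_metric("groundedness", groundedness)

log_agent_run("invoice INV-1042", "Invoice total is $1,200.", 0.93)
```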

Probabilistic systems and the governance gap

Much of the conversation around AI still assumes linear, deterministic systems. But enterprise leaders must now grapple with probabilistic, distributed systems – a new frontier that Williamson believes many are unprepared to govern. He continues:

We’ve solved deterministic tracing. Now we have to solve it again, for probabilistic systems.

This shift is driving changes not just in technology, but in talent strategy. Roles that once focused on administration are evolving into model QA and governance. Frontline workers, from nurses to logistics staff, will increasingly spend less time on data entry and more on validating or "marking the homework" of automated systems.

So how should companies respond?

First, Williamson urges leaders to move beyond AI aspiration and confront their data reality. Start with a service catalog, trace spending, and follow the money. Classify what you already pay to store. Then, gradually layer in AI to help unlock and interpret it.

Second, invest in proprietary advantage. Build with open-source models and unique data to generate intellectual property. Use off-the-shelf tools for commodity tasks, but don’t outsource differentiation.

Outsourcing just brings too much black box behavior. You’ve got to build your own capability – not just to reduce risk, but to own your IP. Differentiation starts with your data.

Finally, shift the strategic lens. Don’t just ask what the model can do – ask what it’s learning, how it’s evolving, and what new forms of data value it’s producing. In Williamson’s view, AI isn’t just a system. It’s a system of systems. He acknowledges:

There will be chaos. But we’re starting to see a formula emerge. And it starts with data.

That might be the real differentiator in the next wave of enterprise AI.
