
Something for the weekend - why enterprise AI progress is not where the industry thinks it is! Time to chow down on some snake tail?

Alyx MacQueen, January 23, 2026
Summary:
Databricks’ research into instructed retrieval and the OfficeQA benchmark suggests that the hardest problems in enterprise AI are no longer about model intelligence. Instead, they lie in how systems interpret instructions, navigate fragmented data, and connect models to tools.

Ouroboros - Pixtomental (© Pixabay)

Enterprise AI is often described as a pipeline. Data goes in, answers come out. In practice, it behaves more like a feedback loop. Each time organizations add new data sources, tools, and constraints, the system reshapes itself. Expectations rise, margins for error shrink, and weaknesses that were once invisible become harder to ignore.

There is an old image of a snake eating its own tail – the Ouroboros – used to describe systems that evolve by consuming themselves. Enterprise AI increasingly follows this pattern. Progress does not simply expand capability; it tightens the conditions under which accuracy is judged.

The research from Databricks suggests that many failures in enterprise AI emerge from this dynamic. As models become more capable and use cases more ambitious, the limiting factor is no longer language generation itself. Instead, it is the system’s ability to consistently retrieve the right information, apply the right constraints, and connect user intent to complex, fragmented data environments.

Michael Bendersky, Director of Research at Databricks, explains the problem in simple terms: 

The model reasons about the wrong results, not about the actual results that you should have been reasoning about.

In theory, Retrieval-Augmented Generation (RAG) was designed to solve this problem by grounding models in external data. In practice, Databricks argues that RAG often breaks down precisely where enterprise users need it most: when instructions are complex, data is heterogeneous, and precision matters. Bendersky describes the issue as structural rather than algorithmic:

There’s a bottleneck between the models and the tools, because tools were written for humans.
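
To make that bottleneck concrete, the sketch below shows the single-corpus RAG pattern this critique targets: one flat document collection, ranked purely by semantic similarity, with the top passages handed to the model as context. It is a minimal illustration, not Databricks' implementation; the embed function is a placeholder rather than any real embedding API.

```python
# Minimal sketch of the single-corpus RAG pattern under discussion:
# one flat document collection, ranked purely by semantic similarity.
# `embed` is a placeholder, not a real embedding API -- in production it
# would be a model call, and the top passages would be fed to an LLM.

from math import sqrt

def embed(text: str) -> list[float]:
    # Placeholder embedding: deterministic but meaningless numbers.
    return [float(ord(c) % 7) for c in text[:16].ljust(16)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def naive_rag_context(question: str, corpus: list[str], k: int = 3) -> list[str]:
    # Rank every document by similarity to the question and keep the top k.
    q_vec = embed(question)
    ranked = sorted(corpus, key=lambda doc: cosine(q_vec, embed(doc)), reverse=True)
    return ranked[:k]

# Everything hinges on one corpus and one similarity score: no schemas,
# no metadata filters, no notion of which source a passage came from.
```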

From web search to enterprise search

Bendersky joined Databricks after a career at Google and DeepMind working on search and machine learning. He describes his move as a shift in context rather than ambition: from web-scale search to enterprise information discovery.

In practice, enterprises operate across multiple information sources, each with its own structure, metadata, and governance rules. Bendersky explains that traditional RAG assumptions do not reflect enterprise reality.

People usually think of RAG as: you take a document collection, and then you sort of run some retrievals. [In reality, enterprises] would have multiple sources of information, and each of these sources would have its own characteristics, its own schemas.

This heterogeneity changes the nature of the retrieval problem. Instead of searching a single corpus, systems must reconcile diverse data environments while respecting complex instructions.

Where RAG fails in real-world deployments

Databricks’ research highlights a recurring failure mode: systems understand instructions at the prompting layer but lose them during retrieval.

Bendersky explains that user intent often combines semantic meaning and metadata constraints. He says:

Some parts of the intent are about metadata, and some parts of the intent are about the actual content of the document.

When retrieval systems fail to interpret that distinction, they surface documents that are semantically similar but logically wrong. The result is not obvious hallucination but subtle misalignment.
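
A small, invented example makes the distinction concrete. In the sketch below, the documents, field names, and query are hypothetical; the point is simply that part of the intent ("EMEA", "Q3 2025") is a metadata constraint that has to be enforced as a filter, while the rest ("revenue figures") concerns document content.

```python
# Hypothetical illustration of the failure mode described above. The field
# names, documents, and query are invented; the point is that part of the
# intent is a metadata constraint and part is about document content.

from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    region: str
    quarter: str

corpus = [
    Doc("Revenue grew 12% on strong subscription sales.", region="APAC", quarter="Q3-2025"),
    Doc("Revenue figures for the period, EMEA segment.",  region="EMEA", quarter="Q3-2025"),
    Doc("EMEA revenue commentary for the prior quarter.", region="EMEA", quarter="Q2-2025"),
]

# User intent: "Q3 2025 revenue figures for EMEA"
semantic_part = "revenue figures"                         # matched against content
metadata_part = {"region": "EMEA", "quarter": "Q3-2025"}  # must be enforced as filters

def filtered_search(docs: list[Doc], filters: dict[str, str]) -> list[Doc]:
    # Enforce the metadata constraints before any similarity ranking.
    return [d for d in docs if all(getattr(d, k) == v for k, v in filters.items())]

# With the filter applied, only the logically correct document survives;
# a similarity-only retriever would find all three plausibly "relevant".
print(filtered_search(corpus, metadata_part))
```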

Bendersky argues that many enterprise failures occur before reasoning even begins: 

It’s not that the model cannot reason about the thing. It cannot find the thing in the first place.

This reframes the problem: enterprise AI systems are not primarily limited by model intelligence, but by the quality of the evidence they retrieve.

Instructed Retriever – embedding instructions into retrieval

Databricks’ response is an architecture it calls Instructed Retriever. Rather than treating retrieval as a single step, it decomposes the process into three stages: query decomposition, metadata reasoning, and contextual relevance. Bendersky explains that the first stage translates verbose natural language prompts into structured queries that retrieval tools can execute. 

You take the original user prompt, and decompose it into actual queries that the retriever understands.

He describes metadata reasoning as mapping those queries onto schemas and indexes, and contextual relevance as evaluating retrieved evidence against the original intent rather than ranking documents solely by semantic similarity. This architecture reflects a broader shift in enterprise AI design: retrieval is no longer a mechanical step but an interpretive process.
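
The three stage names come from Databricks; the sketch below is only an assumed illustration of how they might fit together, with toy function bodies standing in for what would, in practice, be model calls and real retrieval backends.

```python
# Assumed sketch of the three stages named above: query decomposition,
# metadata reasoning, and contextual relevance. The stage names come from
# Databricks; every function body, field, and source name here is invented
# for illustration.

def decompose(prompt: str) -> list[dict]:
    # Stage 1: turn a verbose natural-language prompt into structured
    # sub-queries a retriever can execute (an LLM call in practice).
    return [{"semantic": "revenue figures",
             "constraints": {"region": "EMEA", "quarter": "Q3-2025"}}]

def map_to_schema(sub_query: dict, schemas: dict[str, list[str]]) -> dict:
    # Stage 2: decide which sources/indexes can satisfy each constraint,
    # based on the metadata fields those sources actually expose.
    targets = [name for name, fields in schemas.items()
               if all(key in fields for key in sub_query["constraints"])]
    return {**sub_query, "sources": targets}

def rank_by_intent(candidates: list[dict], sub_query: dict) -> list[dict]:
    # Stage 3: evaluate retrieved evidence against the original intent,
    # not just raw similarity -- a keyword check stands in for that judgment.
    keyword = sub_query["semantic"].split()[0]
    return [c for c in candidates if keyword in c["text"].lower()]

schemas = {"finance_reports": ["region", "quarter"],
           "support_tickets": ["product", "severity"]}

plan = [map_to_schema(q, schemas) for q in decompose("Summarise Q3 2025 EMEA revenue figures")]
candidates = [{"text": "EMEA revenue figures, Q3 2025"},
              {"text": "EMEA headcount plan, Q3 2025"}]
evidence = rank_by_intent(candidates, plan[0])
print(plan, evidence)
```

Even in this toy form, the shape of the pipeline is visible: the constraints survive decomposition, are matched against the sources that can actually express them, and are used again to judge the retrieved evidence.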

From incremental improvement to step change

Databricks claims that instructed retrieval delivers more than incremental gains, particularly in environments with highly heterogeneous data. Bendersky describes deployments where traditional RAG failed entirely:

We had customers who had literally tens of different sources of information… RAG would just not work.

He characterizes the impact of instructed retrieval as transformative rather than marginal:

This is not an incremental improvement. It’s a step change... before, this customer could not launch their system at all.

These findings align with Databricks’ benchmark results, which show significant improvements in retrieval accuracy and agent performance compared with baseline RAG systems.

Models versus systems

A central theme in the research is its rejection of a model-centric explanation for enterprise AI failures. Bendersky acknowledges that larger models often perform better, but he argues that the industry is approaching diminishing returns in some areas. “We’re reaching a point of diminishing returns for some tasks,” he says, adding that “there’s a lot of headroom in how models interact with data and tools.”

He frames enterprise AI as a socio-technical system rather than a single technology. Even highly capable models cannot compensate for weak tooling, poor data pipelines, or misaligned retrieval architectures.

Databricks’ OfficeQA benchmark reinforces this argument by focusing on grounded reasoning tasks that reflect real enterprise workflows rather than abstract intelligence tests. The benchmark exposes a gap between what frontier models can do in controlled environments and what they can reliably deliver in complex data contexts. It also highlights a shift in enterprise expectations: as AI systems become more capable, tolerance for small errors declines.

Bendersky argues that organizations often misdiagnose the source of these failures:

People may complain that the model is not good enough. What they really mean is that access to the data is not good enough.

This insight alone recasts enterprise AI progress as a problem of infrastructure and integration rather than of algorithmic breakthroughs.

My take

Databricks’ research points to an uncomfortable truth for the AI industry: the most consequential advances in enterprise AI may not come from larger models, but from better systems.

The Ouroboros metaphor is useful not because it is poetic, but because it captures a structural reality. Each improvement in enterprise AI increases expectations of precision, exposes new failure modes, and tightens the constraints under which systems must operate.

In that sense, instructed retrieval is not just a technical innovation but a signal of where enterprise AI is heading. As organizations push AI deeper into operational workflows, the limiting factor will increasingly be the ability to translate human intent into machine-executable logic across fragmented data environments.

The broader implication is that enterprise AI is entering a phase where intelligence is no longer the primary differentiator. Reliability, interpretability, and system design are becoming equally decisive. If the industry continues to measure progress primarily in terms of model capability, it risks misunderstanding where its most significant bottlenecks – and opportunities – actually lie.
