Why data curation skills are essential for agentic migrations - an Informatica use case makes the point

George Lawton, March 24, 2026
Summary:
Salesforce recently completed an ambitious plan to migrate the Informatica Help system to the Salesforce AgentForce experience in 24 days, ahead of its 30-day target. Salesforce VP of Data Science Irina Malkova walks through the essential role that human curiosity and experimentation played in making this ambitious project go smoothly.

Large Language Models (LLMs) are certainly getting better at distilling and synthesizing complex information to solve new problems. It is easy to imagine simply throwing the right raw data at them to deliver a perfect customer experience. But, as it turns out, human-curated data science skills honed through the ongoing experience of diagnosing problems can also play an essential role.

That tension recently showed up for a small Salesforce team tasked with migrating 100,000 documents from the Informatica help site into the Salesforce customer experience platform as part of the recent Informatica acquisition. It was part of a broader effort not just to acquire the company but to weave its processes, data tools, and institutional knowledge into the Salesforce platform. Salesforce VP of Data Science Irina Malkova was given the ambitious goal of replicating, for Informatica, what her team had spent many months building for the Salesforce help site, and doing it within 30 days of the acquisition closing.

Malkova assembled a team that was familiar with this kind of transformation. Shruti Agarwal, a data engineering MTS, and Madhu HC, a software engineering SMTS, handled the core technical execution around the clock from India and the US. Helen Matsumoto, a senior product manager, led UX and evaluation. Malkova shepherded the effort. Agarwal recalls that it felt like an extremely ambitious effort given the scope of documentation:

When we first saw the scale of the Informatica data across all versions, I thought to myself, 'Wait… we have to cover everything? All versions? Since the times I was a toddler?'

Indexes to agents

There was a lot to unpack when it came to migrating from a classic index search-based help center, as Informatica had developed, to the more granular conversational approach Salesforce has been building. On the existing Informatica platform, users would run keyword searches to find answers to highly technical integration questions. This worked, but it left room for improvement: users often had to scroll, rephrase their search, and sometimes still escalate to a human specialist.

The easy part of the project was ingesting the data. However, getting it into a state where it could actually support an LLM required a bit more work.

The data pipeline involved stripping documents of everything built for viewing by human users, like HTML headers, footers, and navigational elements, to distill the essential content. That content then had to be broken up into bite-sized chunks of around 512 tokens, roughly a few hundred words. Each chunk was then enriched with vector embeddings that capture its meaning and finally loaded into a retrieval-augmented generation (RAG) database to surface the right context for each query.
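The pipeline steps described above can be sketched in a few dozen lines. This is a minimal illustration, not Salesforce's actual code: the tag list, the four-characters-per-token heuristic, and all names are assumptions for the sake of the example, and the embedding and database-loading stages are omitted.

```python
# Sketch of the two early pipeline stages: strip HTML chrome, then chunk
# the remaining text to a rough token budget. Illustrative only.
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep visible body text; drop headers, footers, and navigation."""

    SKIP_TAGS = {"header", "footer", "nav", "script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # how deep we are inside discarded tags

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_html(doc: str) -> str:
    """Return only the human-meaningful text of an HTML document."""
    parser = TextExtractor()
    parser.feed(doc)
    return " ".join(parser.parts)


def chunk(text: str, max_tokens: int = 512, chars_per_token: int = 4) -> list[str]:
    """Split text into ~max_tokens pieces on word boundaries, using a
    crude chars-per-token estimate in place of a real tokenizer."""
    limit = max_tokens * chars_per_token
    chunks, current = [], ""
    for word in text.split():
        if current and len(current) + 1 + len(word) > limit:
            chunks.append(current)
            current = word
        else:
            current = f"{current} {word}".strip()
    if current:
        chunks.append(current)
    return chunks
```

In a real pipeline the chunking would use the embedding model's own tokenizer rather than a character heuristic, and each chunk would then be embedded and written to the vector store.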

Malkova describes this process as going relatively smoothly, thanks to a year and a half of doing the same for Salesforce's own knowledge base. The team wasn't sure the playbook would cleanly map to the new Informatica data, so they were pleased when it largely went to plan. She explains:

The majority of those things we were able to reuse. We knew exactly — here's how you process HTML, here's how you parse PDF, here's how you chunk, here's the hybrid index that works best for us, here's the right token size. So we just had that playbook and we applied it to Informatica. And we were able to on the first go get decent results.

The end result was that the entire 100,000 document collection could be run through the pipeline on the first day, which left room for iterating on the data science transformations required to fine-tune different elements in the pipeline. Madhu HC reflects:

Building the Informatica Agent in 24 days (instead of 30!) with my team was intense, exhilarating, and incredibly rewarding.

Unexpected problem

Initially, everything seemed to be going according to plan. But then the iterative work surfaced something the team had never encountered before. For context, Salesforce is a cloud-based platform running a single version across all customers, so the concept of versioning simply doesn't arise in its own tools or as something its knowledge base needs to surface. Informatica, on the other hand, had spent decades selling on-premise software. As a result, it had an extensive library of documents covering different versions of the same products, many of them nearly identical, with only small textual variations distinguishing them.

Although the team knew that multiple versions existed, it wasn't apparent how versioning would shape the transformation process. They eventually discovered a tricky problem through the evaluation process: an unexpected pattern in the metrics. Malkova says:

We saw something very interesting that we had not seen before, which is that the precision of that retrieval was good — meaning that all of the documents that our pipeline was delivering to the LLM were very relevant. But then the utilization of these documents was low, meaning that the LLM ultimately chose to use a small number of these documents that we delivered. And we were like, well, this does not make any sense.
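The gap Malkova describes can be made concrete with a small numeric sketch. The metric definitions and the chunk names below are illustrative assumptions, not the team's actual evaluation code.

```python
# Illustrative versions of the two metrics: retrieval precision and
# document utilization. Chunk IDs are hypothetical.

def precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of delivered chunks that were judged relevant."""
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)


def utilization(retrieved: list[str], cited: set[str]) -> float:
    """Fraction of delivered chunks the LLM actually drew on."""
    return sum(1 for c in retrieved if c in cited) / len(retrieved)


# Five near-identical chunks, one per product version, all on-topic --
# but the model grounds its answer in only one of them.
retrieved = ["upgrade_v9", "upgrade_v10", "upgrade_v10.2",
             "upgrade_v10.4", "upgrade_v10.5"]
relevant = set(retrieved)    # precision = 1.0: everything delivered is relevant
cited = {"upgrade_v10.5"}    # utilization = 0.2: the anomaly the team saw
```

High precision with low utilization is exactly the signature of near-duplicate retrieval: every chunk is relevant, yet most are redundant to the model.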

What was happening was that the LLM, when confronted with multiple near-identical documents, began picking one and ignoring the rest. Matsumoto describes how the problem registered from the product side and how the solution emerged from a different kind of thinking:

Early on, we discovered a challenge: Informatica customers have various on-prem product versions that we identified through the data: the agent kept reporting that we were sending it too many documents that were duplicative. This stumped us at first. Our natural reaction was to filter to the right version before sending the document to the agent, but that wasn't technically possible.

The team made several attempts to build a filter in the RAG pipeline, with little success. So they took a step back and reframed the problem entirely. Instead of trying to narrow down the chunks that reached the LLM, they found a way to enrich what the LLM received so that it could make the version determination itself. Matsumoto explains:

After a few unsuccessful attempts to solve it, we were cutting close to the launch date. So we hopped into a long working session and talked through every technical and UX scenario possible. Ultimately, we landed on the simplest solution — that required us to shift our thinking. Rather than struggling to filter to the right version before we send to the agent, we would let the agent find the right documentation by adding the right metadata. LLMs are clever — and so are we.

In practice, this meant using the Salesforce Data Cloud to extract version metadata from the documents and prepending it directly to each chunk. This allowed the agent planner service to establish what version a user was running and surface the appropriate documentation in response. Agarwal recalls:

We'll always remember it as 'the famous last-minute tweak'. Although we were so close to the launch date, we paused, designed the change calmly, tested quickly, and collectively held our breath — until it worked.
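The fix itself is conceptually simple. The sketch below shows the general shape of prepending version metadata to a chunk; the header format, field names, and version number are illustrative assumptions, not the actual Data Cloud schema.

```python
# Sketch of the metadata-prepending fix: give each chunk a version header
# the LLM can read, so it can disambiguate near-duplicates itself.

def enrich_chunk(chunk_text: str, metadata: dict) -> str:
    """Prefix a chunk with product and version metadata."""
    header = f"[Product: {metadata['product']} | Version: {metadata['version']}]"
    return f"{header}\n{chunk_text}"


enriched = enrich_chunk(
    "To upgrade the repository service, run the upgrade wizard.",
    {"product": "PowerCenter", "version": "10.5.4"},  # hypothetical values
)
```

With the header in place, the agent planner can match the user's stated version against the metadata in each chunk rather than relying on a filter upstream.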

The value of felt sense for data

I asked Malkova what it felt like to have a sense that something was wrong before the metrics had fully named it, sort of like the kind of intuition software quality engineers sometimes call a code smell. Her response:

I have never heard 'data smells.' It sounds awesome. I'm going to bring it back to the team. Yes, we smell data really well. My team has been doing this for over ten years. Before AI, we did ML. So we were building classic ML algorithms for Salesforce for a very long time. So yeah, we're very familiar with all the different parts in the pipeline that can influence things. And especially, most importantly, how do you catch them? Because I think a lot of complications in today's AI development process comes from it being really hard to diagnose what good looks like.

This diagnostic instinct for knowing which lever in the pipeline is likely responsible for a given pattern is what allows a team to move quickly through the iterative phase. The version problem didn't even present as a version problem; it showed up as an anomaly in the gap between precision and utilization. It took a set of experienced eyes to read what that meant and to devise creative solutions around it.

My take

Many of the large AI companies are loudly celebrating improvements in their newest models, and those improvements are real. Yet there will always be uncertainty at the edge of what statistical, correlation-based approaches like LLMs can navigate on their own. This is where SaaS vendors and experts with deep familiarity with the nuances of customer data, and how that data moves across different problem types, will likely continue to play an important role in orchestrating the pipelines that LLMs depend on.

This success highlights the crucial role that careful data work and intuition play in getting the best results. Their pipeline reuse helped make the 24-day sprint possible. Their evaluation discipline helped surface the versioning anomaly before it became an issue in production, and their willingness to step back from an approach that wasn't working to re-frame the problem helped get them to the finish line. This combination of accumulated expertise and instinct is hard to replicate, and perhaps harder still to automate away.

Image credit - Pixabay

Disclosure - At time of writing, Salesforce is a premier partner of diginomica.
