What separates a good finance agent from a weekend project? Inside Sage's AI architecture with CTO Aaron Harris
- Summary:
- During this exclusive chat with CTO Aaron Harris, we talked about the advances Sage has made in its AI architecture. Does this result in better finance agents? And what has Harris learned from his own agent building experiments?
The investor market tells us what they think about the forward value of SaaS. But SaaS vendors have a different narrative - and so do their customers.
Sage's analyst event brought these issues to a head last week, as we delved into an AI architecture that is altogether different from an out-of-the-box frontier model.
Sage CTO Aaron Harris would know - his team built that architecture. But Harris has also experimented with building his own finance agents on weekends (he calls his finance agent "Arthur"). The comparison he's seen between his weekend agentic pal Arthur and Sage Copilot is stark.
Harris has been bearing down on the limitations of frontier models for a while. As Harris told my colleague Phil Wainewright:
If you go back to that question, 'Which stock item will I need to re-order soon?' — GPT doesn't recognize it as an accounting question. It doesn't recognize that there's any accounting terms in that. Whereas the models that we've trained understand that it is actually an accounting question. It knows that stock is referring to inventory, and it understands the definition of that in the world that we operate in.
How domain-specific AI models change cost pressures - and results
Sage AI (including Sage Copilot) is already widely available to customers - a topic we'll return to as we head into the Sage Future user event next month. But for now - what did we learn from the provocative discussions with executives during the analyst event? During our quick-but-vigorous on-site podcast, Brian Sommer and I hit on SaaS margin pressures, the future of ERP, what Sage customers had to say about AI adoption, and why they are more likely to consume AI services within Sage than build their own.
I had the chance to push into these same topics with Harris in Atlanta. In my recent opus on agentic AI projects, I laid out three characteristics that (potentially) differentiate enterprise AI from vibey projects on frontier models:
- Context at the time of inference - getting AI systems the best information at any moment in time, for that particular company and user, in a well-governed way. (A work in progress, but the right approach).
- Constraining LLMs in a "compound systems" architecture, combined with other forms of machine learning, deterministic systems, and external tool calls to verifiers, rules-based automation, or sources of database truth.
- A promising distinction between off-the-shelf frontier models and domain-specific/smaller models, informed by relevant data.
Sage is utilizing all three, but for this piece, I'll focus on domain-specific and smaller models. There is a misconception that smaller models are just about inference cost mitigation. As Harris and team have learned, domain models for finance have big accuracy/relevance advantages.
Reducing inference costs by 97 percent? "The math here is pretty straightforward"
As I said to Harris: "We had some really good discussions around pricing and monetization. My view is that every vendor is going to have to work through some of that with customers in terms of what works best for their users, but one real key is LLM cost control. The more you can keep those costs down, the better your chances to get the pricing where you want it to be - stable. You talked about situations where you could get down to around three cents on the dollar in terms of the cost of your own models." Harris responded:
The math here is pretty straightforward. If you look at the latest reasoning models, they're in the trillions of parameters, right? One and a half, two trillion parameters. The smaller, more efficient models are maybe 700 billion to a trillion parameters. They're huge, and the number of parameters in a model is directly related to the energy requirements. These models are awesome because they are proficient in every domain, right? They can simultaneously get a great score in the CPA exam and the bar exam.
So why did Sage pursue finance-specific models?
When it comes to the sorts of activities that we need to be really, really good at, what we need instead is a hyper-focused, very efficient model. So we want that model to be able to support conversational back-and-forth, but in terms of its knowledge work and its understanding, the domain is much more limited. If you look at what we call the small language models that we train, they start from a seven billion parameter foundation. So the math there is pretty obvious. When you're looking at a trillion or two parameters versus seven billion, [our models are] orders of magnitude faster and more efficient.
It gives you tremendous flexibility in how you operate the models, because you don't have to have the kinds of special hardware that are required to host large models. There are so many reasons for us to operate these models, and the fact we can make them specialists means that they are actually more performant for the problems we throw at them.
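Harris' "straightforward math" is easy to sanity-check. A quick back-of-the-envelope sketch, using only the illustrative parameter counts from his quote above (and assuming, crudely, that inference cost tracks parameter count):

```python
# Back-of-the-envelope check of the parameter-count math Harris describes.
# All figures are illustrative, taken from the quote above.
frontier_params = 1.5e12   # ~1.5 trillion parameters (large reasoning model)
sage_slm_params = 7e9      # 7 billion parameter foundation (Sage's small models)

ratio = frontier_params / sage_slm_params
print(f"Frontier model is roughly {ratio:.0f}x larger")  # roughly 214x

# If inference cost scaled linearly with parameter count, a 7B model would
# cost well under 1% of the frontier model per query. The relationship is
# not perfectly linear in practice, but it is directionally consistent with
# the "~3 cents on the dollar" (97% reduction) figure cited above.
relative_cost = sage_slm_params / frontier_params
print(f"Relative cost: {relative_cost:.2%} of frontier-model inference")
```

The point is less the exact percentage than the order of magnitude: a 200x parameter gap leaves plenty of headroom for the 97% cost reduction Harris describes, even after overheads.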
How Sage's agentic architecture avoids semantic mistakes
When does Harris utilize larger frontier models? He cites two main reasons: when conversational fluency is at a premium, and when an agentic workflow needs an orchestration "brain" to map out and plan steps, tool calls, etc. But in most cases, smaller models are not only more efficient, but more effective.
During our prior podcast, Harris explained why LLMs can struggle with finance terminology. But at the analyst event, Harris went further into the architecture. For example? Sage can process questions that were getting blocked by the content moderation policies of larger models (basic queries like "look up my account" were flagged). Sage calls their finance filter the "arbiter." Harris' explanation points to the potential of the "compound systems architecture" I cited earlier:
Well, I think there is a bigger question here: there is the model that is sort of the brain of the system. But it's equally important the system that you build around it, right? In this case, what you're referring to is this thing we call the arbiter. It's like a firewall that sits in front of the model, between the customer and the model.
Every prompt that goes into the agents and the model goes through this firewall, and every response from the model back to the user goes back through the firewall on the way out. Part of its responsibility is to apply semantics to the conversation that are tuned to the finance and accounting world - that are tuned to the industry that the customer operates in.
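Sage hasn't published the arbiter's internals, but the pattern Harris describes - a filter wrapping the model in both directions, applying finance semantics on the way in and policy checks on the way out - can be sketched minimally. Every name and rule below is a hypothetical illustration, not Sage's code:

```python
# Minimal sketch of the "arbiter" pattern Harris describes: a firewall-like
# filter between the user and the model, inspecting traffic BOTH ways.
# All names and rules here are invented for illustration, not Sage's code.

FINANCE_TERMS = {"account", "stock", "turnover", "invoice", "reconcile"}

def apply_finance_semantics(prompt: str) -> str:
    """Tag prompts that use finance vocabulary so the downstream model
    treats e.g. 'look up my account' as an accounting request, rather
    than a credential request that trips content moderation."""
    words = set(prompt.lower().split())
    if words & FINANCE_TERMS:
        return f"[domain=finance] {prompt}"
    return prompt

def arbiter(prompt: str, model) -> str:
    inbound = apply_finance_semantics(prompt)   # filter on the way in
    response = model(inbound)
    if "BLOCKED" in response:                   # filter on the way out
        return "This request was declined by policy."
    return response

# Usage with a stand-in model:
echo_model = lambda p: f"handled: {p}"
print(arbiter("look up my account", echo_model))
# -> handled: [domain=finance] look up my account
```

The real system would use trained classifiers rather than a keyword set, but the architectural point survives the simplification: the domain knowledge lives in the wrapper, not only in the model weights.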
That's where your homegrown AI agents can get into trouble:
[In a general LLM], some of these things might trigger a content warning, without that context.
LLMs also struggle with regional terminology variations. But as Sage builds out their models, they tune past these obstacles in lingo. Consider the word "turnover."
In the UK, you don't talk about how much revenue a company produces, you talk about their turnover... But if you're talking about a US company, it's employee churn, right? Or it's how regularly do you turn over your inventory. It's not revenue. These things are baked into the local lingo, and this semantic layer needs to understand that.
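At its simplest, the disambiguation Harris describes amounts to a region-aware semantic lookup. A toy illustration (the mapping below is invented for this sketch, not Sage's semantic layer):

```python
# Toy illustration of region-aware finance semantics for "turnover".
# The mapping is invented for this sketch, not Sage's semantic layer,
# which would also use surrounding conversational context.

TURNOVER_MEANING = {
    "UK": "revenue",          # UK usage: a company's turnover is its revenue
    "US": "employee churn",   # US usage: staff turnover (or inventory turns)
}

def interpret(term: str, region: str) -> str:
    if term.lower() == "turnover":
        return TURNOVER_MEANING.get(region, "ambiguous")
    return term

print(interpret("turnover", "UK"))  # -> revenue
print(interpret("turnover", "US"))  # -> employee churn
```

A static table like this is obviously too brittle for production; the interesting part of Sage's approach is that the regional meaning is learned into the model and the semantic layer, rather than hand-coded.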
Sage can also do this on a customer-specific basis, as Sommer and I discussed in our podcast. One example: a Sage customer that uses, shall we say, some adult terminology in their product line - a surefire content moderation barrier for a typical model. As Sommer said:
Aaron could have probably gone even harder on some of those examples that he was using... Even when they've built their own LLMs, they've had to put some interesting twists and changes in there. There was an example somebody gave about objectionable content... Sage has a customer that makes, let's say, some grown-up products, right?
Sage was able to address that semantic issue for that customer, without getting blocked by a content moderator.
The AI contrast - Harris' weekend adventures with "Arthur," his Claude Code agent
But how can we contrast an LLM approach to finance with Sage's architecture? How about Harris' own adventures with Claude Code and OpenClaw? As he told me:
I think a lot of people have a similar experience to me. When you start to play with things like Claude Code, and more recently, with OpenClaw, on the one hand, there's this very magical experience that you're having where you realize you can be ten times more productive than you've ever been. For somebody like me, who stopped being a hands-on developer a few years ago, suddenly, I'm thrilled that I can be a current hands-on developer again. It's amazing.
Enter "Arthur," Harris' weekend finance agent:
The problem is that you actually do need to understand the world these agents are operating in, because they can make laughable mistakes. They've got behavioral problems. I created this agent that I call Arthur, to do accounting for me, because I was spending so much money on Anthropic with my OpenClaw setup, I needed an accountant to start keeping track of it all.
Alas, "Arthur" got a bit lost in spreadsheets:
Arthur wanted to use a spreadsheet to track everything. Arthur created the columns and figured out what the categories would be. One of the more entertaining things was just how inconsistent Arthur was, and how he would track the same invoice that I would get every day from Anthropic using a different category, randomly.
More vexing was: he would start to forget what the columns were for in the spreadsheet. He would put the amount in the category column. He would forget to put the invoice number in, and then he'd put it back later. He did this really strange thing, where for the first couple of days, he was forcing dates to be strings in the date column, but then just inexplicably, he stopped doing that. He went back to calling it dates. I asked him at some point, 'Hey, do you think you would be a better accountant if you could use accounting software?'
Uh oh:
To which my agent responded in the affirmative and gave me all the reasons why he would be so much better at his job if he had some accounting software to use. It's kind of a funny example, but it's also true, for all the reasons why a junior accountant should not just do the accounting in a spreadsheet - I should give them proper accounting software.
How does this compare to his use of Sage Copilot?
Well, the difference is that our product won't allow you to put a bad value in a field, right? You can't put a string into a date field. That's a really extreme and simple example, but one of the primary responsibilities of business software is to control data input, to make sure that you've got high quality data in the system.
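The contrast Harris draws - software that refuses bad values at input time, versus an agent freelancing in a spreadsheet - is classic schema validation. A minimal sketch of the idea (field names and rules are hypothetical, not Sage's data model):

```python
# Minimal sketch of input validation of the kind Harris contrasts with
# Arthur's spreadsheet: a typed record that rejects bad values at entry.
# Field names and rules are hypothetical, not Sage's data model.
from dataclasses import dataclass
from datetime import date

@dataclass
class InvoiceLine:
    invoice_number: str
    amount: float
    invoice_date: date

    def __post_init__(self):
        # Unlike a free-form spreadsheet cell, a string can't sneak into
        # the date field: we reject it instead of silently storing it.
        if not isinstance(self.invoice_date, date):
            raise TypeError("invoice_date must be a date, not "
                            f"{type(self.invoice_date).__name__}")
        if self.amount <= 0:
            raise ValueError("amount must be positive")

ok = InvoiceLine("INV-001", 42.50, date(2025, 5, 1))  # accepted

try:
    InvoiceLine("INV-002", 10.0, "2025-05-01")        # string in a date field
except TypeError as e:
    print("rejected:", e)
```

This is exactly the guardrail Arthur lacked: the spreadsheet happily accepted an amount in the category column and a string in the date column, while a typed system of record refuses the write and forces the agent (or the human) to correct it.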
My take
That sets up the themes for Sage Future in San Francisco: what will we hear from customers on their use of Sage Copilot to date? The customer panel we spoke to definitely plans to consume AI through Sage. How that plays out is the story we need to watch.
Will Sage be able to get across, in vivid ways, why building agents with Sage (and Sage partners) is better than going it alone, or taking a spin with an agentic startup? Raw enthusiasm playing with new tools shouldn't be diminished, but we all need to see how enterprise constraints lead to better results.
I wanted to hear more from Sage on so-called "context." While I don't think context graphs will ever live up to the trillion dollar opportunity hype, the potential to rethink decision-making context is a topic I want to hear more from Sage about. In Atlanta, Rob Sinfield, SVP ERP at Sage, walked me through one standout example, where Sage built a shipping intelligence agent that pulled in tariff data, matched it to shipments at sea, and pulled in container tracking data - and then proactively recommended actions to the customer. To me, that's the exciting edge of where AI agents are realistically headed: combining quality system of record data with impactful external sources, and delivering new actions/decision points in real time.
Of course, to ensure Sage customers get the best results, partners need to build out agents too - otherwise this becomes a free-for-all of third-party agentic apps and vendors, losing context and governance. That will be the job of Udit Batra, VP Platform Intacct. I talked with Batra about how Sage Agent Builder can give partners the ability to build agents with the same context and via the same Copilot user experience. Hopefully we'll see some early examples of this in San Francisco.
During our podcast, Sommer and I shared our own expectations on what we wanted to hear in San Francisco:
I wanted to hear mature customer stories of (successfully) contending in volatile markets running on Sage, even if AI isn't a central part of that story.
Meanwhile, Sommer wants his "socks blown off" with some demos that show a whole different way of approaching finance/operations. Analyst expectations are always hard to juggle, let alone satisfy. Fortunately for Sage, it's delivering for customers that really matters. We're about to find out.