Speaking of AI-generated voice technology, how do organizations get the tone right?
Summary:
The AI voice generation market is growing apace as organizations aim to cut costs and boost efficiency. But do customers even like the technology?
While voice generation may not be top of everyone’s list when thinking about AI, the market is growing at a surprising lick.
In fact, researcher MarketsandMarkets predicts it will hit $20.71 billion by 2031, up from $4.16 billion in 2025. This equates to a healthy compound annual growth rate (CAGR) of 30.7%.
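For readers who want to check the arithmetic, here is a minimal sketch of the standard CAGR calculation applied to those two figures. It assumes six compounding years between 2025 and 2031, and the function name `cagr` is purely illustrative rather than anything taken from the MarketsandMarkets report.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Return the compound annual growth rate as a fraction (0.10 means 10%)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must both be positive")
    # CAGR = (end / start) ** (1 / years) - 1
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted above, in billions of US dollars.
# Assumption: growth compounds over 2031 - 2025 = 6 years.
rate = cagr(4.16, 20.71, 2031 - 2025)
print(f"Implied CAGR: {rate:.1%}")
```

On that assumption the result rounds to 30.7%, which matches the quoted figure.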
This growth, it says, is being driven by demand for conversational AI, voice automation, and omnichannel voice experiences as organizations try to enable “hyper-personalized customer engagement”. As Nick Lahoika, Founder and Chief Executive of Estonian company Vocal Image, whose app provides AI-based soft skills coaching, says:
AI voices are used more often than you might think in different niches, such as audio books and podcasts, and more recently video-generated images. The drivers are ease of deployment and not needing to go somewhere to find actors and negotiate rates, rent a studio, work with sound engineering and the like. So, it’s fast, efficient, and cheap.
The MarketsandMarkets report comes to a similar conclusion:
The synthetic voice segment is expected to register a higher CAGR than the natural voice segment during the forecast period, driven by rapid advances in neural TTS (text-to-speech), diffusion-based audio models, and real-time cloning technologies. Enterprises across media, gaming, advertising, and e-learning are increasingly replacing traditional voice recording workflows with AI-generated voices that can scale across multiple languages, tones and content formats.
In fact, the media and entertainment industry boasted the largest market share last year. MarketsandMarkets explains:
As audience expectations shift toward global, localized, and multi-lingual content, AI voice technology has become a strategic asset for accelerating production cycles, reducing dependencies on physical studios, and ensuring creative flexibility, cementing the media and entertainment sector as the largest end user enterprise segment in 2025.
North America, meanwhile, was the largest geographical market for AI voice, thanks to its strong technology ecosystem and early enterprise adoption. Another important factor was the concentration of AI infrastructure providers in the region.
Do customers even like AI voices?
But while organizations may be keen on AI voice generators as a means of speeding things up and slashing costs by cutting humans out of the loop, just how popular is the technology among customers themselves? According to a recent study, not very - at least if the voices sound synthetic.
The research, which was conducted among 10,000 users by Vocal Image, found that once listeners realized a voice was AI-generated, most found it a big turnoff and went elsewhere. Acceptance did not vary significantly with age either, meaning AI voices were as unpopular with young people as with older users. As the report points out:
There’s a very strong negative correlation (r = -0.80) between AI detection rates and approval rates across all providers. When users detect a voice as AI-generated, they overwhelmingly reject it, explaining why the most successful providers prioritize sounding authentically human.
A key issue here, Lahoika says, is that:
People get frustrated when they know they’re speaking to an AI voice. It’s a trust issue. It’s not so much of a problem if they’re listening to something like an audiobook, which is about sharing content. But if someone wants to get information or have a query answered, they need a higher level of trust. While users often skip when they detect an AI voice, some high-quality voices keep them engaged until the end, proving that it all depends on the quality of the voice.
As for who is most likely to spot an AI-generated voice, that accolade falls to the Brits. Users here (43.5%) are a huge 6.5 percentage points more likely to detect non-human vocalization than their US peers (37%), although it is unclear why.
Compared with non-native English speakers, though, native English speakers tend to give AI voices a lower rating. As the report points out:
Native speakers have finely tuned expectations for natural speech patterns, making them significantly better at detecting synthetic voices. Non-native speakers prioritize clarity over authenticity and are less sensitive to subtle artificial artifacts.
Loved and loathed voice characteristics
As to the specifics of what makes users love or hate a particular synthetic voice, there were various patterns. People largely preferred voices that sounded confident (19%), clear (11%), and authentic (10%). But they were not so keen on those that sounded AI-generated (-36%), monotonous (-7%), and nasal (-5%).
Interestingly, though, there was a huge, roughly threefold quality gap between the 20 highest- and lowest-performing AI voice models (86.2% versus 29.2%), reflecting how convincingly human they sounded. As the report explains:
The best providers succeed precisely because they minimize detectable AI artifacts. MiniMax has only a 12.8% AI detection rate, while low-rated Speechify is flagged 67.8% of the time…[But] don’t assume users will reject [all] AI voices – 67% approve when the quality is high.
Top of the tree were specialized digital-native AI startups. These consisted of:
- MiniMax (86.2% approval rating)
- PlayHT (85.6% approval)
- WellSaid Labs (82%)
- Lovo AI (81.4%)
- Descript (80.2%)
The only Big Tech vendor to make the top 10, though, was Microsoft. It came in at number eight with a 73.2% approval rating. Overall, Big Tech vendors, with their general-purpose AI platforms, had an average rating of 64%.
The upshot of all this, says Lahoika, is that if voice models are a core part of your business, as they are with Vocal Image, executives should:
Consider start-ups. The price might be higher than with more established companies, but the quality is much better, and the innovation moves much faster too.
My take
The overall finding of the Vocal Image report seems to be that users are happy to interact with AI-generated voices as long as they think they’re human – or consider them good enough quality to nearly be human. The point here is that humans like talking to humans. And the hope is that being able to do so does not become an increasingly rare and expensive luxury in the years to come.