
AI demands openness, say experts. But on whose terms?

Chris Middleton, March 26, 2025
Summary:
A panel debate about openness, open data, open source, and AI development produced a veritable pile-up of enlightening ironies.

(Network image: Pixabay)

One can argue that the UK government's proposal to change copyright rules to allow AI training is misguided – as the media, Britain's $160 billion creative industries, and AI industry trade organization UKAI do. Or one can claim that critics are impeding innovation and progress, as Secretary of State for Science, Innovation and Technology Peter Kyle does – even though UKAI, the AI industry's own trade body, disagrees with him.

But one thing most people agree on is that open data, access to data, and a general spirit of openness are critical for AI development, if the technology is to benefit humanity as promised.

So, the question becomes: what do we mean by 'open'?

Most people understand concepts such as open source and open data; in both cases, there is prior recognition and agreement that data and processes will be open. For example, anonymized datasets shared with opted-in consent, or open-source software in which developers are free to collaborate, update tools, and access and modify the source code.

In both cases, there is an acknowledged code of conduct, and a community that shares it.

Data access is important for AI development, but – critics of government proposals say – that should not mean data owners being forced to open their content against their will. Especially if it means wealthy vendors – some of which stockpile their own IP – turning it into money, while giving data owners nothing in return (not even an acknowledgment).

Some vendors claim that original work has no intrinsic monetary value. But those same firms also claim they just can't afford to pay for it. That's absurd, especially when some are among the wealthiest companies in human history, yet opt to lift proprietary data from known pirate sources. Data that, in AI trainers' view, is critical and essential, yet also of zero value, depending on the observer. It is almost as if the data is held in a quantum superposition: Schrödinger's data, perhaps. Is it alive or dead? Ask a lawyer!

Data scraping versus open collaboration 

Open data and open-source development are generally about collaboration, fairness, and mutual benefit, whereas the scraping of proprietary content without permission, credit, or payment is one-sided and unethical – not to mention illegal. Text and data mining (TDM) exceptions – an inaccurate description of how Large Language Models and generative AIs are trained, in any case – apply only to non-commercial academic research, not to the development of commercial products.

That is what the government's proposal seeks to change. But critics of the plan feel it is rigging the market in favor of the few: a handful of US vendors. Those companies claim to be helping the many but look remarkably like they are just helping themselves.

These were among the issues touched on at a panel entitled 'AI Openness in the Age of DeepSeek's R1' at AIUK in Westminster last week, the Alan Turing Institute's event at the Queen Elizabeth II Centre, just a stone's throw from the Houses of Parliament. The context? The realization that China's DeepSeek has "exploded" – in the host's words – the "embryonic ecosystem of AI".

Certainly, it has challenged the fragile venture capital funding behind the likes of OpenAI, given that DeepSeek (available in v3 this week) appears to be both cheaper and more efficient than its US rivals. (Half a trillion dollars for an OpenAI datacenter? No need, says the Chinese company – the corporate equivalent of a laughing-face emoji.)

So, where might the AI future go? And where does openness stand in all this? Does it just mean copyright owners being open to theft? (Or to theft being retrospectively redefined as fair use?)

A one-sided conversation

Speakers included Chair Amanda Brock, CEO of OpenUK; Dr Laura Gilbert, UK government AI advisor and Head of the AI for Government program at the Ellison Institute of Technology in Oxford; Alex Housley, founder of MLOps platform Seldon; and lawyer and OpenUK member Sonia Cooper, Assistant General Counsel at Microsoft, who also chairs the IP Federation's Copyright Committee – a poacher turned gatekeeper, it seems.

A panel of experts who, notably, all agreed with each other: that's what passes for debate these days. So, let's call it a conversation among allies and, in several cases, among colleagues at OpenUK. It was nice of them to let us listen in!

The conversation highlighted the importance of data access for AI development, citing examples such as Ordnance Survey maps and transport data – known open data, in other words. Panelists emphasized the need for better data-sharing infrastructures, for partnership between the private sector and academia, and for international collaboration to drive innovation. Fair enough. Who could argue with that?

But they also discussed the importance of public trust in government AI initiatives, while at the same time refusing to debate the abuse of trust involved in scraping data that isn't open – I put a question about this to them, but it was ignored by the Chair.

For a 'debate' about openness, then, it seemed remarkably closed and stage-managed. However, it did offer a glimpse into the world view of one of the planet's wealthiest and most valuable companies, Microsoft, also a key partner and backer of ChatGPT maker OpenAI.

Essentially, their argument is this: any human is free to read a book or research paper, learn from it, and act autonomously with that knowledge in the future; perhaps they will write a book of their own, or express it via interpretative dance. An AI is a bit like that person who is learning about the world – or it creates the persuasive illusion of doing that – so why shouldn't it be allowed the same rights? And scene.

But in the real world, generally, a person buys a book or goes to a library that has bought that book. Also generally, they don't load millions of pirated books and research papers into a truck in order to create an automated, commercial competitor to them – without acknowledging that they have even read those texts. But let's leave that aside for now. (Metaphors are dangerous, as novelist Milan Kundera once said.)

Microsoft’s view

As previously reported, Microsoft's Cooper said:

You don't want AI to be outputting anything that is an infringement, or that is used in an infringing way. And you don't want to weaken any existing protection that is there for rights holders in relation to the use of what would be a copy of their work.

I think what's important to think about is to take a step back and think about the technical analysis that's involved when you're analyzing data. And again, what copyright is intended to protect.

If you look at the scope of copyright protection as it's set out, it is intended to protect the expression of an idea, but it's not intended to lock up the ideas, the facts, the information that is within a copyright work. [Says who?!]

That's really important, because it enables knowledge to be developed from being able to read from something, and then build on it. Ideas are something that you can protect through patent protection, but not through copyright protection.

She added:

If you are technically analyzing something, it's necessary to take a technical copy of something in order to extract the unprotected elements. And I think we need to be very careful not to say that copyright prevents us from making those technical copies.

Because essentially what you will end up doing is locking up the ideas, the knowledge in that information, that was never intended to be protected by copyright.

Again, says who?

However, it was notable that she – like her fellow panelists – limited her comments to an academic research paper context and steered away from discussing other forms of IP. That is perhaps because it is easier to claim that research is all about extending human knowledge than it is to claim that novels, songs, paintings, and movies should all be loaded into our metaphorical van for resale.

Openness in the AI era

So, what else did the panel say? Brock set out the difference between 'open AI' (not the company) and open source, in both fact and spirit. She said:

Open source is something very specific and has a 30-year established definition when it comes to software. And key to that definition are 10 requirements that the licenses must meet. At points five and six, it says that anyone can use the code for any purpose.

Now, when we apply this concept of openness, of open source, to AI, we get into a more controversial area, because for it to really meet that traditional definition of open source, anybody has to be able to use the code for any purpose. And that's not really what we see from day to day when we talk about openness and AI.

We often see that open-source standard, that free flow that has enabled the mass adoption of open-source software, not being met. But that doesn't mean that the AI that's shared without meeting that standard isn't also open.

Hmm. To explain this point, she shared something of the history of Meta's Llama LLM family, suggesting that – in her view – as leaked software originally intended for use by academics, it could not be described as 'open source'. After all, the leak was driven by competitive commercial pressures in the, ahem, open market.

She continued:

Fast forward to July 2023, and you see OpenUK supporting the launch of Llama 2, even though it's not on a traditional open-source license. It's very much AI openness, not open source, because it has a restriction of 700 million users. If you hit 700 million users on any derivative work you create, you must get a license from Meta.

Then we see, as we move on, at the start of this year, DeepSeek R1. I think it's 20% of the cost of past LLMs. I believe it was created from distillation of Llama. So, if Llama hadn't been out there in the open, we wouldn't have R1!

A meta-irony, perhaps. Or a Meta one.

Commercial challenges 

For his part, Housley set out some of the challenges of being an entrepreneur in an open-source space:

One of the big challenges is companies can freely use your software, and if enough of the value is contained within the open-source component, then there's no obligation for them to pay you.

Indeed, in the early days of the open-source movement, Linux distros such as Red Hat and SuSE would compete on services, not product – an opportunity that is harder to sell in an AI world, unless you are addressing specialist niches. That aside, he continued:

I think, from an open-source start-up perspective, your mission is ultimately to grow into a sustainable, fast-growth revenue business. Whereas, as a large enterprise, often your motivation is to create some software which can be commoditized, so you're not tying up your internal teams in maintaining and building it. And you can all share access to this resource over time.

So, there is a tightrope or a thin line to tread around how much you put into the open source. Too much can be damaging, but too little can also impact the adoption of the core technology.

A valuable perspective. And whichever side of that line you stand on, the commercial incentive to commoditize as much of the world's data as possible becomes clear – even data that, legally, should not be scraped, remixed, and (effectively) resold in that way.

The government view

So, what is the government's perspective on all this AI openness? As (in effect) a government spokesperson, Dr Gilbert said:

There's a lot of work that goes into improving the delivery of frontline services with AI, and it's very human centric. So, the tagline is to try and make government more human using artificial intelligence. And we've gone open source very early in that journey.

When I announced that we were going to build this team, we said we're going to operate with 'radical transparency'. And there were a few reasons for that. One was the ethos that, really, we do believe in open source. But part of it was also public trust and ensuring that we can't make mistakes.

Wow. Let's draw a veil over the government's blundering about on copyright in AI training and attempting to force IP holders to open their data against their will.

On that point, however, Gilbert did share an amusing story – though I'm not sure she intended it to be funny:

One of the first things that happened when we open-sourced Redbox [Redbox Copilot, a Cabinet Office-developed tool that summarizes documents for civil servants] was that a corporate picked up the open source, repackaged it, and sold it back into the government!

"And I was a little upset about that, because we'd open-sourced it – sort of for transparency, etcetera. Of course, we're happy for people to use it. And we definitely built it for government, and for it to be used for free! But, obviously, we hadn't done a good enough job of advertising that it was sat right there.

So, seeing a company reselling it and saying, 'This is No 10 [Downing Street] certified code that we will now sell back to you'… that was really uncomfortable.

My Take

Well, now, isn't it a dog-eat-dog world? Having an AI company repackage your work and sell it back to you really sucks, doesn't it?
