There will never be enough data to satisfy AI, so why not pool it?
- Summary:
- A White Paper from the ODI takes a serious look at a fundamental issue for AI: no matter how many gen AI models get built, a significant part of the data used to train them will be common to most of the others, so why not share it from a common data pool?
The Open Data Institute (ODI) recently produced a White Paper examining a number of issues that are emerging as the generative AI bandwagon starts to encompass every activity that businesses and individuals are likely to need or be associated with. Underpinning all of those issues is one key problem - data - and that is the subject of the paper. For a start, with gen AI services there will never be too much data, and ideally each use of even a well-established and widely-used model will need its own data, just to be as confident as possible in the results delivered.
Managing that is going to be, arguably, one of the key issues over the foreseeable - and not so foreseeable - future. Its impact is likely to spread far wider than the sheer volume of data and where it can be physically stored: it will also raise problems around future changes in technology (with all the classic forward and backward compatibility arguments that can help or hinder the continued use of some data). That is only the start, however, for there will also be issues of data ownership and rights of application, entirely new data management techniques and technologies, the need for huge future investments in all of the above and, of course, their application at both a national and an international level.
Part of the paper's focus was on how this might impact the UK - and how the UK's current technical and economic capabilities might impact it. The UK is currently fond of talking up its place in the world of AI, and that is a reasonable view of its ability to understand the issues. Its ability to be, or become, a world leader in the advantageous and, to use the word of the day, 'ethical' application of AI is still very much an open question.
On the ODI website the Institute’s Executive Chair and Co-Founder, Sir Nigel Shadbolt, pointed to the UK’s problem here:
“If the UK is to benefit from the extraordinary opportunities presented by AI, the government must look beyond the hype and attend to the fundamentals of a robust data ecosystem built on sound governance and ethical foundations. We must build a trustworthy data infrastructure for AI because the feedstock of high-quality AI is high-quality data. The UK has the opportunity to build better data governance systems for AI that ensure we are best placed to take advantage of technological innovations and create economic and social value whilst guarding against potential risks.”
Build a pool, think an ocean
It could be that such data governance models and tools become the key market the UK ends up 'owning', but there is a long, and global, way to go before any such 'winners' are declared. Before that happens, some far more straightforward data battles will need to be played out. For example, the current dominant players in gen AI service provision, such as Google, Meta, OpenAI and Microsoft Azure, will certainly be joined by others looking to take a significant share of the marketplace, and each will feel that its core data resources are its key differentiator, even though a large percentage of what each holds will be a duplicate of what all the others hold. In other words, one of the key developments of the future is likely to require such leading contenders to realise that they lose nothing and gain a lot by pooling common data and sharing it. Indeed, the exigencies of running global AI sustainably may well end up demanding it.
One person who seems to share this view is Elena Simperl, the ODI's Director of Research and Professor of Computer Science at King's College London, who argues:
There are competing interests and certainly the big tech organizations, that have released these foundational models in the last two or three years, want to maintain competitive advantage. Part of that is the data that they hold or have collected, scraped or simply reused from existing, open or publicly available sources on the Internet.
The need, therefore, is to find a way to convince, or at least nudge, these players towards doing the right thing. In 2023, close to 150 foundational models were released, yet at their core they are based on essentially the same public content, some of which has been available for decades. Simperl sees this starting to bring its own troubles to each of the players, not least the resources each will require to clean that data.
In addition, those service providers are already in commercial races to agree data deals with media sources such as the Financial Times and the Wall Street Journal, she says:
But equally, they resort to other, perhaps less legal or ethical means to capture more data, because everyone seems to be convinced that they will be running out of data soon.
At the moment, it does seem that acquiring data, and the money that is expected to flow from having it, trumps the ethics of how it came into their possession. Having as much captive data as possible is seen as the best way to stay in the gen AI race, even if in practice much of it is the same.
Simperl, however, sees that the race could become an advantage in the long term, simply because all the players have to position themselves to be visible in the marketplace. More than half of the models introduced last year, she suggests, are open source models, so they will tend to have a degree of commonality - which could in turn make shared, standard data sources a growing possibility.
She sees open source as not quite the right term when it comes to AI, because it is not referring to classic Linux and its applications. She is referring to developments such as open-weight models, which allow users to access the trained model itself without any of the data that was used to train it. This approach is now being promoted by Meta, particularly with its Llama 2 gen AI offering. She also observed that it can't really be classed as open source because it does not expose the data used to train the models; Meta remains the gatekeeper of its own Llama models, rather than placing its data with an independent body, which is the ODI's preferred long-term option.
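To make the distinction concrete, here is a minimal sketch of what 'open weight' means in practice, assuming the Hugging Face transformers library and Meta's gated Llama 2 repository (meta-llama/Llama-2-7b-hf, which requires accepting Meta's licence terms before download). The weights can be fetched and run locally, but nothing in the download reveals the data the model was trained on.

```python
# A minimal sketch of using an open-weight model, assuming the Hugging Face
# transformers library and an account that has accepted Meta's Llama 2 licence.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # gated repository: weights only, no training data

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The weights are fully usable for inference...
inputs = tokenizer("Shared data pools could", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# ...but the corpus used to train them is not part of the release,
# which is why Simperl argues 'open weight' is not the same as open source.
```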
Keep it clean
That approach, however, is not having a direct impact on moving the gen AI community towards the notion of shared data resources for the vast amount of data that is common to all, and that can more easily be kept clean and validated - which is the ODI's long-term objective. Simperl says:
We have seen in the past that, at global scale, having one centralized approach is going to be hard, but what has worked in the past in a technology field is relying on open standards. I have worked with data that is published and accessed in a decentralised way, that allows every data publisher or data holder to decide which data they publish, when, and for what purpose. You don't need to assume that everyone in the world, or even everyone in the UK, is going to put their data in one place, but the availability of standards that are interoperable and that are commonly agreed means that, even if two organizations publish and hold their data separately, there aren't too many frictions when someone's trying to use both sources of data. That's the sort of scenario we have seen working, and we believe it can work at a large scale. Of course, when it's personal data there are questions around who should hold those data sets, whether it's the individual, for instance with the Solid Protocol, or whether there are other forms to empower people to have a say in who uses their data
There does seem to be a case for individuals, as well as businesses and other organizations, to manage two different versions of their own data - one genuinely private, to be accessed only with explicit authority, and the other a more public, edited version that third parties with legitimate ecosystem requirements can access at will.
According to Simperl, an approach along these lines is already being discussed by academics and technology providers under the generic name of Linked Data. It is a technology that the ODI has used and contributed to extensively, not least because it is a global movement launched by one of the co-founders of the ODI, Sir Tim Berners-Lee. She explains:
The idea was exactly that, that you would have these data sources which would be uploaded on whatever servers. It could be my personal server, or I could decide to put it on a different server, but they would be organized in such a way and published in such a way that other data holders could link to that.
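As a rough illustration of that idea (a sketch, not something taken from the ODI paper): in Linked Data, each holder publishes its records at URIs on its own server, and another holder links to those URIs rather than copying the data. The example below uses the Python rdflib library and hypothetical example-domain identifiers.

```python
# A minimal Linked Data sketch with rdflib: two data holders publish separately,
# and one links to the other's URI instead of duplicating the record.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, FOAF, RDF

# Hypothetical domains - each organisation mints URIs under its own server.
HOLDER_A = Namespace("https://data.holder-a.example/people/")
HOLDER_B = Namespace("https://data.holder-b.example/articles/")

g = Graph()
g.bind("foaf", FOAF)
g.bind("dcterms", DCTERMS)

# Holder A describes a person on its own server.
person = HOLDER_A["researcher-42"]
g.add((person, RDF.type, FOAF.Person))
g.add((person, FOAF.name, Literal("Example Researcher")))

# Holder B, publishing separately, simply links to holder A's URI.
article = HOLDER_B["2024-article-7"]
g.add((article, DCTERMS.creator, person))

# Turtle is one of the open, commonly agreed standards that keeps the two
# sources interoperable without forcing everyone's data into one place.
print(g.serialize(format="turtle"))
```

The point of such standards is that a consumer can load both holders' published files into a single graph without friction, which is the decentralised-but-interoperable scenario Simperl describes.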
One of the answers to service providers that try to serve their own interests too enthusiastically is going to be regulation, particularly regulation addressing the increasingly real prospect that data of value to a large ecosystem should be made available to that ecosystem rather than trapped behind a paywall. She acknowledges that it is still very early days in the process of defining what regulations will be needed to ensure transparency and access to model training data, but it is the ODI's belief that remedies to data access will be necessary when companies have amassed high volumes of data that are widely duplicated, and therefore unsustainable, and/or actually belong to, and are of collective benefit to, an ecosystem. She says:
There have to be remedies in place to allow other players to do that. So we've seen it's the same sort of conversation as we have seen in the US recently around Google and search data. I don't think it is going to be just the ODI changing the world, but we are clearly part of a movement, and we see this slowly emerging in some of the positions that some regulators are picking up, and other discussions with civic society organizations.
Among the ODI's current targets are the development of benchmarks, AI safety and AI transparency, often working in collaboration with academia and the growing AI technology and vendor communities. It also participates in a growing range of AI industry bodies. One such body, which she sees as having long-term potential, is MLCommons, a member-based organization that runs community working groups that anyone can join for free.
My take
It is reasonable to presume that some form of 'universal data pool' covering global AI endeavors is possible - but not foreseeable, and more likely 'pretty damned unlikely' - especially if the main model builders and service providers continue, long term, to see 'their data' as their primary advantage. However, it is also possible to see a strong need for it.
There are the immediately obvious needs that come from sustainability. AI gives a huge kick to the already outrageous exponential data growth rate, so energy supply, physical resources and even memory chip manufacture may all be outpaced by it. Then there is the subject of, to call it something, global coherence. With AI already a global animal, and with a growing number of applications having a global reach, the coherence of impact that comes from using common data as part of a model's training materials makes sense: indeed, it could stave off disaster scenarios that are yet to be understood.