AI Providers Cutting Deals with Publishers Could Lead to More Accuracy in LLMs
While many have proclaimed the arrival of advanced generative AI as the death of publishing as we know it, over the last few weeks we’ve seen a new development that could actually deliver significant benefits for publishers.
Because while AI tools, and the large language models (LLMs) that power them, can produce astonishingly human-like results, for both text and visuals, we’re also increasingly discovering that the actual input data is of critical importance, and that having more is not necessarily better in this respect.
Take, for example, Google’s latest generative AI Search component, and the sometimes bizarre answers it’s been sharing.
Google chief Sundar Pichai has acknowledged that there are flaws in its systems, but in his view, these are actually inherent within the design of the tools themselves.
As per Pichai (via The Verge):
“You’re getting at a deeper point where hallucination is still an unsolved problem. In some ways, it’s an inherent feature. It’s what makes these models very creative […] But LLMs aren’t necessarily the best approach to always get at factuality.”
Yet, platforms like Google are presenting these tools as systems that you can ask questions of, and get answers from. So if they’re not providing accurate responses, that’s a problem, and not something that can be explained away as random occurrences that are always, inevitably, going to exist.
Because while the platforms themselves may be keen to temper expectations around accuracy, consumers are already referring to chatbots for exactly that.
In this respect, it’s somewhat astounding to see Pichai acknowledge that AI tools aren’t necessarily the best way to get at “factuality” while also enabling them to provide answers to searchers. But the bottom line here is that the focus on data at scale is inevitably going to shift, and it won’t just be about how much data you can incorporate, but also how accurate that data is, in order to ensure that such systems produce good, useful results.
Which is where journalism, and other forms of high-quality inputs, come in.
Already, OpenAI has secured a new deal with News Corp to bring content from News Corp publications into its models, while Meta is now reportedly considering the same. So while publications may well be losing traffic to AI systems that provide all of the information that searchers need within the search results screen itself, or within a chatbot response, they could, in theory, recoup at least some of these losses through data sharing deals designed to improve the quality of LLMs.
Such deals could also reduce the influence of questionable, partisan news providers, by excluding their input from the same models. If OpenAI, for example, were to strike deals with all the mainstream publishers, while cutting out the “hot take”-style conspiracy peddlers, the accuracy of the responses in ChatGPT would surely improve.
In this respect, it’s going to become less about synthesizing the entire internet, and more about building accuracy into these models, through partnerships with established, trusted providers, which would also include academic publishers, government websites, scientific associations, etc.
Google would already be well-placed to do this, because through its Search algorithms, it already has filters to prioritize the best, most accurate sources of information. In theory, Google could refine its Gemini models to, say, exclude all sites that fall below a certain quality threshold, and that should see immediate improvement in its models.
There’s more to it than that, of course, but the concept is that you’re going to increasingly see LLM creators moving away from building the biggest possible models, and more towards refined, quality inputs.
Which could also be bad news for Elon Musk’s xAI platform.
xAI, which recently raised an additional $6 billion in capital, is aiming to create a “maximum truth seeking” AI system, which is not constrained by political correctness or censorship. In order to do this, xAI is being fueled by X posts. Which is likely a benefit in terms of timeliness, but with regard to accuracy, probably not so much.
Many false, ill-informed conspiracy theories still gain traction on X, often amplified by Musk himself, and that, given these broader trends, seems to be more of a hindrance than a benefit. Elon and his many followers, of course, would view this differently, with their left-of-center views being “silenced” by whatever mysterious puppet master they’re opposed to this week. But the truth is, the majority of these theories are incorrect, and having them fed into xAI’s Grok models is only going to pollute the accuracy of its responses.
But on a broader scale, this is where we’re heading. Most of the structural elements of the current AI models have now been established, with the data inputs now posing the biggest challenge moving forward. As Pichai notes, some flaws are inherent, and will always exist, as these systems try to make sense of the data provided. But over time, the demand for accuracy will increase, and as more and more websites cut off OpenAI, and other AI companies, from scraping their URLs for LLM input, those companies are going to need to establish data deals with more providers anyway.
Picking and choosing those providers could be viewed as censorship, and could lead to other challenges. But it will also lead to more accurate, factual responses from these AI bot tools.