From oil to gemstones: Our shifting understanding of the value of data

  • Sector: Patent law, Pharmaceuticals
  • 10th October 2025
The concept of data as oil has been around for a number of years, but does the analogy still hold? In the pharma and biotech industry, there is now a shift away from thinking of data as a bulk commodity of raw material towards the pursuit of high-quality data that can improve the performance of AI models. This shift has important implications for IP strategy and for licensing provisions relating to data.

Originally posted on IPKat.

Data is a very broad term. At a very basic level, data can be defined as a collection of facts, figures, and statistics. As the Google Books Ngram Viewer reveals, the term “data” only really came to prominence in the 1990s, followed by a steep rise in usage up to the early years of the millennium. The increased use of the term correlated with the vast increase in information that became available to society from the 1990s onwards, brought about partly by the internet and partly by technological advances in computing and scientific discovery (such as genetic sequencing technologies).

The concept of data as “the new oil” was brought to popular attention by a headline in The Economist back in 2017 proclaiming that “The world’s most valuable resource is no longer oil, but data”. Interestingly, since then, much of the discussion about data and its use in the economy has focused on the negative connotations of that phrase, particularly the exploitation of personal information by big tech. However, for the science and technology sector that the IP industry serves, the analogy of data as oil holds a different meaning. The explosion in the scale and complexity of information that arose at the turn of the millennium gave rise to its own field of big data (which became a buzzword around 2007) and associated specialisms devoted to analysing and processing this data (anyone remember systems biology?).

Importantly, this big data provided an essential bedrock for the AI systems that were to follow. The modern field of machine learning is fundamentally dependent on vast amounts of data to learn, improve, and make accurate predictions. There would be no Nobel Prize-winning AlphaFold without the public data bank of protein sequences (Evolve Insights), and there would be no Large Language Models (LLMs) without the vast quantity of language data available on the internet. These huge repositories of information were essential to train the models that are the foundation of modern AI.

The different types of data

The training data that has fuelled the field of AI takes multiple forms, spanning numerous formats and scientific domains, depending on the type of AI. For early visual AI models like Convolutional Neural Networks (CNNs), the primary data consists of images. This could include, for example, medical imagery such as breast scans and MRI scans used for detecting cancer, as well as visual, infrared, and radar images collected from aerial surveillance for applications like disaster management. Importantly, the current generation of AI models used in scientific discovery relies on vast datasets such as genomic and proteomic information from gene sequencing, data from the Protein Data Bank containing protein structures, and simulation results from semiconductor chip design. For generative AI like LLMs, the training data is predominantly text and code, often scraped from the public internet.

Beyond the initial training data, another crucial category of data in machine learning revolves around the performance and refinement of the AI models themselves. This includes model optimisation data, such as the human feedback gathered during techniques like Reinforcement Learning from Human Feedback (RLHF). In this process, human reviewers rank AI-generated responses or identify errors, and this feedback is then used as data to update and fine-tune the model. The output of the models also constitutes a key data type, which can range from a ranked list of documents generated by a prior art search tool to the predicted 3D structures of proteins produced by AlphaFold. Finally, statistics about a model’s effectiveness are a critical form of data used to evaluate performance. For instance, research comparing the accuracy of machine learning systems against human doctors provides quantitative data on the model’s utility.
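
To make the reward-modelling step concrete, the minimal sketch below (in Python, with invented reward scores rather than a real model) shows the pairwise preference loss that many RLHF pipelines use: the loss shrinks as the model scores the human-preferred response further above the rejected one.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style pairwise loss used in many RLHF reward models:
    -log(sigmoid(r_chosen - r_rejected)). The loss shrinks as the model
    scores the human-preferred response further above the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy ranking data: in each pair a human reviewer preferred the first
# response; in a real pipeline the scores would come from the reward model.
pairs = [(2.1, 0.3),    # clear, correct preference  -> small loss
         (0.5, 0.4),    # weak preference            -> loss near log(2)
         (-0.2, 1.0)]   # model disagrees with human -> large loss

for r_chosen, r_rejected in pairs:
    print(f"loss = {preference_loss(r_chosen, r_rejected):.3f}")
```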

Searching for the data gemstones

We can all accept that big data has been the oil that has fuelled the engine of AI. In the same way an engine cannot run without fuel, AI models are powerless without vast quantities of information to train on. However, the analogy begins to break down when we consider quality. Unlike crude oil, which is a relatively uniform commodity, there is a vast and often unappreciated difference in the type and quality of data available.

However, the enduring popularity of the “data as oil” analogy leads to a persistent misconception that all data is valuable. It is true that, in the early days of big data, the focus was predominantly on volume. However, we now have so much data that a lot of it is not only worthless but can actively harm an AI model by introducing irrelevant noise and exacerbating biases. Just one example is the patent data used to train LLMs. Google Patents contains a huge number of badly written patent applications (and granted patents). As a result, training an LLM on more of this data is unlikely to improve its performance for drafting patent applications.
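
To illustrate what separating the worthless from the useful looks like in practice, here is a minimal sketch of the kind of heuristic quality filters an LLM developer might apply before adding patent text to a training corpus (the thresholds and heuristics below are invented for illustration, not a production pipeline):

```python
def passes_quality_filters(doc: str) -> bool:
    """Hypothetical heuristics for screening patent text before training.
    Real pipelines combine many such signals, often with model-based scoring."""
    words = doc.split()
    if len(words) < 200:                    # too short to be a useful specification
        return False
    if len(set(words)) / len(words) < 0.2:  # heavily repetitive boilerplate
        return False
    digit_ratio = sum(ch.isdigit() for ch in doc) / len(doc)
    if digit_ratio > 0.3:                   # mostly reference numerals / OCR noise
        return False
    return True

corpus = ["... raw patent text ...", "... another document ..."]
training_docs = [doc for doc in corpus if passes_quality_filters(doc)]
print(f"kept {len(training_docs)} of {len(corpus)} documents")
```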

Consequently, AI software developers are refocusing their efforts on mining high-quality data. Cleaner and more relevant datasets are becoming increasingly valuable, and attention is turning to how they can be obtained. In other words, we are now looking for the data gemstones. A gemstone is rare, precisely formed, and valuable. For an AI model for predicting tumours, a data gemstone might take the form of a curated medical dataset in which thousands of MRI scans have been meticulously annotated by multiple expert radiologists to identify tumours. For companies developing self-driving cars, valuable data is no longer another million miles of uneventful motorway driving. Instead, it is the rare video footage of a near-miss accident, a complex intersection in heavy rain, or a child running into the road. These high-quality, often rare, data points are the key to improving the performance of AI models so that they can perform more expert tasks with greater accuracy.
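
As a toy illustration of how multi-expert annotation turns raw scans into a data gemstone, the sketch below keeps a consensus label only where enough annotators agree and flags disagreements for adjudication (the labels and agreement threshold are hypothetical):

```python
from collections import Counter

def consensus_label(annotations: list[str], min_agreement: float = 0.75):
    """Return the majority label if enough expert annotators agree,
    otherwise return None so the scan is routed for adjudication."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count / len(annotations) >= min_agreement else None

# Four hypothetical radiologists labelling the same MRI scan.
print(consensus_label(["tumour", "tumour", "tumour", "benign"]))  # tumour
print(consensus_label(["tumour", "benign", "tumour", "benign"]))  # None (disagreement)
```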

The shift in focus from data as oil to searching for the data gemstones is just as applicable to data relating to model optimisation and performance, where attention is likewise turning to the rare, high-value pieces of performance data that can improve the quality of a model’s output. In the context of LLMs, for example, hallucinations in highly technical areas are unlikely to be identified and ironed out with a sledgehammer approach using more of the same generalist data on which the model was trained. Instead, the focus is now on expert annotation and optimisation using RLHF, and on developing methods by which the errors and hallucinations generated by a model can be identified.
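
One way to harvest this kind of high-value error data is to flag outputs where a model is inconsistent with itself. The sketch below is a simplified self-consistency check; ask_model is a hypothetical stand-in for a call to an LLM, and the sampling count and threshold are illustrative:

```python
from collections import Counter

def flag_for_expert_review(ask_model, question: str,
                           samples: int = 5, threshold: float = 0.8) -> bool:
    """Sample the model several times on the same question; if the answers
    disagree too often, treat the output as a hallucination risk and flag
    it for review by a human expert."""
    answers = [ask_model(question) for _ in range(samples)]
    _, top_count = Counter(answers).most_common(1)[0]
    return (top_count / samples) < threshold

# ask_model is assumed to be a hypothetical function that queries an LLM at
# non-zero sampling temperature and returns a short, normalised answer string.
```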

Implications for IP strategy

The shift from data as a bulk commodity to curated gemstones of data has important implications for IP strategy. The value of data currently lies not in the raw information itself, but in the intellectual effort, investment, and expertise applied to curate, annotate, and structure the data. This is also where protectable IP can and does reside.

The forms of IP most relevant to data include database rights, copyright, and trade secrets. In the UK and EU, sui generis database rights protect the substantial investment made in obtaining, verifying, or presenting the contents of a database. Database rights can protect the investment put into creating valuable data, for example, by funding an extensive clinical trial and analysing the results with bioinformatic techniques, or by employing experts to annotate thousands of images.

Additionally, in many cases, the most powerful protection lies in treating these high-value datasets as trade secrets. A curated dataset derives its economic value from not being generally known and can be subject to reasonable steps to keep it secret. This is not an entirely new concept. The pharmaceutical industry, for example, has long treated its high-quality clinical trial data, compound libraries, and proprietary assay results as fiercely protected crown jewels. 

However, a critical point for businesses to grasp is that the act of identifying, cleaning, annotating, and structuring data for AI model training is a significant value-add activity in and of itself. Many non-tech companies, including pharma companies, may be sitting on vast repositories of raw data but lack the internal expertise to refine that data so as to make it useful for training an AI model. This disconnect creates a significant risk that a company may undervalue its assets in collaborations with tech partners. On the flip side, it also explains why an early partnership with an AI company may be necessary to extract the true value from the data.

Implications for licensing

From an IP licensing perspective, it is important to capture the distinction between raw, low-quality data and processed, annotated data in licence and collaboration agreements. A standard data licence that grants broad rights to a dataset for a nominal fee is no longer fit for purpose. Agreements must now be far more sophisticated and use precise definitions to clearly delineate the licensed curated data from any raw inputs. Furthermore, contractual terms must address the outputs and applications of the data. This includes asserting ownership of, or at least rights of access to, any improvements, models, or insights derived from the licensed data. Concurrently, strict field-of-use restrictions are essential to limit the licensee’s application of the data to a specific purpose, preventing them from using the asset to train other models that could compete with the licensor’s core business. Finally, the valuation of the licence must be recalibrated: the licence fees, royalties, or equity stake should reflect the true value of the curated data as a critical enabling asset, not merely as a cost-of-goods-sold commodity.

Final thoughts

The phrase “data is the new oil” remains relevant to our understanding of how today’s AI models were developed. However, in a supply-led economy, the sheer amount of data that we now have, together with the fact that most of it is of low quality, has decreased the value of generic new data as a bulk commodity that can be mined from the masses. IP strategy needs to recognise this shift and move the focus to protecting the data that possesses the true value: the rubies, emeralds, and sapphires that represent the new data gemstones.

Further reading 

  • AlphaFold: From Nobel Prize to drug-discovery gold mine? (Nov 2024)
  • Use of AI in the patent industry: The spectre of hallucination (Oct 2025)
  • Use of AI in the patent industry: Solving the confidentiality problem (Oct 2025)

