TL;DR

AI companies are confronting a data bottleneck as researchers project that high-quality public web text could be fully used by frontier training runs between 2026 and 2032. Copyright litigation, licensing deals and demand for expert, enterprise and sovereign datasets are making legally usable data more expensive and harder to copy.

AI companies face a sharper data constraint in 2026: high-quality public web text is approaching projected limits, while major copyright settlements and licensing deals are making training material harder and more expensive to use, according to the Control Series source material and the research it cites. The shift matters because models and chips are becoming easier to buy or rent, while unique, verified datasets remain difficult to replicate.

Epoch AI estimates that the public internet contains roughly 300 trillion tokens of high-quality text, and the source material says frontier model training sets are already nearing that ceiling. Epoch projects that the stock of public human text could be fully used between 2026 and 2032, with a median estimate around 2028. The timing is a projection, not a settled endpoint, and it can shift as training methods change.

AI firms are trying to fill the gap with synthetic data and more efficient training. The source material cites Nvidia’s $320 million purchase of synthetic-data company Gretel and Microsoft’s use of hundreds of billions of synthetic tokens. But synthetic data has a known risk: when outputs are hard to verify, errors can compound across model generations, raising the value of fresh, checked human data.

The legal market is changing at the same time. Anthropic agreed to a $1.5 billion settlement with authors over alleged use of pirated books, covering about 500,000 works at roughly $3,000 per title, according to the source material. The settlement addressed past piracy claims and required destruction of the pirated files, but it did not settle future training practices or model-output disputes. The New York Times case against OpenAI remains active in discovery, while publishers including News Corp have moved toward licensing arrangements.
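The per-title figure in the Anthropic settlement follows from simple division. The sketch below checks it in Python using only the rounded numbers reported above; the constant and function names are illustrative, and the result is an average that ignores legal fees or any other deductions.

```python
# Rounded figures as reported in the source material (not exact court numbers).
SETTLEMENT_USD = 1_500_000_000  # about $1.5 billion
WORKS_COVERED = 500_000         # about 500,000 works


def per_title_payout(total_usd: int, works: int) -> float:
    """Average payout per covered work, before any fees or deductions."""
    if works <= 0:
        raise ValueError("works must be positive")
    return total_usd / works


payout = per_title_payout(SETTLEMENT_USD, WORKS_COVERED)
print(f"${payout:,.0f} per title")  # prints "$3,000 per title"
```

The same function shows how sensitive the headline number is to the count of covered works: if the final list of works were larger than 500,000, the average per title would fall below $3,000.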

AI Dispatch · The Control Series · Part 3

Chokepoint 03 — Data

Data: The One Thing You Can’t Rent

The free part of “all human knowledge” is running out. As compute and models commoditize, the corpus you can’t replicate becomes the moat — so data is being fenced, priced, and, in places, treated as a national asset.

Data tiers, from scarcest and most valuable to most commoditized:

Sovereign / real-world: Avengers combat data, FSD, ISR (can’t be bought)

Expert-authored: PhDs, lawyers, surgeons define “good” (the new gold)

Licensed content: paywalled, deal-only, now priced (fenced)

Public web text: scraped for free, exhausting ~2028 (commoditizing)

Key figures:

~300T: public text tokens, used up 2026–2032

$1.5B: Anthropic authors settlement, scraping era ends

$14.3B: Meta for 49% of Scale, triggered an exodus

“keep the model”: Ukraine’s condition, data as sovereign asset

The take

Data was supposed to be the abundant input. It’s the scarce one. It’s also the chokepoint you can actually own — so guard your proprietary data, and don’t hand it to a provider who can become your competitor (the lesson everyone fled Scale to learn). Nations: license it like Ukraine — keep the model, keep the leverage.

Sources: Epoch AI; PBS; Intl AI Safety Report 2026; NPR; Authors Guild; Wolters Kluwer; TechCrunch; TIME; CNBC; Ukraine MoD (2024–Jun 2026). Token estimates are projections; valuations as reported.

thorstenmeyerai.com · 03 / 06

Data Ownership Shapes AI Power

Data was once treated as the abundant input in AI development. The current fight shows that legally usable, high-quality data is becoming a scarce asset, especially when it sits behind paywalls, inside companies, in expert workflows or in sovereign systems.

That scarcity may favor large AI firms that can afford licensing, settlements and exclusive data partnerships. A $1.5 billion settlement or a large publishing deal is manageable for well-funded incumbents but can be a barrier for startups that lack the same capital.

For businesses, the practical issue is control. Enterprise data can improve AI systems, but sharing it with a provider may also help a future competitor. The source material frames the lesson plainly: organizations that own valuable datasets need to decide when to license them, when to hold them back and when to require that trained models remain under their control.

Scraping Gives Way To Licenses

For much of the generative AI boom, the operating model was to scrape broad public data first and address legal disputes later. That model is now under pressure from lawsuits, settlements and paid licensing. A judge in the Anthropic case drew a line between legally acquired books used for training and pirated books downloaded from shadow libraries, according to the source material.

The type of data AI labs want is also changing. Earlier machine-learning systems relied heavily on low-cost labeling work for broad classification tasks. Reasoning models and reinforcement-learning systems need more expert judgment from lawyers, scientists, physicians and other specialists who can define what a good answer looks like in high-value fields.

Real-world data is also becoming strategically sensitive. The source material points to autonomous-vehicle data, intelligence and battlefield data as examples that cannot simply be bought from a public market. It also cites Ukraine’s position that combat data used for AI should keep the resulting model under Ukrainian control, treating data as a national asset rather than a raw export.

“keep the model”
— Ukraine’s reported condition for combat-data use

Courts And Limits Still Open

It is not yet clear exactly when public human text will hit a functional limit for frontier training. Epoch AI’s range runs from 2026 to 2032, and the date can move depending on model size, training efficiency, duplication, synthetic data quality and new sources of verified human content.

Several legal questions also remain unresolved. The Anthropic settlement addressed past piracy claims, but it did not decide how future training should be licensed or how courts will handle claims about model outputs. The New York Times case against OpenAI and other disputes may still affect the rules for using copyrighted material in model training.

The durability of synthetic data is another open issue. It may help labs reduce dependence on public web text, but researchers have warned that machine-generated data can degrade model quality when errors are repeated and amplified.

Licensing Terms Drive Competition

AI companies are expected to keep pursuing licensed publisher content, expert-authored data, enterprise partnerships and sovereign data agreements. Those deals will set practical boundaries for who can train competitive models, what data can be reused and who controls the systems built from restricted datasets.

Companies with valuable internal data will face more pressure to set clear rules before connecting proprietary systems to outside AI providers. Courts, publishers and government agencies will shape the next phase as pending cases move forward and more data owners decide whether to sell access, restrict it or demand ownership rights over resulting models.

Key Questions

What is the actual development?

In 2026, AI developers are shifting from freely scraped public data toward licensed, proprietary, expert and sovereign datasets, because training material is becoming harder to obtain legally and at scale.

Does this mean AI models cannot improve?

No. AI models can still improve through better algorithms, synthetic data, expert feedback and specialized datasets. The point is that the easiest source of large-scale public text appears to be nearing practical limits for frontier training.

Why can’t companies just use synthetic data?

Synthetic data can reduce pressure on human-made datasets, and major AI companies already use it. The risk is that machine-generated material can repeat or amplify errors, especially in domains where correct answers are hard to verify.

Why does this matter for businesses?

Business data can become a bargaining chip or a competitive risk. Firms that share proprietary data with AI vendors may improve their tools, but they may also help train systems that weaken their own advantage.

What should readers watch next?

Watch the outcome of pending copyright cases, the size and exclusivity of publisher licensing deals, and whether enterprises or governments require that models trained on their data remain under their control.

Source: Thorsten Meyer AI

Data: The One Thing You Can’t Rent

Author

The Happy Loved Life Team
