Newsrooms Are Blocking the Internet Archive to Stop AI, and the Web Can't Tell the Difference

On June 1, 2026, the Internet Archive published a public plea to keep the news in its collection.

The Wayback Machine has preserved the web for nearly thirty years and saved more than a trillion pages. It is what journalists, researchers, and courts reach for when a story is updated, a page is quietly edited, or a site goes offline. That record is now being pulled back. By Nieman Lab's count, more than 340 news outlets block the Archive's crawler, including the New York Times, the Guardian, and USA Today. The reason they give is AI.

What is actually happening

Publishers are not accusing the Archive of misusing their work. They block it because they worry someone else will. AI companies are increasingly shut out of scraping news sites directly, and publishers fear those firms will reach the same articles through the Archive's open collections instead. The Guardian told Nieman Lab the Archive's open API was an obvious place for AI firms to plug in. So the Times added the Archive's crawler to its robots.txt in late 2025 and now blocks it outright. Hundreds of regional titles owned by Gannett, McClatchy, Advance Local, and Tribune have done the same.

No publisher has confirmed that an AI company ever scraped its content through the Wayback Machine. The blocking is a precaution against proxy scraping, not a response to a known case. The threat is still theoretical, but the loss to the public record is real.

Why it became all-or-nothing

The web's access controls answer one question: should this crawler be allowed in? Tools such as robots.txt, IP blocks, and user-agent filters decide based on who is knocking, not on why. There is one lever, allow or deny, and two very different intentions sit behind it: preserve a page once for the record, or ingest everything on it to train a commercial model. To deny the second, a publisher has to deny the first.

It is like closing the public library because someone keeps photocopying the books. The photocopier is the problem, but the library is what gets shut.
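To make the single lever concrete, here is a minimal Python sketch using the standard library's urllib.robotparser. The crawler names and the URL are hypothetical placeholders, not the real user-agent strings of any archive or AI company. It is an illustration of how a robots.txt rule is evaluated, not a description of any publisher's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the crawler names below are placeholders.
ROBOTS_TXT = """\
User-agent: example-archive-bot
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical article URL, used only for illustration.
URL = "https://news.example/2026/06/01/story"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(crawler_name: str, url: str) -> bool:
    """The only inputs are who is asking and which path they want."""
    return parser.can_fetch(crawler_name, url)


for name in ("example-archive-bot", "example-reader-bot"):
    print(name, is_allowed(name, URL))
```

Nothing in that check says why a crawler wants the page. The archive crawler is refused for every purpose, and any crawler that presents a different name falls through to the wildcard rule, whatever it plans to do with the text.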

The same blunt instrument, again

This is a pattern. It is the same as blocking every bot from a store to stop the fraudulent ones, or treating every AI shopping agent as an attacker when many arrive with a paying customer. When a system cannot tell intentions apart, it over-blocks and loses the good traffic with the bad. The precise option, sorting automation by what it is actually doing, is harder to build, so the web keeps reaching for the blunt one.

It can be built. The Archive now limits bulk downloads and works with Cloudflare to monitor bot activity. Telling crawlers apart by what they do, not by what they call themselves, is exactly the work that needs doing.
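As a rough illustration of sorting traffic by behaviour, the sketch below flags a client once it exceeds a request budget inside a sliding time window. The class name, the thresholds, and the client labels are all assumptions made for this example. It is not how the Archive or Cloudflare implement their limits, which the reporting does not describe.

```python
from collections import defaultdict, deque


class BulkDetector:
    """Hypothetical sliding-window counter keyed by client identifier."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> recent timestamps

    def record(self, client_id: str, timestamp: float) -> bool:
        """Record one request; return True while the client is within budget."""
        hits = self._hits[client_id]
        # Drop requests that have aged out of the window.
        while hits and timestamp - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(timestamp)
        return len(hits) <= self.max_requests


# Illustrative budget: at most 3 requests per 60 seconds.
detector = BulkDetector(max_requests=3, window_seconds=60.0)

# A client that fetches one page looks different from one that fetches many.
single_visit = detector.record("reader", 0.0)
burst = [detector.record("bulk-client", t) for t in (0.0, 1.0, 2.0, 3.0)]
print(single_visit, burst)
```

A request count is a crude signal on its own. The point is only that the decision depends on what a client does over time, so a one-off capture and a bulk harvest no longer get the same answer.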

Europe already wrote the distinction into law

The European framework is ahead of the blunt practice. Under Article 4 of the EU's 2019 Copyright Directive, text-and-data mining, which European courts have repeatedly held includes AI training, is allowed unless a rightsholder reserves their work with a machine-readable signal. The EU AI Act then requires general-purpose AI providers to honour those reservations. Article 3 of the same Directive also protects mining by research and cultural-heritage institutions in a way private contracts cannot override.

Together, the European model says what the all-or-nothing block ignores: a publisher can refuse AI training specifically, with a signal machines must respect, without refusing access, search, or preservation. German courts affirmed the principle in the LAION and GEMA v OpenAI cases: training counts as mining, and the machine-readable opt-out is the control point. The licensing deals some publishers have signed with AI firms make the same point through the market. Price the training use, keep the archiving open.

What a better posture looks like

The fix is granularity, and it exists today. State access policy by purpose instead of as a single gate: preserve, search, AI-input, and AI-training as separate, machine-readable declarations. Reserve and license the training right, where the value and the harm sit. Leave archiving and search open, because that is what sends readers and historians back to the source. A newsroom that signals "no training" loses nothing it was being paid for. A newsroom that blocks the Archive loses its own ability to cite, verify, and correct the record later.
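A purpose-based policy of this kind can be sketched in a few lines of Python. Everything here is hypothetical: the enum names, the policy table, and the decide function are invented for illustration and do not correspond to an existing standard or file format. The setting for AI-input is an arbitrary choice, and so is refusing any purpose the policy does not list.

```python
from enum import Enum


class Purpose(Enum):
    PRESERVE = "preserve"
    SEARCH = "search"
    AI_INPUT = "ai-input"
    AI_TRAINING = "ai-training"


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    LICENSE_REQUIRED = "license-required"


# Hypothetical newsroom policy: one declaration per purpose, not one gate.
NEWSROOM_POLICY = {
    Purpose.PRESERVE: Decision.ALLOW,
    Purpose.SEARCH: Decision.ALLOW,
    Purpose.AI_INPUT: Decision.LICENSE_REQUIRED,  # illustrative choice only
    Purpose.AI_TRAINING: Decision.LICENSE_REQUIRED,
}


def decide(policy: dict, purpose: Purpose) -> Decision:
    """Look up the declared purpose; refuse anything the policy omits."""
    return policy.get(purpose, Decision.DENY)


for purpose in Purpose:
    print(purpose.value, "->", decide(NEWSROOM_POLICY, purpose).value)
```

With a table like this, reserving the training right does not touch archiving or search. The hard part is outside the code: getting crawlers to declare a purpose honestly, and having a way to check them when they do not.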

Why it matters

A public record that can be edited or deleted with no independent copy is barely a record. Links rot, pages change, and the institutions most able to alter a story are sometimes the ones who published it. The Wayback Machine is the check on that. Withdrawing it to guard against a risk no one has measured trades a permanent loss for a hypothetical one, when better tools for that risk already exist.

If you think the news belongs in the public record, the Internet Archive and Fight for the Future have an open letter to media leaders asking for exactly that.

Frequently asked questions

Why are news publishers blocking the Internet Archive?

Not for anything the Archive did. They fear AI companies, increasingly blocked from scraping news sites directly, will reach their articles through the Archive's open collections, so they block its crawler as a precaution.

Has AI actually scraped news content through the Wayback Machine?

No publisher has confirmed it. Reporting describes the blocking as preemptive, a fear of proxy scraping rather than a documented case.

Why can't publishers block AI but keep the Archive?

Because robots.txt and IP filters decide by which crawler is knocking, not by purpose. There is one allow-or-deny lever, and preservation and AI-training sit behind the same one.

What does EU law say?

The Copyright Directive lets rightsholders reserve their work against text-and-data mining, which includes AI training, through a machine-readable opt-out, and the AI Act requires AI providers to honour it. That is a purpose-specific control that does not require blocking access or preservation.

A note from Kairal

We build the layer that tells automated traffic apart: archival from extractive, customer from scraper, declared agent from disguised one. The answer to a bot does not have to be a wall. The Archive's situation is the same problem the open web keeps hitting. Without classification, the only tool left is the blunt one.