
New Internet Rules Will Block AI Training Bots

New rules will give publishers the ability to block all AI Training bots using currently available standards


New standards are being developed to extend the Robots Exclusion Protocol and Meta Robots tags, allowing them to block all AI crawlers from using publicly available web content for training purposes. The proposal, drafted by Krishna Madhavan, Principal Product Manager at Microsoft AI, and Fabrice Canel, Principal Product Manager at Microsoft Bing, will make it easy to block all mainstream AI training crawlers with one simple rule, which can also be applied to individual crawlers.

Virtually all legitimate crawlers obey the Robots.txt protocol and Meta Robots tags, which makes this proposal a dream come true for publishers who don’t want their content used for AI training purposes.

Internet Engineering Task Force (IETF)

The Internet Engineering Task Force (IETF) is an international Internet standards-making group, founded in 1986, that coordinates the development and codification of standards that everyone can voluntarily agree on. For example, the Robots Exclusion Protocol was independently created in 1994, and in 2019 Google proposed that the IETF adopt it as an official standard with agreed-upon definitions. In 2022 the IETF published an official Robots Exclusion Protocol standard (RFC 9309) that defines the protocol and extends the original.

Caveat

The new draft proposal comes with a caveat:

“This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as ‘work in progress.'”

Reason For Proposal: Bring Protocol Up To Date

The purpose of the proposal is to bring the Robots.txt protocol up to date with the modern reality that AI-related bots now crawl websites.

The draft explains:

“While the Robots Exclusion Protocol enables service owners to control how, if at all, automated clients known as crawlers may access the URIs on their services as defined by [RFC8288], the protocol doesn’t provide controls on how the data returned by their service may be used in training generative AI foundation models. Application developers are requested to honor these tags. The tags are not a form of access authorization however.”

Three Ways To Block AI Training Bots

The draft proposal suggests three ways to block AI training bots:

  1. Robots.txt Protocols
  2. Meta Robots HTML Elements
  3. Application Layer Response Header

1. Robots.Txt For Blocking AI Robots

The draft proposal seeks to create additional rules that will extend the Robots Exclusion Protocol (Robots.txt) to AI training robots. This will bring some order and give publishers a choice in which robots are allowed to crawl their websites.

Adherence to the Robots.txt protocol is voluntary but all legitimate crawlers tend to obey it.

As quoted above, the draft notes that while the existing protocol lets site owners control how crawlers access the URIs on their services, it provides no control over how the returned data may be used to train generative AI foundation models. The new Robots.txt rules are meant to close that gap.

An important quality of the new Robots.txt rules and meta robots HTML elements is that legitimate AI training crawlers tend to voluntarily follow them, just as legitimate bots follow the existing protocols. This will simplify bot blocking for publishers.

The following are the proposed Robots.txt rules (a sketch of how they might appear in a robots.txt file follows the list):

  • DisallowAITraining – instructs the parser to not use the data for AI training language model.
  • AllowAITraining – instructs the parser that the data can be used for AI training language model.
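
As a rough sketch of how these rules might appear in a robots.txt file, the example below pairs the proposed directives with the familiar User-Agent grouping. The path-style values are an assumption modeled on the existing Allow and Disallow rules, and examplebot is a placeholder crawler name; the draft itself is the authority on the final syntax.

  User-Agent: *
  DisallowAITraining: /

  User-Agent: examplebot
  AllowAITraining: /

Read in the spirit of the existing protocol, the first group would tell every crawler not to use any of the site’s content for AI training, while the second would grant that permission to a single named crawler.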

2. HTML Element (Robots Meta Tag)

The following are the proposed meta robots directives; the robots name applies to all crawlers, while a specific crawler can be addressed by its own name (examplebot below is a placeholder). A combined example follows the list:

  • <meta name="robots" content="DisallowAITraining">
  • <meta name="examplebot" content="AllowAITraining">
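
If a publisher wanted to block AI training in general while permitting one named crawler, the two directives could in principle be combined in a page’s head. This combination is an illustration rather than something the article spells out, and examplebot remains a placeholder name:

  <head>
    <meta name="robots" content="DisallowAITraining">
    <meta name="examplebot" content="AllowAITraining">
  </head>

Whether a bot-specific allow overrides the general disallow would be up to the final standard and to each crawler’s implementation.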

3. Application Layer Response Header

Application Layer Response Headers are sent by a server in response to a browser’s request for a web page. The proposal suggests adding new rules to the application layer response headers for robots (a sketch of such a response follows the quoted rules):

“DisallowAITraining – instructs the parser to not use the data for AI training language model.

AllowAITraining – instructs the parser that the data can be used for AI training language model.”
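
As a hedged illustration, and assuming the rule name itself serves as the header field name with a path-style value (the article does not spell out the exact header syntax), a response carrying the directive might look roughly like this:

  HTTP/1.1 200 OK
  Content-Type: text/html; charset=utf-8
  DisallowAITraining: /

In practice a publisher would typically add such a header through server or CDN configuration rather than by editing individual pages, which would also make the response-header option workable for non-HTML resources such as PDFs and images.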

Provides Greater Control

AI companies have been sued over their use of publicly available data, so far unsuccessfully. The companies assert that crawling publicly available websites is fair use, just as search engines have done for decades.

These new protocols give web publishers control over crawlers whose purpose is to consume training data, bringing those crawlers under the same kind of control that already applies to search crawlers.

Read the proposal at the IETF:

Robots Exclusion Protocol Extension to manage AI content use

Featured Image by Shutterstock/ViDI Studio
