---
title: "A Vocabulary for Controlling Usage of Content Collected by Search and AI Crawlers"
abbrev: "displaybasedpref"
category: info

docname: draft-madhavan-aipref-displaybasedpref-latest
submissiontype: IETF  # also: "independent", "editorial", "IAB", or "IRTF"
number:
date:
consensus: true
v: 3
area: "Web and Internet Transport"
workgroup: "AI Preferences"
keyword:

author:
  - fullname: "K. Madhavan"
    organization: Microsoft Corporation
    email: krishna.madhavan@microsoft.com
  - fullname: "F. Canel"
    organization: Microsoft Corporation
    email: fabrice.canel@microsoft.com
  - fullname: "J. Gimbel"
    organization: Microsoft Corporation
    email: jordangimbel@microsoft.com
  - fullname: "S. Cooper"
    organization: Microsoft Corporation
    email: sonia.cooper@skype.net

normative:

informative:

--- abstract
This document proposes a standardized vocabulary to express preferences for usage of digital content collected by Search and AI crawlers. This vocabulary allows for the creation of structured declarations about restrictions or permissions for use of content retrieved by such systems.
--- middle
This document defines a common vocabulary of terms for search and AI systems that process digital content. The primary purpose of this vocabulary is to enable machine-readable expressions of preferences about using digital content collected by Search and AI crawlers.
The terms defined by the vocabulary can be used to describe, in a standardized way, the types of uses that a declaring party may wish to explicitly restrict or allow. Preferences are then expressed as a grant or denial of permission concerning each of the types of use defined in the vocabulary. This ensures that preferences can be communicated, processed, and stored in a consistent and interoperable manner.
Neither the vocabulary nor the preferences that might be expressed prescribe how automated processing systems obtain or act on preferences. Separate documents will describe how preferences might be associated with digital content. The vocabulary is designed to ensure that preference information can be exchanged between different systems and consistently understood. This document also identifies existing implementations of certain vocabulary elements, to help readers connect these concepts to current preferences supported by most search engines and AI solutions. The authors anticipate removing the references to existing implementations in the final version.
Expressing preferences is without prejudice to applicable laws, including the applicability of exceptions and limitations to copyright.
{::boilerplate bcp14-tagged}
A crawler is an automated program that scans the web, collecting content (web pages, images, documents, etc.) or the availability status of each URI scanned.
The vocabulary is a set of categories, each of which is defined to cover a class of usage for digital content. The section on vocabulary defines these categories in more detail.
A statement of preferences is made about a specific piece of digital content. A statement of preferences can assign a preference to each of the categories of use in the vocabulary.
A statement of preferences can express preferences about some, all, or none of the categories from the vocabulary. This can mean that no preference is expressed for a given usage category.
In the absence of a statement of preference, no preference is set.
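The model described above can be sketched as a small data structure. The following Python sketch is illustrative only: the `Preference` enum, the `Statement` class, and the use of category labels as dictionary keys are assumptions made for this example and are not defined by this document. For simplicity it covers only categories that are allowed or disallowed, not those that carry a numeric limit.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Preference(Enum):
    ALLOW = "allow"
    DISALLOW = "disallow"


@dataclass
class Statement:
    """Hypothetical container for a statement of preferences about one piece of content."""

    # Maps a usage category label to an explicit preference.
    # A category that is absent from the mapping has no preference expressed.
    preferences: Dict[str, Preference] = field(default_factory=dict)

    def lookup(self, category: str) -> Optional[Preference]:
        # None models "no preference is set" for that category.
        return self.preferences.get(category)


# A statement that expresses a preference about one category only.
statement = Statement({"train-genAI": Preference.DISALLOW})

# A statement that expresses no preferences at all.
empty = Statement()
```

In this sketch, `statement.lookup("index")` returns `None`, which corresponds to no preference being expressed for that usage category.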
TODO Conformance
This specification provides a set of definitions for different categories of use based on expressed display preferences.
This specification does not provide any enforcement mechanism for those preferences, and conformance to it does not encompass whether preferences are actually respected during data processing.
Preferences do not themselves create rights or prohibitions, either in the positive or the negative. Other mechanisms—technical, legal, contractual, or otherwise—might enforce stated preferences and thereby determine the consequences of following or not following a stated preference.
An entity that receives usage preferences MAY choose to respect those preferences it has discovered, according to an understanding of how the asset is used, how that usage corresponds to the usage categories where preferences have been stated, and the applicable legal context.
Usage preferences can be ignored due to express agreements between relevant parties, explicit provisions of law, or the exercise of discretion in situations where widely recognized priorities justify doing so. Priorities that could justify ignoring preferences include, but are not limited to, free expression, safety, education, scholarship, research, preservation, interoperability, and accessibility.
Because enforcement is not provided by this specification, the consequences of ignoring preferences could vary depending upon how a given legal jurisdiction recognizes preferences.
The following definitions apply to content collected by search and AI crawlers. They do not cover user-initiated access to content. The categories apply independently of each other; when preferences for several categories are present, the most restrictive takes precedence.
The act of allowing or disallowing the indexing or retrieval, for purposes of display, of content collected by web crawlers. Such a preference mechanism can also be applied in cases where digital content is not accessible. In existing implementations, access preferences are typically expressed via the NOINDEX statement set in an HTTP header or in HTML meta tags.
The act of allowing or disallowing the reproduction of text content collected by a web crawler, except for the title if specified, whether in whole or in part, in order to display portions of that content. In existing implementations, preferences on which text can be used for captions are expressed via the NOSNIPPET statement set in an HTTP header or HTML meta tags, or via the data-nosnippet HTML attribute.
The act of limiting the number of characters displayed as text from content collected by a web crawler. In existing implementations, quotation length preferences are expressed via the max-snippet statement set in an HTTP header or HTML robots meta tags.
The act of limiting text content to only an exact match when displaying text content from the document. If this preference is present, text content must either be quoted as is or not used, and an explicit link back to the source of the document must be provided in that instance. One example of an existing implementation of text quotation preferences is notranslate.
The act of limiting the usage and size of images. In existing implementations, image preview preferences are typically expressed via the max-image-preview statement set in an HTTP header or HTML meta tags.
The act of limiting the usage and length of videos. In existing implementations, video preview preferences are typically expressed via the max-video-preview statement set in an HTTP header or HTML robots meta tags.
The act of using content in training general-purpose AI models that have the intent to generate text, images, or other forms of synthetic content, or the act of training more specialized AI models that have the purpose of generating text, images, or other forms of synthetic content. In existing implementations, preferences are communicated via robots.txt, an HTTP header, or HTML robots meta tags.
The vocabulary is used by referencing the terms defined in the section on vocabulary, directly or via mappings, in accordance with how they are defined in this document.
Extensions to the vocabulary might define more specific categories of usage. Preferences about more specific categories override those of any more general category.
Statements of preferences are general-purpose, machine-readable statements that cannot override contractual agreements or more specific statements.
For instance, a statement of preferences might indicate that the use of a piece of digital content is disallowed for Generative AI Training. If arrangements, such as legal or business agreements, exist that explicitly permit the use of that digital content, those arrangements are likely to apply, unless the terms of the arrangement explicitly say otherwise.
Each usage category in the vocabulary is mapped to a short textual label. The table below shows this mapping.
Category | Label | Reference |
---|---|---|
Indexing and retrieval | index | indexing |
Display text | display-text | display-text |
Display text length | max-text-length | display-text-length |
Exact text match | match-text | exact-text-match |
Image preview | max-image-preview | image-preview |
Video preview | max-video-preview | video-preview |
Generative AI training | train-genAI | gen-ai-training |
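To connect the labels in this table with the existing implementations mentioned in the category definitions, the following Python sketch translates a robots-style directive string (as might appear in an `X-Robots-Tag` HTTP header or a robots meta tag) into those labels. The function name `directives_to_labels`, the shape of the output, and the choice to represent a denial as `False` are assumptions for illustration; this document does not define such a mapping procedure.

```python
from typing import Dict, Optional, Union

Value = Union[bool, int, str]


def _to_int(text: str) -> Optional[int]:
    """Return text as an integer, or None if it is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None


def directives_to_labels(robots_value: str) -> Dict[str, Value]:
    """Translate a comma-separated robots directive string into vocabulary labels.

    Hypothetical helper: the pairing of directives with labels follows the
    existing implementations named in the category definitions, but the
    output format is an assumption of this sketch.
    """
    result: Dict[str, Value] = {}
    for raw in robots_value.split(","):
        directive = raw.strip().lower()
        if not directive:
            continue
        name, _, arg = directive.partition(":")
        name, arg = name.strip(), arg.strip()
        if name == "noindex":
            result["index"] = False  # indexing and retrieval denied
        elif name == "nosnippet":
            result["display-text"] = False  # text display denied
        elif name == "max-snippet":
            limit = _to_int(arg)
            if limit is not None:
                result["max-text-length"] = limit
        elif name == "max-image-preview":
            if arg:
                result["max-image-preview"] = arg  # kept as the literal setting
        elif name == "max-video-preview":
            limit = _to_int(arg)
            if limit is not None:
                result["max-video-preview"] = limit
        # Unrecognized or malformed directives are ignored in this sketch.
    return result


example = directives_to_labels(
    "nosnippet, max-image-preview:standard, max-video-preview:10"
)
```

The sketch deliberately leaves out `match-text` and `train-genAI`, because the text does not name a single directive that maps to them unambiguously.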
An important note about this process and format is that, if the same key appears multiple times, only the last value is taken. This means that duplicating the same key could result in unexpected outcomes.
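The last-value-wins behavior can be illustrated with a short Python sketch. The concrete syntax is not defined in this document, so the comma-separated `label=value` form, the `y` and `n` values, and the function name `parse_expression` are all assumptions chosen purely for illustration.

```python
from typing import Dict


def parse_expression(expression: str) -> Dict[str, str]:
    """Parse a hypothetical 'label=value, label=value' preference expression.

    When a label is repeated, the later value overwrites the earlier one,
    mirroring the last-value-wins behavior described in the text.
    """
    parsed: Dict[str, str] = {}
    for item in expression.split(","):
        key, sep, value = item.partition("=")
        key, value = key.strip(), value.strip()
        if not key or not sep:
            continue  # skip empty or malformed items in this sketch
        parsed[key] = value  # a repeated key replaces the earlier value
    return parsed


result = parse_expression("train-genAI=n, index=y, train-genAI=y")
```

Here the earlier `train-genAI=n` is silently replaced by the later `train-genAI=y`, which shows how a duplicated key can produce an outcome the declaring party did not intend.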
If the expression contains an explicit preference, that is the result.
Otherwise, no preference is expressed.
The application might have multiple preference expressions, obtained using different methods.
If multiple preference expressions are active, all preference expressions are consulted as described in the section on Applicability and Legal Effect. This might result in conflicting answers.
If any preference expression indicates that the usage is restricted, the result is that the usage is restricted.
Otherwise, if any preference allows the usage, the result is that the usage is allowed.
Otherwise, no preference is set.
This process ensures that the most restrictive preference applies.
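The combination steps above can be sketched in Python as follows. The `Preference` enum, the use of `None` for "no preference", and the function name `combine` are assumptions made for this example; only the order of the checks is taken from the text.

```python
from enum import Enum
from typing import Iterable, Optional


class Preference(Enum):
    ALLOW = "allow"
    DISALLOW = "disallow"


def combine(results: Iterable[Optional[Preference]]) -> Optional[Preference]:
    """Combine per-expression results for a single usage category.

    Each input is the result of evaluating one preference expression;
    None stands for "no preference is expressed" by that expression.
    """
    collected = list(results)
    if Preference.DISALLOW in collected:
        return Preference.DISALLOW  # any restriction wins
    if Preference.ALLOW in collected:
        return Preference.ALLOW  # otherwise any allowance applies
    return None  # otherwise no preference is set


combined = combine([Preference.ALLOW, None, Preference.DISALLOW])
```

In this example `combined` is `Preference.DISALLOW`, because a single restricting expression is enough to restrict the usage regardless of what the other expressions say.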
TODO
This document has no IANA actions.
Category | Search Experience if preference set to disallowed | AI Tool Experiences (such as Chat experience) if preference set to disallowed |
---|---|---|
Indexing and Retrieval | Content is not used or linked in response to a user search query. | Content may not be used or linked in response to a user query. E.g., a response in Copilot to the query, “Tell me what the mayor of SF said last night at city hall?” may not retrieve and use a relevant SF Chronicle article to inform the response if this preference is set to disallowed. |
Display Text | When content is shown in response to a user query, only the title (if specified) and URL may be displayed. | Content cannot be used as a direct input to generate an AI experience (such as an AI summary or overview) in response to a user query. When content is shown in response to a user query, only the title (if specified) and URL may be displayed. E.g., a response in Copilot to the query, “Tell me what the mayor of SF said last night at city hall?” may only display the title and URL of an SF Chronicle article if that article is delivered in the response, and the article will not serve as a direct input for grounding, provided the whole document is set to no display. |
Display Text Length | Any display that includes a portion of the content must comply with the specified character limit. | Any display that includes a portion of the content must comply with the specified character limit. E.g., a response in Copilot to the query, “Tell me what the mayor of SF said last night at city hall?” may use an SF Chronicle article for grounding purposes to generate a response, but any passage of the article that is included as part of that response must comply with any established character limit. The response may go beyond the passage from the content and include other statements or information, whether or not they are observations derived from examining the article. |
Exact Text Match | Any display that includes a portion of the content must only present the designated portions of the content. | Any display that includes a portion of the content must only present the designated portions of the content. E.g., a response in Copilot to the query, “Tell me what the mayor of SF said last night at city hall?” may use an SF Chronicle article for grounding purposes to generate a response, but any passage of the article that is included as part of that response must only include characters from the designated portion of the content. |
Generative AI Training | Any text included cannot be used for training of Generative AI models. | Any text included cannot be used for training of Generative AI models. |
--- back
TODO