ai.txt & TDM Policy
robots.txt controls crawling. llms.txt describes your content. But neither answers the question AI companies actually care about: "Are we allowed to use this?"
Part of the AEO scoring framework - the current 48 criteria that measure how ready a website is for AI-driven search across ChatGPT, Claude, Perplexity, and Google AIO.
Quick Answer
ai.txt is an emerging standard (robots.txt, but for licensing) that declares whether AI systems may use your content for training, retrieval, or citation. The TDM Reservation Protocol uses HTTP headers and meta tags to express reuse permissions in machine-readable form. Fewer than 5% of sites have either.
Audit Note
In our audits, we measure ai.txt and TDM policy signals on live sites, compare implementations, and document the gaps that keep scores low.
What is ai.txt and do I need one for my website?
We're checking whether your site publishes machine-readable declarations about how AI systems are allowed to use your content.
How do I declare whether AI systems can use my content for training?
The AI licensing landscape is evolving fast.
What is the TDM Reservation Protocol and how does it affect AI crawlers?
Policy signals are checked at three levels: domain-wide files, HTTP headers, and per-page meta tags.
Before & After
Before - No AI usage policy

```
# robots.txt
User-agent: *
Disallow: /admin/

# No AI-specific rules
# No ai.txt file
# No TDM headers
# AI companies guess your preferences
```
After - Clear ai.txt with TDM headers

```
# /ai.txt
Training: Disallowed
Retrieval: Allowed with Attribution
Citation: Allowed with Link

# HTTP header on content pages (1 = mining rights reserved)
TDM-Reservation: 1
```
What Do ai.txt and TDM Policy Measure?
We're checking whether your site publishes machine-readable declarations about how AI systems are allowed to use your content. This goes beyond robots.txt (crawling access) and licensing (copyright terms) to address the specific question: "Can AI systems use this for training models, for retrieval-augmented generation, and for citation in answers?"
Three distinct policy mechanisms get checked. First, the ai.txt file: an emerging convention (like robots.txt and llms.txt) placed at the domain root to declare per-use-case permissions. The format typically specifies policies for training (may models train on this content?), retrieval (may systems fetch and summarize it in real-time responses?), and citation (may answers quote it with attribution?).
Second, the TDM Reservation Protocol: a W3C-drafted standard implementing Article 4 of the EU DSM Directive, which lets rights holders reserve text and data mining rights. It uses HTTP headers (TDM-Reservation: 1) and HTML meta tags (<meta name="tdm-reservation" content="1">) to declare that automated mining requires explicit permission.
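As a sketch of how a site might emit this signal (assuming an nginx deployment; Apache's mod_headers `Header set` directive is the equivalent), the reservation can be declared server-wide. The policy URL below is a hypothetical placeholder:

```nginx
# Reserve text-and-data-mining rights on every response.
# "1" = rights reserved; "0" = no reservation (mining permitted).
add_header TDM-Reservation "1" always;

# Optionally point machine readers at the full policy document
# (tdm-policy is defined alongside tdm-reservation in the W3C draft).
add_header TDM-Policy "https://example.com/tdm-policy.json" always;
```

Header names are case-insensitive in HTTP, so `TDM-Reservation` and `tdm-reservation` are equivalent; verify directive names against the current TDMRep draft before deploying.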
Third, we check related signals: AI crawler directives in robots.txt, content usage sections in llms.txt, and terms-of-service pages linked from structured data. Together these produce a composite "AI usage policy clarity" score: how clearly your site communicates its content reuse policy.
Primary metric: "AI policy signal presence" (does the site have any machine-readable AI usage policy?). Secondary: "policy completeness" (does it address all three use cases, training, retrieval, and citation, with clear permissions or restrictions?).
Why Is Having No AI Policy the Worst Option?
The AI licensing landscape is evolving fast. OpenAI, Anthropic, Google, and Meta are building systems that attempt to respect publisher content policies. Sites with clear declarations, whether permissive or restrictive, get their preferences honored. Sites without policies are treated according to each company's defaults, which vary between providers and may not match what you want.
For sites wanting maximum AI visibility, a clear permissive policy is strategically valuable. When your ai.txt explicitly states AI systems can retrieve and cite your content with attribution, AI systems checking this policy (and more do each quarter) cite your content more freely. Without this signal, some systems apply conservative defaults limiting how extensively they quote you.
For sites wanting to restrict AI usage, a clear restrictive policy is the only reliable mechanism. robots.txt blocks crawling but doesn't address training or retrieval from cached data. Copyright declarations don't address specific AI use cases. TDM Reservation and ai.txt are the purpose-built tools.
Consistency across mechanisms is critical. An ai.txt saying "retrieval allowed" paired with robots.txt blocking GPTBot sends contradictory signals. AI companies interpret these conflicts differently, leading to unpredictable behavior. We check internal consistency across all signals to ensure your site communicates one clear policy.
How Are ai.txt and TDM Policies Checked?
Policy signals are checked at three levels: domain-wide files, HTTP headers, and per-page meta tags.
At the domain level, we send HEAD and GET requests to conventional ai.txt URLs: /ai.txt, /.well-known/ai.txt, /ai-policy.txt. If found, we parse for structured policy declarations -key-value pairs (Training: Allowed, Retrieval: Allowed with Attribution, Citation: Allowed with Link) and block-based per-company declarations.
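To make the parsing step concrete, here is a minimal sketch of extracting key-value declarations from an ai.txt body. The field names (Training, Retrieval, Citation) follow the emerging convention described above; there is no ratified specification yet, so this is illustrative rather than canonical:

```python
# Sketch: parse key-value declarations from an ai.txt body.
def parse_ai_txt(body: str) -> dict:
    """Return {field: value} for lines like 'Training: Disallowed'."""
    policy = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if ":" in line:
            key, _, value = line.partition(":")
            policy[key.strip().lower()] = value.strip()
    return policy

sample = """
# /ai.txt
Training: Disallowed
Retrieval: Allowed with Attribution
Citation: Allowed with Link
"""
print(parse_ai_txt(sample))
```

A real checker would fetch the conventional URLs first and tolerate per-company block syntax; this only handles the flat key-value form.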
We re-examine robots.txt and llms.txt for AI-specific policy content. robots.txt entries for AI crawlers (GPTBot, ClaudeBot, PerplexityBot) are interpreted as crawling policy. llms.txt sections describing usage permissions are interpreted as retrieval/citation policy.
At the HTTP header level, we check responses from a page sample (homepage, content page, product page) for TDM-related headers: TDM-Reservation, X-Robots-Tag with AI-specific directives, and custom headers used by specific AI companies (X-AI-Usage headers observed on some publisher sites).
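The header interpretation can be sketched as a small function over a response's header map (the fetch itself is omitted; any HTTP client works). The "1"/"0" semantics follow the W3C draft described earlier:

```python
# Sketch: interpret TDM-related response headers.
# HTTP header names are case-insensitive, so normalize before lookup.
def tdm_status(headers: dict) -> str:
    lowered = {k.lower(): v for k, v in headers.items()}
    value = lowered.get("tdm-reservation")
    if value is None:
        return "no TDM signal"
    # Per the draft: "1" reserves mining rights, "0" waives the reservation.
    return "rights reserved" if value.strip() == "1" else "no reservation"

print(tdm_status({"Content-Type": "text/html", "TDM-Reservation": "0"}))
```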
At the per-page level, we parse HTML for TDM meta tags (<meta name="tdm-reservation" content="1">), Creative Commons meta tags addressing derivative works, and custom AI-usage meta tags.
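Extracting the per-page meta tag needs only the standard-library HTML parser; a minimal sketch:

```python
from html.parser import HTMLParser

# Sketch: pull the tdm-reservation meta tag out of a page.
class TDMMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tdm = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if a.get("name", "").lower() == "tdm-reservation":
                self.tdm = a.get("content")

page = '<html><head><meta name="tdm-reservation" content="1"></head></html>'
parser = TDMMetaParser()
parser.feed(page)
print(parser.tdm)
```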
Finally, there is the cross-signal consistency check. We map all detected policy signals (ai.txt, robots.txt, llms.txt, HTTP headers, per-page meta tags) and verify they express consistent permissions. Conflicts (for example, ai.txt allows retrieval but robots.txt blocks AI crawlers) get flagged with specific remediation steps.
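The consistency check reduces to comparing each layer's verdict on the same use case. The sketch below simplifies to a two-value model (allow/deny) for a single use case; the layer names and the structure are illustrative, not the audit's actual internals:

```python
# Sketch: flag contradictions between policy layers for one use case.
def find_conflicts(signals: dict) -> list:
    """signals maps layer name -> 'allow' | 'deny' | None (no signal)."""
    stated = {k: v for k, v in signals.items() if v is not None}
    if len(set(stated.values())) <= 1:
        return []  # every layer that speaks says the same thing
    return [f"{a} says {stated[a]}, {b} says {stated[b]}"
            for a in stated for b in stated
            if a < b and stated[a] != stated[b]]

conflicts = find_conflicts({
    "ai.txt": "allow",     # Retrieval: Allowed
    "robots.txt": "deny",  # GPTBot blocked
    "tdm-header": None,    # no header present
})
print(conflicts)
```

Note that absent signals are not contradictions; only layers that actually state a policy are compared.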
How Is ai.txt and TDM Policy Scored?
AI policy scoring evaluates presence, completeness, and consistency:
1. Policy presence (4 points):
- ai.txt file exists with parseable declarations: 4/4 points
- No ai.txt but clear AI policy in llms.txt or robots.txt AI crawler rules: 3/4 points
- TDM Reservation headers or meta tags without ai.txt: 2/4 points
- Only generic robots.txt with no AI-specific rules: 1/4 points
- No AI-related signals detected: 0/4 points

2. Policy completeness (3 points):
- Addresses all three use cases (training, retrieval, citation): 3/3 points
- Two of three: 2/3 points
- One only: 1/3 points
- Present but vague or unactionable: 0/3 points

3. Cross-signal consistency (3 points):
- All signals (ai.txt, robots.txt, llms.txt, headers, meta tags) are consistent: 3/3 points
- Minor inconsistencies without contradictions: 2/3 points
- One contradiction: 1/3 points
- Multiple contradictions: 0/3 points

Bonus:
- +0.5 points if ai.txt includes per-company policies (different rules for different AI providers)
- +0.5 points if TDM policy links to a human-readable terms page

Deductions:
- -1 point if robots.txt blocks AI crawlers while other signals suggest a permissive policy (direct contradiction)
- -0.5 points if ai.txt exists but isn't parseable (malformed, ambiguous)
- -0.5 points if TDM-Reservation is 1 (opt-out) but no mechanism exists for researchers to request access
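The rubric above can be expressed as a small function. The input shape here is illustrative (the audit derives these sub-scores from the detected signals; this sketch just combines them):

```python
# Sketch of the rubric: presence 0-4, completeness 0-3, consistency 0-3,
# plus bonuses and deductions, clamped to the 0-10 range.
def score_ai_policy(presence, completeness, consistency,
                    per_company=False, human_terms=False,
                    robots_contradiction=False, unparseable=False):
    score = presence + completeness + consistency
    if per_company:
        score += 0.5          # per-provider rules in ai.txt
    if human_terms:
        score += 0.5          # TDM policy links to readable terms
    if robots_contradiction:
        score -= 1            # robots.txt contradicts permissive signals
    if unparseable:
        score -= 0.5          # ai.txt present but malformed
    return max(0.0, min(10.0, score))

# Full marks on all three sub-scores plus a bonus is capped at 10.
print(score_ai_policy(4, 3, 3, human_terms=True))  # 10.0
```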
This is one of the newest criteria. Fewer than 5% of sites have an ai.txt file as of early 2026. Most score 0-2. News publishers engaged with AI policy discussions score 4-7. Sites with multi-signal AI policies score 8-10.
Score Impact in Practice
Sites scoring 8+ on ai.txt and TDM policy have all three layers in place: an ai.txt file at the domain root with explicit training/retrieval/citation declarations, consistent robots.txt directives for AI-specific crawlers, and TDM Reservation headers or meta tags on content pages. These sites have made a deliberate strategic choice about AI content usage and communicated it through every available channel. News publishers and large content organizations with legal teams advising on AI policy tend to score highest.
Sites scoring 0-2 represent the vast majority - over 95% of audited sites. They have a generic robots.txt with no AI-specific user-agent rules, no ai.txt file, no TDM headers, and no machine-readable policy of any kind. AI companies treat these sites according to their own internal defaults, which vary between providers and change over time without notice to the publisher.
The jump from 0 to 5 is one of the fastest scoring improvements in the entire audit. Creating an ai.txt file takes under 10 minutes - it is a plain text file with key-value declarations. Adding AI-specific rules to robots.txt (allowing or disallowing GPTBot, ClaudeBot, PerplexityBot) takes another 5 minutes. These two steps alone address the policy presence and completeness sub-scores, moving a site from "no signals" to "clear signals" with minimal technical effort.
Where Sites Lose Points
The most penalized mistake is contradictory signals across policy layers. A robots.txt that blocks GPTBot and ClaudeBot combined with an ai.txt declaring "Retrieval: Allowed" sends an impossible message - you are simultaneously blocking AI crawlers from accessing your content and telling them they are allowed to retrieve it. AI companies encountering this contradiction may choose either interpretation, leading to unpredictable behavior.
Ambiguous or unstructured ai.txt files that exist but are not machine-parseable score worse than expected. An ai.txt containing a free-form paragraph like "We welcome AI systems to cite our content responsibly" is human-readable but not machine-actionable. AI systems need structured key-value pairs with recognized field names (Training, Retrieval, Citation) and clear values (Allowed, Disallowed, Allowed with Attribution).
Blocking all AI crawlers in robots.txt without a corresponding policy rationale is a common overreaction. Some site owners add blanket Disallow: / rules for every AI user-agent without understanding the tradeoffs. This blocks AI retrieval for citation purposes - meaning your content will not appear in ChatGPT or Perplexity responses even when users ask about topics you cover. If the intent is to block training only, the policy should permit retrieval while restricting training.
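A hedged sketch of the "restrict training, permit retrieval" split: several providers publish separate user agents for training crawls versus retrieval or user-initiated browsing (OpenAI, for example, documents GPTBot, OAI-SearchBot, and ChatGPT-User as distinct agents). Agent names change, so verify each provider's current crawler documentation before deploying:

```
# robots.txt - block training crawlers, allow retrieval/citation agents
# (agent names per provider docs at time of writing; verify before use)

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Disallow: /admin/
```

Pairing this with an ai.txt declaring Training: Disallowed and Retrieval: Allowed keeps the two layers consistent.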
Missing TDM Reservation headers on sites with restrictive policies leave a gap in the EU compliance layer. The W3C TDM Reservation Protocol is specifically designed for the European regulatory context. Sites targeting European audiences with restrictive AI policies should implement TDM headers to ensure compliance-focused AI systems respect their preferences.
How AI Engines Evaluate This
ChatGPT (via GPTBot) checks robots.txt for crawl permissions and is beginning to check ai.txt and similar policy files for usage guidance. When GPTBot is blocked in robots.txt, ChatGPT's browsing feature cannot retrieve content from that domain. When GPTBot is allowed and an ai.txt permits retrieval with attribution, ChatGPT can browse, retrieve, and cite the content in real-time responses. The distinction between training and retrieval permissions is critical - many publishers want citation visibility but not training data contribution.
Perplexity (via PerplexityBot) respects robots.txt directives and evaluates publisher licensing signals when determining citation behavior. Sites with permissive retrieval policies may see more extensive quoting in Perplexity's cited answers. Sites blocking PerplexityBot are excluded from Perplexity's real-time retrieval entirely, losing visibility across all queries where Perplexity would have cited them.
Claude (via ClaudeBot) follows robots.txt directives strictly. Anthropic's approach to publisher preferences is to respect the most restrictive signal available. If robots.txt blocks ClaudeBot, that takes precedence regardless of other signals. If ClaudeBot is permitted and an ai.txt exists, the policy declarations inform how content may be used. This makes signal consistency especially important - conflicting signals default to the most restrictive interpretation.
Google's AI systems evaluate TDM Reservation headers as part of their content rights assessment for AI Overviews and Gemini responses. Sites with clear TDM declarations interact more predictably with Google's content usage policies. The TDM protocol is the most legally grounded mechanism for European publishers and is gaining recognition across all major AI providers as a standardized policy declaration format.
Key Takeaways
- An ai.txt file explicitly declares whether AI systems may train on, retrieve, or cite your content.
- No policy is the worst policy - sites without declarations get treated according to each AI company's unpredictable defaults.
- Keep all signals consistent - ai.txt, robots.txt, llms.txt, and TDM headers should express the same permissions.
- Fewer than 5% of sites have ai.txt as of 2026 - adding one is a quick win for AI policy clarity.
How does your site score on this criterion?
Get a free AEO audit and see where you stand across all 48 criteria.