robots.txt for AI: Rolling Out the Red Carpet (or Slamming the Door)
Most sites run default platform robots.txt with zero AI-specific rules. That's not a strategy - it's an accident. Explicit Allow rules for GPTBot, ClaudeBot, and PerplexityBot signal that your content is open for citation.
Part of the AEO scoring framework - the current 48 criteria that measure how ready a website is for AI-driven search across ChatGPT, Claude, Perplexity, and Google AIO.
Quick Answer
Add explicit Allow rules in your robots.txt for AI crawlers: GPTBot, ClaudeBot, PerplexityBot, and Google-Extended. Without these rules, AI systems don't know whether you want to be cited. Block scrapers you don't want (Bytespider) with Disallow. This takes 5 minutes, and it's one of the 48 criteria we score in every audit.
Audit Note
In our audits, we've measured AI crawler directives in robots.txt on live sites, compared implementations across platforms, and documented the gaps.
Before & After
Before - Default platform robots.txt
```
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml

# No AI-specific rules at all
```
After - Explicit AI crawler policy
```
User-agent: GPTBot
Allow: /
Crawl-delay: 2

User-agent: anthropic-ai
Allow: /
Crawl-delay: 2

User-agent: Bytespider
Disallow: /
```
Which AI Crawlers Are Trying to Access Your Site?
robots.txt is a text file at your domain root (example.com/robots.txt) that tells crawlers what they can access. Simple concept. But the crawler landscape has shifted dramatically - there's now a whole category of AI-specific bots that collect content for training and retrieval.
Here's who's knocking:
- GPTBot - OpenAI (ChatGPT, GPT-based products)
- CCBot - Common Crawl (feeds into many AI training sets)
- Google-Extended - Google (Gemini, AI Overviews)
- PerplexityBot - Perplexity AI
- anthropic-ai - Anthropic (Claude)
- Bytespider - ByteDance (TikTok's AI features)
Your robots.txt can explicitly Allow or Disallow each one. That's granular control over which AI systems can use your content - and which can't.
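You can check how a given policy resolves for each bot with Python's standard-library robots.txt parser. A quick sketch - the policy text and the path being tested are illustrative:

```python
# Check which AI crawlers a sample policy admits, using Python's
# standard-library robots.txt parser. Policy and path are illustrative.
from urllib.robotparser import RobotFileParser

POLICY = """\
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Bytespider
Disallow: /
"""

rp = RobotFileParser()
rp.parse(POLICY.splitlines())

for bot in ["GPTBot", "PerplexityBot", "Bytespider"]:
    verdict = "allowed" if rp.can_fetch(bot, "/blog/post") else "blocked"
    print(f"{bot}: {verdict}")
# GPTBot: allowed
# PerplexityBot: allowed
# Bytespider: blocked
```

The same `can_fetch` call works against a live file if you use `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` instead of `parse`.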
Why Is Having No AI Crawler Policy the Worst Option?
We check robots.txt on every audit. Here's what we find 80% of the time: the default robots.txt from Shopify, WordPress, or whatever platform the site runs on. Zero AI-specific rules. Zero intentionality.
This creates two problems:
First - no signal of AI-friendliness. When you explicitly Allow AI crawlers, you're telling these systems "my content is available and welcome for citation." That's a signal. Default configs send no signal at all.
Second - no control. Without explicit rules, you're leaving it up to each bot's default behavior. Some crawl everything. Some play it safe and skip you. You have no say.
For AEO, the strategic play is clear: Allow AI crawlers on content you want cited (blog posts, product pages, FAQ, knowledge base) and block areas that don't need indexing (admin panels, checkout flows, internal tools).
This criterion carries 2% weight in the Technical Plumbing tier of our scoring model. But 2% of zero effort is still free points. And we've seen sites miss even this.
How Do You Configure robots.txt for AI Crawlers?
Add these rules to your robots.txt:
```
# AI Crawler Policy - Explicitly allow AI systems
User-agent: GPTBot
Allow: /
Crawl-delay: 2

User-agent: CCBot
Allow: /
Crawl-delay: 2

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /
Crawl-delay: 2

User-agent: anthropic-ai
Allow: /
Crawl-delay: 2

User-agent: Bytespider
Disallow: /
```
Shopify: Edit robots.txt.liquid in your theme (Online Store > Themes > Edit code > Templates > robots.txt.liquid).
WordPress: Use Yoast SEO to edit robots.txt, or edit the file directly in your root directory.
Next.js / static sites: Create a public/robots.txt file or generate it dynamically via an API route. Our site generates it statically.
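For static generation, one option is to render the file in a build step. A minimal sketch - the bot list, rule values, and sitemap URL are placeholders, not a description of any particular site's build:

```python
# Sketch: render a robots.txt with explicit AI crawler rules at build time
# (e.g. written to public/robots.txt for a static site).
# All values below are illustrative placeholders.
POLICIES = {
    "GPTBot": ("Allow", "/"),
    "CCBot": ("Allow", "/"),
    "Google-Extended": ("Allow", "/"),
    "PerplexityBot": ("Allow", "/"),
    "anthropic-ai": ("Allow", "/"),
    "Bytespider": ("Disallow", "/"),
}

def render_robots(policies, sitemap_url):
    # One block per user-agent, blank-line separated, Sitemap at the end.
    blocks = ["User-agent: {}\n{}: {}".format(bot, rule, path)
              for bot, (rule, path) in policies.items()]
    return "\n\n".join(blocks) + "\n\nSitemap: {}\n".format(sitemap_url)

robots_txt = render_robots(POLICIES, "https://example.com/sitemap.xml")
print(robots_txt)
```

Writing the returned string to `public/robots.txt` during the build keeps the policy in version control alongside the rest of the site.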
The Crawl-delay: 2 asks bots to wait 2 seconds between requests. It's polite and prevents server load issues, though support varies by crawler - Googlebot, for example, ignores the directive entirely.
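Parsers that honor Crawl-delay read it per user-agent. Python's stdlib parser exposes it, which makes a quick check easy (the policy lines are illustrative):

```python
# Read the per-agent Crawl-delay from a policy with Python's stdlib parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: PerplexityBot",
    "Allow: /",
    "Crawl-delay: 2",
])
print(rp.crawl_delay("PerplexityBot"))  # 2
print(rp.crawl_delay("SomeOtherBot"))   # None (no matching entry)
```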
Start here: Open yoursite.com/robots.txt right now. If you don't see GPTBot or ClaudeBot mentioned - you've got work to do. Five minutes of work.
What robots.txt Mistakes Should You Avoid?
Blocking all AI crawlers. We've seen this - sites that blanket-block every AI bot thinking they're "protecting their content." The result? Complete AI invisibility. Nobody's citing you. Nobody's recommending you. That's not protection - that's disappearance.
Forgetting robots.txt is advisory, not enforced. Well-behaved bots follow it. Malicious scrapers don't. It's not a security measure - it's a communication tool.
Missing the Sitemap reference. Always include Sitemap: https://yoursite.com/sitemap.xml in your robots.txt. It's the roadmap that makes crawling efficient.
Overly broad Disallow rules. Disallow: /blog when you meant to block /blog/drafts - now your entire blog is invisible to AI. Be specific.
Not testing after changes. A typo in robots.txt can accidentally block your entire site. Check the file with Google Search Console's robots.txt report or another validator before deploying changes.
Platform limitations. Shopify's robots.txt is partially platform-managed. Know what you can and can't control on your stack.
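Several of these mistakes can be caught at once with a pre-deploy sanity check. A sketch of what such a check might look like - the candidate policy, bot list, and path list are all illustrative:

```python
# Pre-deploy sanity check for a candidate robots.txt: assert the pages you
# want cited stay open to AI crawlers and a Sitemap reference is present.
# The candidate policy and path list below are illustrative.
from urllib.robotparser import RobotFileParser

CANDIDATE = """\
User-agent: GPTBot
Allow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

MUST_STAY_OPEN = ["/", "/blog/", "/faq/"]
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]

rp = RobotFileParser()
rp.parse(CANDIDATE.splitlines())

for path in MUST_STAY_OPEN:
    for bot in AI_BOTS:
        assert rp.can_fetch(bot, path), f"{bot} blocked from {path}"
assert rp.site_maps(), "missing Sitemap reference"
print("robots.txt sanity check passed")
```

Run as a CI step, this fails the build before an overly broad Disallow or a missing Sitemap line reaches production.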
Score Impact in Practice
The AI Crawler Directives criterion carries 2% weight in the Technical Plumbing tier. Sites with explicit Allow rules for AI crawlers score 7-9/10 on this criterion. Sites with default platform robots.txt files that contain no AI-specific rules score 3-4/10. Sites that actively block AI crawlers score 0/10.
In practice, robots.txt is one of the lowest-effort, lowest-risk criteria to max out. It takes 5 minutes to add the rules, there's virtually no downside, and it signals intentionality to every AI engine that checks. Despite this, roughly 80% of the sites we audit have no AI-specific rules in their robots.txt.
Among Y Combinator startups we've benchmarked, adoption is slightly higher - about 30% have explicit AI crawler rules. But even in this technically sophisticated cohort, the majority run default platform configs. The sites that do include AI directives tend to score higher across all technical criteria, not because robots.txt directly improves other scores, but because attention to AI crawlers correlates with attention to crawlability in general.
How AI Engines Evaluate This
Each AI crawler checks robots.txt before crawling any page on your site. The behavior on finding (or not finding) specific rules varies by engine.
GPTBot (OpenAI) respects robots.txt strictly. If your robots.txt has no GPTBot-specific rule, GPTBot falls back to the general User-agent: * rules. If those rules allow access, GPTBot will crawl - but without an explicit Allow, your permission is only implicit. An explicit User-agent: GPTBot / Allow: / removes that ambiguity and signals that you welcome AI indexing.
ClaudeBot (Anthropic) checks for both anthropic-ai and ClaudeBot user-agent strings. Anthropic has been particularly careful about respecting opt-out signals. If your robots.txt blocks either user-agent string, Claude will not use your content for responses. The flip side: an explicit Allow for anthropic-ai is a positive signal that feeds into Anthropic's source confidence scoring.
PerplexityBot checks robots.txt and also looks for a Crawl-delay directive. Perplexity's crawler is high-frequency because it builds answers in real time, so the Crawl-delay value matters more here than for training-focused crawlers. Setting Crawl-delay: 2 prevents server overload while keeping the door open for citation.
Google-Extended is the user-agent for Google's generative AI features (Gemini, AI Overviews). It's separate from Googlebot (which handles traditional search). You can allow Googlebot for search indexing while blocking Google-Extended for AI training, or vice versa. Most sites benefit from allowing both, but the distinction gives you granular control if you want AI Overviews visibility without contributing to Gemini's training data.
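Because the two user-agents are matched independently, a split policy behaves exactly as you'd expect. A sketch (the path is illustrative):

```python
# Demonstrate that Googlebot and Google-Extended are matched independently:
# this sample policy keeps search indexing open while opting out of
# generative AI use. The path tested is illustrative.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
""".splitlines())

print(rp.can_fetch("Googlebot", "/pricing"))        # True
print(rp.can_fetch("Google-Extended", "/pricing"))  # False
```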
Key Takeaways
- Add explicit Allow rules for GPTBot, ClaudeBot, PerplexityBot, and Google-Extended - without them, AI does not know you want to be cited.
- Block scrapers you do not trust (like Bytespider) with Disallow rules while keeping citation-driving crawlers open.
- Always include a Sitemap reference in your robots.txt so crawlers can navigate your site efficiently.
- Remember robots.txt is advisory, not enforced - it is a communication tool, not a security measure.
How does your site score on this criterion?
Get a free AEO audit and see where you stand across all 48 criteria.