What changed in 2024–2026 with AI crawlers
Until 2023, robots.txt was mostly about telling Google and Bing what not to crawl for search. Since then, a new generation of crawlers has arrived: bots that scrape the web specifically to gather training data for large language models. Among them: GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity), Google-Extended (Gemini), Applebot-Extended (Apple Intelligence), Bytespider (ByteDance), and CCBot (Common Crawl, whose corpus many AI labs train on).
Most respect robots.txt. A few don't (or are rumored not to). Major publishers — NYT, Reuters, BBC, Axel Springer — have explicitly blocked them. Smaller sites are split: some want the AI exposure, others want to protect their content from being trained on without compensation.
The 3 main strategies
- Block all AI crawlers: highest content protection, but you also lose visibility in ChatGPT search, Perplexity citations, Claude Web. Best for paywalled sites and original journalism.
- Block training, allow live retrieval: the nuanced option. `GPTBot` gathers training data; `ChatGPT-User` fetches pages live when a user asks a question. You can allow one and block the other. Hardest to maintain, but it gets you AI exposure without contributing to training.
- Allow all: if your goal is visibility (most marketing sites, blogs), let AI bots in. They'll cite you, drive traffic, and you'll be findable in AI answers.
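A minimal robots.txt sketch of the middle strategy (bot tokens here are the ones OpenAI documents publicly; verify current names before deploying):

```
# Block OpenAI's training crawler...
User-agent: GPTBot
Disallow: /

# ...but allow the live-retrieval agent used when ChatGPT browses for a user.
User-agent: ChatGPT-User
Allow: /
```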
The 25+ AI crawlers we cover
Beyond the famous ones (GPTBot, ClaudeBot, PerplexityBot), there are crawlers from Cohere (cohere-ai), Diffbot (Diffbot), Amazon (Amazonbot), Meta (FacebookBot, Meta-ExternalAgent), TikTok (Bytespider), Yandex (YandexAI), and several research crawlers (CCBot, Omgili, omgilibot). New ones appear monthly. Our list is updated weekly from public documentation.
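With this many bots, it helps that consecutive `User-agent` lines can share a single rule group — a sketch, with an illustrative selection of crawlers:

```
# Consecutive User-agent lines form one group governed by the rules below.
User-agent: CCBot
User-agent: Bytespider
User-agent: Amazonbot
Disallow: /
```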
Three things people get wrong
- Casing matters: `User-agent: GPTBot` works. The spec (RFC 9309) requires case-insensitive matching, so `User-agent: gptbot` also works in most parsers — but use the bot's documented casing to stay safe with stricter ones.
- Path matters: `Disallow: /` blocks everything. `Disallow:` (blank) allows everything. `Disallow: /private/` only blocks that folder.
- It's not law-enforced. Robots.txt is a polite request. Bad-faith crawlers ignore it. For real protection, you need WAF rules, rate limiting, or Cloudflare's bot management.
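The path rules above, spelled out in one file (the bot names and the /private/ folder are illustrative):

```
# Blank Disallow allows everything...
User-agent: ExampleBot
Disallow:

# ...a bare slash blocks everything...
User-agent: OtherBot
Disallow: /

# ...and a path prefix blocks only that folder.
User-agent: ThirdBot
Disallow: /private/
```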
Should you also use llms.txt?
Yes, complement robots.txt with llms.txt — a newer, AI-specific format that tells LLMs what your site is about and how it should be cited. Robots.txt says don't crawl, llms.txt says here's what I am if you do crawl. They're complementary, not competing.
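A minimal llms.txt sketch following the proposed format — an H1 with the site name, a blockquote summary, then linked sections. Every name and URL here is a placeholder:

```
# Example Site

> One-sentence summary of what this site covers and who it's for.

## Docs

- [Getting started](https://example.com/docs/start): setup guide
- [API reference](https://example.com/docs/api): endpoint documentation
```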
FAQ
Will blocking AI crawlers hurt my Google rankings? No. Googlebot is for search, Google-Extended is for Gemini training. Blocking Google-Extended doesn't affect your search ranking.
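The split looks like this in robots.txt — a sketch of blocking Gemini training only:

```
# Block Gemini training. Googlebot is not listed here,
# so search crawling and rankings are unaffected.
User-agent: Google-Extended
Disallow: /
```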
What if I want to allow some content but not others? You can scope rules per bot: `User-agent: GPTBot`, then `Disallow: /paid/` and `Allow: /`. Block paywalled pages, allow public ones.
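As a complete file, that scoping looks like this (the /paid/ prefix is illustrative — adjust to wherever your paywalled content lives):

```
User-agent: GPTBot
# Keep paywalled content out...
Disallow: /paid/
# ...while everything else stays crawlable.
Allow: /
```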
How do I verify it's working? Use Google Search Console's robots.txt tester for Googlebot. For others, check your access logs after deployment — you should see the user-agent strings of bots respecting (or violating) your rules.
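For the log check, a grep along these lines works on most servers. It's shown here against a sample log line so the command is self-contained; in practice, replace the `echo` with something like `cat /var/log/nginx/access.log` (the path depends on your setup):

```shell
# Count hits whose user-agent string matches a known AI crawler.
echo '203.0.113.9 - - [10/May/2025:12:00:00 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"' \
  | grep -cE 'GPTBot|ClaudeBot|PerplexityBot'
# → 1
```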