The AI training problem for newsrooms
In 2023 The New York Times sued OpenAI and Microsoft alleging that millions of Times articles had been used, without consent or compensation, to train ChatGPT. By 2025 nearly every major U.S. publisher had either filed a similar lawsuit or signed a licensing deal — and a clear pattern emerged: publishers with the cleanest, oldest robots.txt opt-outs negotiated the strongest terms.
If your newsroom has not yet codified an AI position in robots.txt, every article you publish is presumed scrapeable. The fix is a 60-second config change. This generator is preset for editorial sites: it allows every search engine, blocks every known training crawler, and keeps live-citation crawlers allowed so your reporting still surfaces in AI answers with a clickable citation.
Which AI crawlers should a news site block?
The 2026 landscape sorts AI crawlers into three buckets. Your robots.txt should treat them differently:
1. Training-only crawlers (block these)
These bots exist to vacuum the web for foundation-model training data. They send no traffic back. Blocking is pure upside:
- GPTBot — OpenAI's training crawler
- ClaudeBot and anthropic-ai — Anthropic training
- Google-Extended — Gemini training (separate from Googlebot search)
- Applebot-Extended — Apple Intelligence training (separate from Applebot)
- Bytespider — ByteDance/TikTok training
- CCBot — Common Crawl, used by most open-source models
- Meta-ExternalAgent — Meta AI training
- cohere-ai, Diffbot, Omgilibot — smaller training crawlers
2. Live-citation crawlers (consider keeping allowed)
These bots fetch your article in real time when a user asks a question, then cite you with a link. They drive measurable referral traffic. Many publishers leave these allowed even when blocking training:
- OAI-SearchBot — ChatGPT live web answers
- Perplexity-User — Perplexity's user-triggered fetcher
- ChatGPT-User — ChatGPT browsing tool
- Bingbot — Microsoft search and Copilot grounding
3. Search engines (always allowed)
Googlebot, Bingbot, DuckDuckBot, Slurp (Yahoo) — never block these. They are your traffic.
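Assembled into one file, the three buckets produce a robots.txt like the generator's editorial preset (the blocklist is the training list above; trim or extend it for your site):

```txt
# 1. Training-only crawlers: blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# 2. Live-citation crawlers: explicitly allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# 3. Search engines and everything else: allowed
User-agent: *
Allow: /
```

The final `User-agent: *` record makes the default explicit: any crawler not named above, including every search engine, is welcome.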
The Google-Extended trap
This is the single most common publisher mistake. Googlebot indexes your site for Search. Google-Extended is a separate product token introduced in late 2023 specifically so publishers could opt out of Gemini training without losing Search visibility; Googlebot still does the actual fetching, and the Google-Extended token only controls whether the content may be used for training. The two are completely independent: disallowing one has no effect on the other. If your only rule is User-agent: * with Disallow: /, you have blocked everything, including Search. If you have no rule at all, Google is free to use your articles to train Gemini. The sweet spot is Disallow: / under User-agent: Google-Extended, with Googlebot left allowed.
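You can sanity-check that the two tokens really are independent with Python's standard-library robots.txt parser; the file contents and article path below are illustrative:

```python
import urllib.robotparser

# Minimal policy: opt out of Gemini training, stay in Search.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Search crawler: still welcome.
print(rp.can_fetch("Googlebot", "/2026/01/big-investigation"))        # True
# Gemini-training token: shut out.
print(rp.can_fetch("Google-Extended", "/2026/01/big-investigation"))  # False
```

Running the same check against your live robots.txt (via `rp.set_url(...)` and `rp.read()`) is a quick regression test after any config change.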
What about llms.txt?
The llms.txt proposal is gaining traction in 2026 as a complement to robots.txt. Where robots.txt is a "stay out" signal, llms.txt is a "here is the curated version of our site that we are happy for you to surface." For newsrooms this is mostly relevant for evergreen reference content (explainers, glossaries, stylebooks) — not for daily reporting. Ship robots.txt first; revisit llms.txt for your reference vertical later.
Belt and suspenders: enforce at the edge
robots.txt is a polite request. Reputable AI labs (OpenAI, Anthropic, Google) honor it. Less reputable scrapers do not. For high-value investigative content, pair the robots.txt block with a CDN-level User-Agent firewall rule. Cloudflare's free tier ships a one-click "Block AI Bots" rule that enforces the same list at the network edge — even bots that ignore robots.txt cannot get past it.
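Cloudflare's managed rule needs no configuration, but on a self-hosted stack the same idea is a short User-Agent match at the web server. Here is a minimal nginx sketch; the regex covers the training crawlers listed above (Google-Extended and Applebot-Extended are robots.txt-only tokens that never appear as a live User-Agent, so they are omitted), and the domain is a placeholder:

```nginx
# In the http context: flag known training-crawler User-Agents.
map $http_user_agent $ai_training_bot {
    default 0;
    "~*(GPTBot|ClaudeBot|anthropic-ai|Bytespider|CCBot|Meta-ExternalAgent|Diffbot|Omgilibot|cohere-ai)" 1;
}

server {
    listen 443 ssl;
    server_name newsroom.example;  # placeholder

    location / {
        if ($ai_training_bot) {
            return 403;  # enforced even when robots.txt is ignored
        }
        # normal content handling continues here
    }
}
```

Extend the regex as new crawlers appear; the robots.txt blocklist and the edge rule should name the same bots.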
Frequently asked questions
Can I block AI training crawlers but still appear in Google search?
Yes. Googlebot (search) and Google-Extended (Gemini training) are controlled by separate robots.txt tokens. Disallow Google-Extended and leave Googlebot allowed; your articles continue to appear in Google Search and Top Stories without being used to train Gemini.
Will blocking GPTBot remove my articles from ChatGPT answers?
Disallowing GPTBot prevents OpenAI from training future models on your content. To also remove your site from ChatGPT's live web browsing and citations, additionally disallow OAI-SearchBot and ChatGPT-User. Many publishers block GPTBot but allow OAI-SearchBot so ChatGPT can still cite their reporting.
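In robots.txt terms, that split looks like this (the Allow records are technically optional, since anything not disallowed is allowed by default, but they make the policy explicit):

```txt
# Block model training
User-agent: GPTBot
Disallow: /

# Keep live answers and citations
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
```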
What is the difference between robots.txt and llms.txt for newsrooms?
robots.txt is the enforceable opt-out signal — major AI labs honor it. llms.txt is an emerging proposal for a curated, machine-readable summary of your site for LLMs that do retrieve content. Newsrooms need a strict robots.txt first.
Does Bytespider really respect robots.txt?
ByteDance's Bytespider had a poor compliance reputation historically, but as of 2024 it does honor robots.txt disallow directives. For belt-and-suspenders enforcement, pair the robots.txt block with a server-side User-Agent firewall rule at the CDN.
Should we block AI crawlers if we have a licensing deal with OpenAI or Google?
If you license content to a specific lab, leave their crawler allowed and block the others. The generator lets you toggle each user agent individually so a publisher with an OpenAI deal can allow GPTBot while still blocking ClaudeBot, PerplexityBot and Google-Extended.