📰 Built for newsrooms · 25+ AI crawlers · Free

AI Robots.txt Generator for News Publishers

Stop AI labs from training models on your investigative reporting — without sacrificing a single click of search traffic. Block GPTBot, ClaudeBot, Google-Extended and Bytespider in 30 seconds.

What a publisher robots.txt looks like

# robots.txt for a news publisher
# Goal: block AI model training, keep all search engines

# --- Search engines: fully allowed ---
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# --- AI training crawlers: blocked ---
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

# --- Live citation crawlers: keep allowed for traffic ---
User-agent: OAI-SearchBot
Allow: /

User-agent: Perplexity-User
Allow: /

Sitemap: https://yournews.com/sitemap.xml

Why news publishers need this

📜

Protect investigative work

Months of original reporting can be ingested into a foundation model in seconds. A targeted robots.txt is the first legal signal that your archive is not training data.

🔍

Keep Google search traffic

Block Google-Extended (Gemini training) without touching Googlebot. Your stories still rank in Top Stories and the news carousel.

💼

Negotiate from a stronger position

Publishers with a clean opt-out have legal leverage when negotiating licensing deals with OpenAI, Google, Anthropic, and Microsoft.

📰

Stay cited, not consumed

Block training crawlers but keep OAI-SearchBot and Perplexity-User allowed, so readers still discover your scoop through AI answers, with attribution.

🛡️

Comply with industry guidance

Aligns with the News Media Alliance's recommended opt-out posture and complements machine-readable rights signals such as the IPTC Data Mining property and the NoAI / NoImageAI meta tags.

The AI training problem for newsrooms

In 2023 The New York Times sued OpenAI and Microsoft, alleging that millions of Times articles had been used, without consent or compensation, to train ChatGPT. By 2025 nearly every major U.S. publisher had either filed a similar lawsuit or signed a licensing deal, and a pattern emerged: publishers with the cleanest, longest-standing robots.txt opt-outs negotiated the strongest terms.

If your newsroom has not yet codified an AI position in robots.txt, every article you publish is presumed scrapeable. The fix is a 60-second config change. This generator is preset for editorial sites: it allows every search engine, blocks every known training crawler, and keeps live-citation crawlers allowed so your reporting still surfaces in AI answers with a clickable citation.

Which AI crawlers should a news site block?

The 2026 landscape sorts AI crawlers into three buckets. Your robots.txt should treat them differently:

1. Training-only crawlers (block these)

These bots exist to vacuum the web for foundation-model training data and send no traffic back, so blocking them is pure upside. In the sample above that group is GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Bytespider, PerplexityBot, and CCBot.

2. Live-citation crawlers (consider keeping allowed)

These bots fetch your article in real time when a user asks a question, then cite you with a link. They drive measurable referral traffic. Many publishers leave these allowed even when blocking training: OAI-SearchBot, ChatGPT-User, and Perplexity-User.

3. Search engines (always allowed)

Googlebot, Bingbot, DuckDuckBot, Slurp (Yahoo) — never block these. They are your traffic.

The Google-Extended trap

This is the single most common publisher mistake. Googlebot indexes your site for Search. Google-Extended is a separate user agent introduced in late 2023 specifically so publishers could opt out of Gemini training without losing Search visibility. They are completely independent — disallowing one has no effect on the other. If you only have a generic User-agent: * Disallow: / rule, you have blocked everything including Search. If you have nothing at all, Google is using your articles to train Gemini. The sweet spot is Google-Extended: Disallow with Googlebot: Allow.
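In directive form, the trap and the sweet spot look like this (same syntax as the sample above):

# The trap: a blanket rule blocks Search along with everything else
# User-agent: *
# Disallow: /

# The sweet spot: Search stays, Gemini training is opted out
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /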

What about llms.txt?

The llms.txt proposal is gaining traction in 2026 as a complement to robots.txt. Where robots.txt is a "stay out" signal, llms.txt is a "here is the curated version of our site that we are happy for you to surface." For newsrooms this is mostly relevant for evergreen reference content (explainers, glossaries, stylebooks) — not for daily reporting. Ship robots.txt first; revisit llms.txt for your reference vertical later.
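If you do get to that point, the proposed format is a plain Markdown file served at /llms.txt: an H1 site name, a one-line blockquote summary, then H2 sections of annotated links. A minimal sketch for a hypothetical reference vertical (the section names and URLs are placeholders, not recommendations):

# YourNews
> Evergreen reference content from the YourNews newsroom: explainers, glossaries and stylebook entries.

## Explainers
- [How city budgets actually work](https://yournews.com/explainers/city-budgets): evergreen civic explainer

## Glossary
- [Newsroom glossary](https://yournews.com/glossary): plain-language definitions of terms used in our reporting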

Belt and suspenders: enforce at the edge

robots.txt is a polite request. Reputable AI labs (OpenAI, Anthropic, Google) honor it; less reputable scrapers do not. For high-value investigative content, pair the robots.txt block with a CDN-level User-Agent firewall rule. Cloudflare's free tier includes a one-click "Block AI Bots" rule that enforces the same list at the network edge, so bots that ignore robots.txt are stopped before they reach your origin.
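If you prefer to manage the rule yourself rather than use the one-click toggle, here is a sketch of an equivalent custom WAF rule expression (with the action set to Block). Note that Google-Extended and Applebot-Extended are robots.txt-only control tokens with no crawler of their own, so they cannot be matched by user agent at the edge and remain robots.txt-only opt-outs:

(http.user_agent contains "GPTBot")
or (http.user_agent contains "ClaudeBot")
or (http.user_agent contains "CCBot")
or (http.user_agent contains "Bytespider")
or (http.user_agent contains "PerplexityBot")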

Frequently asked questions

Can I block AI training crawlers but still appear in Google search?

Yes. Googlebot (search) and Google-Extended (Gemini training) are separate user agents. Disallow Google-Extended and leave Googlebot allowed — your articles continue to appear in Google Search and Top Stories without being used to train Gemini.

Will blocking GPTBot remove my articles from ChatGPT answers?

Disallowing GPTBot prevents OpenAI from training future models on your content. To also remove your site from ChatGPT's live web browsing and citations, additionally disallow OAI-SearchBot and ChatGPT-User. Many publishers block GPTBot but allow OAI-SearchBot so ChatGPT can still cite their reporting.
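In robots.txt terms, that split looks like this:

# Block OpenAI model training, keep ChatGPT citations
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /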

What is the difference between robots.txt and llms.txt for newsrooms?

robots.txt is the established opt-out signal that major AI labs honor. llms.txt is an emerging proposal for a curated, machine-readable summary of your site aimed at LLMs that do retrieve content. Newsrooms need a strict robots.txt first.

Does Bytespider really respect robots.txt?

ByteDance's Bytespider has historically had one of the weaker compliance reputations among AI crawlers, and its adherence to disallow directives is difficult to verify from the outside. Publish the block in robots.txt regardless: it is the formal opt-out signal. For belt-and-suspenders enforcement, pair it with a server-side User-Agent firewall rule at the CDN.

Should we block AI crawlers if we have a licensing deal with OpenAI or Google?

If you license content to a specific lab, leave their crawler allowed and block the others. The generator lets you toggle each user agent individually so a publisher with an OpenAI deal can allow GPTBot while still blocking ClaudeBot, PerplexityBot and Google-Extended.
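For example, a publisher with an OpenAI licensing deal might ship:

# Licensed partner: OpenAI allowed, other AI training crawlers blocked
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /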

Generate your newsroom's robots.txt now

Free, no signup. Pick your blocks, copy the file, drop it at the root of your domain. Done in under a minute.

Open the AI Robots.txt Generator →