The AI training problem for indie bloggers
Between 2022 and 2025, virtually every major foundation model was trained partly on Common Crawl — a public dump of the open web that includes hundreds of millions of blog posts. The labs got the models; bloggers got nothing. Search-engine click-throughs to articles dropped by 15% to 35% over the same window, as Google's AI Overviews and ChatGPT began answering readers' questions directly from your content.
By 2026, the answer for indie writers has settled: opt out of training, opt in to citation. Block the bots that harvest your work to train the next model; keep allowing the bots that fetch your post to answer a live question, with a link back to you. Done well, the policy looks like the example above and takes one minute to deploy.
The four crawler categories every blogger should know
1. Training-only crawlers — block these
These bots exist only to harvest data for foundation-model training. They send no traffic back to your blog. Blocking them is pure upside:
- `GPTBot` — OpenAI's training crawler
- `ClaudeBot` and `anthropic-ai` — Anthropic training
- `Google-Extended` — Gemini training (totally separate from Googlebot search)
- `Applebot-Extended` — Apple Intelligence training
- `Bytespider` — TikTok/ByteDance training
- `CCBot` — Common Crawl, the training-data backbone of most open-source LLMs
- `Meta-ExternalAgent`, `cohere-ai`, `Diffbot`, `Omgilibot` — smaller training crawlers
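Each name translates to the same two-line robots.txt stanza. A minimal excerpt covering four of the crawlers above (the full example earlier in the post lists them all):

```
# Training-only crawlers: no traffic comes back, so block sitewide
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /
```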
2. Live-citation crawlers — usually keep allowed
These bots fetch a single page on demand when a user asks an AI a question, and the AI cites your post with a clickable link. They are how AI search drives traffic to indie blogs:
- `OAI-SearchBot` — ChatGPT search and live answers
- `ChatGPT-User` — when a ChatGPT user clicks "browse the web"
- `PerplexityBot` and `Perplexity-User` — Perplexity's answer engine
- `Bingbot` — Microsoft search and Copilot grounding (also gets you into search results)
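Because crawlers follow the most specific matching user-agent group, giving these bots their own explicit `Allow` stanzas keeps any catch-all rule from accidentally shutting them out. A minimal sketch:

```
# Live-citation crawlers: they fetch one page and link back
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
```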
3. Search engines — always allowed
Googlebot, Bingbot, DuckDuckBot, Slurp (Yahoo), YandexBot — these are your traffic. Never block them.
4. Generic spam bots — out of scope here
Scrapers that ignore robots.txt entirely (content thieves, SEO data brokers) are not stopped by any robots.txt. For those you need server-side protection — a Cloudflare WAF rule or a plugin like Wordfence on WordPress.
The Google-Extended trick every blogger gets wrong
This is the single biggest mistake in blogger robots.txt files. `Googlebot` is the search crawler that decides whether your post ranks for "best espresso machine 2026". `Google-Extended` is a separate user agent Google introduced in late 2023 specifically so publishers could opt out of Gemini training without losing Search visibility. The two are completely independent.
If you set `User-agent: *` followed by `Disallow: /`, you have just deindexed your blog from Google entirely. If you set nothing at all, Google is using your posts to train Gemini for free. The right answer is the one in the example above: allow `Googlebot`, disallow `Google-Extended`. The same logic applies to Apple: keep `Applebot` allowed, block `Applebot-Extended`.
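In robots.txt form, the split is one allow/disallow pair per company. A minimal sketch of just this slice of the policy:

```
# Google Search stays in, Gemini training stays out
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

# Same split for Apple
User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Disallow: /
```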
How to deploy on the most common blogging platforms
WordPress
WordPress generates a virtual robots.txt at /robots.txt by default, but it's not what you want. Two ways to override it:
- Plugin route: Install Yoast SEO or Rank Math. Both have a robots.txt editor under SEO → Tools. Paste the generated file, save.
- Static file route: Upload the file as `/robots.txt` in your site root via FTP, cPanel File Manager, or your host's dashboard. The static file always overrides the WordPress virtual one.
Ghost (self-hosted)
Drop the file in the root folder of your active theme (Ghost serves it at /robots.txt) or use a routes.yaml override that maps /robots.txt to your file. Ghost(Pro) doesn't allow this — you're stuck with Ghost's defaults, which have blocked GPTBot since 2024 but not the full list.
Static sites (Jekyll, Hugo, Eleventy, Astro, Next.js)
Drop robots.txt in the folder your generator copies verbatim into the build output: public/ for Astro and Next.js, static/ for Hugo, the project root for Jekyll, a passthrough copy for Eleventy. Commit, deploy, done. Vercel, Netlify, and Cloudflare Pages then serve it at /robots.txt automatically.
Substack, Medium, and other hosted platforms
You don't control robots.txt on substack.com or medium.com. Substack added a platform-wide AI opt-out toggle in 2024 — turn it on in your publication's settings. Medium has a similar opt-out in account settings. If full control matters to you, the long-term fix is moving to your own domain with WordPress, self-hosted Ghost, or a static site.
Belt and suspenders: enforce at the edge
robots.txt is a polite request. Major AI labs (OpenAI, Anthropic, Google, Apple, Perplexity, Microsoft) have publicly committed to honoring it for their crawlers. A small fringe of scrapers ignores it. For belt-and-suspenders enforcement, Cloudflare's free tier ships a one-click "Block AI Bots" rule that enforces the same list at the network edge — even bots that ignore robots.txt cannot get past it. Layer the two and you've covered both the polite and the impolite paths.
Frequently asked questions
Will blocking AI crawlers hurt my Google or Bing rankings?
No. Googlebot (search) and Google-Extended (Gemini training) are completely separate user agents. The same is true for Applebot vs Applebot-Extended. You can block every AI training crawler in your robots.txt and your blog will rank in Google and Bing exactly as before.
Should I block AI crawlers or let them in for traffic?
The mainstream indie answer in 2026 is: block training crawlers (GPTBot, ClaudeBot, Google-Extended, Bytespider) but keep live-citation crawlers (OAI-SearchBot, PerplexityBot, ChatGPT-User) allowed. You stop free training, you keep the AI-search traffic.
How do I add robots.txt on WordPress?
Two options: install Yoast SEO or Rank Math (both have a robots.txt editor in the dashboard) or upload a static robots.txt to your site root via FTP or your host's file manager. The static file overrides WordPress's default virtual one.
Can I add robots.txt on Substack, Medium or Ghost(Pro)?
On self-hosted Ghost: yes. On Substack, Medium, and Ghost(Pro): no — you don't control the platform's robots.txt. Use the platform's built-in AI opt-out toggle (Substack and Medium added one in 2024). For full control, host on your own domain.
Will AI crawlers actually respect my robots.txt?
OpenAI, Anthropic, Google, Apple, Perplexity and Microsoft publicly commit to honoring it, and CCBot does too; Bytespider's compliance record is spottier, which is one more reason to back the file with an edge rule. A small number of fringe scrapers ignore it outright — for those, pair the file with Cloudflare's free "Block AI Bots" edge rule. For 95% of the training-data flow, robots.txt is the working opt-out.