The AI training problem for indie bloggers
Between 2022 and 2025, virtually every major foundation model was trained partly on Common Crawl — a public dump of the open web that includes hundreds of millions of blog posts. The labs got the models; bloggers got nothing. Search-engine click-throughs to articles dropped by 15% to 35% over the same window, as Google's AI Overviews and ChatGPT began answering readers' questions directly from your content.
By 2026, the answer for indie writers has settled: opt out of training, opt in to citation. Block the bots that harvest your work to train the next model; keep allowing the bots that fetch your post to answer a live question, with a link back to you. Done well, the policy looks like the example above and takes one minute to deploy.
The four crawler categories every blogger should know
1. Training-only crawlers — block these
These bots exist only to harvest data for foundation-model training. They send no traffic back to your blog. Blocking them is pure upside:
- `GPTBot` — OpenAI's training crawler
- `ClaudeBot` and `anthropic-ai` — Anthropic training
- `Google-Extended` — Gemini training (totally separate from Googlebot search)
- `Applebot-Extended` — Apple Intelligence training
- `Bytespider` — TikTok/ByteDance training
- `CCBot` — Common Crawl, the training-data backbone of most open-source LLMs
- `Meta-ExternalAgent`, `cohere-ai`, `Diffbot`, `Omgilibot` — smaller training crawlers
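Each name translates to the same two-line robots.txt stanza. A minimal excerpt covering four of the crawlers above (the full example earlier in the post lists them all):

```
# Training-only crawlers: no traffic comes back, so block sitewide
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /
```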
2. Live-citation crawlers — usually keep allowed
These bots fetch a single page on demand when a user asks an AI a question, and the AI cites your post with a clickable link. They are how AI search drives traffic to indie blogs:
- `OAI-SearchBot` — ChatGPT search and live answers
- `ChatGPT-User` — when a ChatGPT user clicks "browse the web"
- `PerplexityBot` and `Perplexity-User` — Perplexity's answer engine
- `Bingbot` — Microsoft search and Copilot grounding (also gets you into search results)
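Because crawlers follow the most specific matching user-agent group, giving these bots their own explicit `Allow` stanzas keeps any catch-all rule from accidentally shutting them out. A minimal sketch:

```
# Live-citation crawlers: they fetch one page and link back
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
```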
3. Search engines — always allowed
Googlebot, Bingbot, DuckDuckBot, Slurp (Yahoo), YandexBot — these are your traffic. Never block them.
4. Generic spam bots — out of scope here
Scrapers that ignore robots.txt entirely (content thieves, SEO data brokers) are not stopped by any robots.txt. For those you need server-side protection — a Cloudflare WAF rule or a plugin like Wordfence on WordPress.
The Google-Extended trick every blogger gets wrong
This is the single biggest mistake in blogger robots.txt files. `Googlebot` is the search crawler that decides whether your post ranks for "best espresso machine 2026". `Google-Extended` is a separate user agent Google introduced in late 2023 specifically so publishers could opt out of Gemini training without losing Search visibility. The two are completely independent.
If you set `User-agent: *` followed by `Disallow: /`, you have just deindexed your blog from Google entirely. If you set nothing at all, Google is using your posts to train Gemini for free. The right answer is the one in the example above: allow `Googlebot`, disallow `Google-Extended`. The same logic applies to Apple: keep `Applebot` allowed, block `Applebot-Extended`.
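In robots.txt form, the split is one allow/disallow pair per company. A minimal sketch of just this slice of the policy:

```
# Google Search stays in, Gemini training stays out
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

# Same split for Apple
User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Disallow: /
```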
How to deploy on the most common blogging platforms
WordPress
WordPress generates a virtual robots.txt at /robots.txt by default, but it's not what you want. Two ways to override it:
- Plugin route: Install Yoast SEO or Rank Math. Both have a robots.txt editor under SEO → Tools. Paste the generated file, save.
- Static file route: Upload the file as `/robots.txt` in your site root via FTP, cPanel File Manager, or your host's dashboard. The static file always overrides the WordPress virtual one.
Ghost (self-hosted)
Drop the file in the root folder of your active theme (Ghost serves it at /robots.txt) or use a routes.yaml override that maps /robots.txt to your file. Ghost(Pro) doesn't allow this — you're stuck with Ghost's defaults, which have blocked GPTBot since 2024 but not the full list.
Static sites (Jekyll, Hugo, Eleventy, Astro, Next.js)
Drop robots.txt in the folder your generator copies verbatim into the build output: public/ for Astro and Next.js, static/ for Hugo, the project root for Jekyll, a passthrough copy for Eleventy. Commit, deploy, done. Vercel, Netlify, and Cloudflare Pages then serve it at /robots.txt automatically.
Substack, Medium, and other hosted platforms
You don't control robots.txt on substack.com or medium.com. Substack added a platform-wide AI opt-out toggle in 2024 — turn it on in your publication's settings. Medium has a similar opt-out in account settings. If full control matters to you, the long-term fix is moving to your own domain with WordPress, self-hosted Ghost, or a static site.
Belt and suspenders: enforce at the edge
robots.txt is a polite request. Major AI labs (OpenAI, Anthropic, Google, Apple, Perplexity, Microsoft) have publicly committed to honoring it for their crawlers. A small fringe of scrapers ignores it. For belt-and-suspenders enforcement, Cloudflare's free tier ships a one-click "Block AI Bots" rule that enforces the same list at the network edge — even bots that ignore robots.txt cannot get past it. Layer the two and you've covered both the polite and the impolite paths.
Frequently asked questions
Will blocking AI crawlers hurt my Google or Bing rankings?
No. Googlebot (search) and Google-Extended (Gemini training) are completely separate user agents. The same is true for Applebot vs Applebot-Extended. You can block every AI training crawler in your robots.txt and your blog will rank in Google and Bing exactly as before.
Should I block AI crawlers or let them in for traffic?
The mainstream indie answer in 2026 is: block training crawlers (GPTBot, ClaudeBot, Google-Extended, Bytespider) but keep live-citation crawlers (OAI-SearchBot, PerplexityBot, ChatGPT-User) allowed. You stop free training, you keep the AI-search traffic.
How do I add robots.txt on WordPress?
Two options: install Yoast SEO or Rank Math (both have a robots.txt editor in the dashboard) or upload a static robots.txt to your site root via FTP or your host's file manager. The static file overrides WordPress's default virtual one.
Can I add robots.txt on Substack, Medium or Ghost(Pro)?
On self-hosted Ghost: yes. On Substack, Medium, and Ghost(Pro): no — you don't control the platform's robots.txt. Use the platform's built-in AI opt-out toggle (Substack and Medium added one in 2024). For full control, host on your own domain.
Will AI crawlers actually respect my robots.txt?
OpenAI, Anthropic, Google, Apple, Perplexity and Microsoft publicly commit to honoring it, and CCBot does too; Bytespider's compliance record is spottier, which is one more reason to back the file with an edge rule. A small number of fringe scrapers ignore it outright — for those, pair the file with Cloudflare's free "Block AI Bots" edge rule. For 95% of the training-data flow, robots.txt is the working opt-out.