✍️ Built for indie bloggers · 25+ AI crawlers · Free

AI Robots.txt Generator for Bloggers

Stop OpenAI, Anthropic, Google and ByteDance from training their next model on your posts — without losing a single visitor from Google, Bing, or AI search citations. Built for WordPress, Ghost, and static blogs.

What a blogger's robots.txt looks like

# robots.txt for an indie blog
# Goal: block AI training, keep search + AI citation traffic

# --- Search engines: fully allowed ---
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: DuckDuckBot
Allow: /

# --- AI training crawlers: blocked ---
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# --- Live-citation crawlers: keep allowed for traffic ---
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

Sitemap: https://yourblog.com/sitemap.xml
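Before uploading, you can sanity-check a policy like the one above with Python's standard-library robots.txt parser. A minimal sketch (the blog URL is the same placeholder used in the example; only three of the groups are reproduced for brevity):

```python
from urllib.robotparser import RobotFileParser

# A trimmed version of the policy above, as a string.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

post = "https://yourblog.com/2026/01/some-post/"
print(rp.can_fetch("Googlebot", post))       # search crawler: True (allowed)
print(rp.can_fetch("GPTBot", post))          # AI training: False (blocked)
print(rp.can_fetch("OAI-SearchBot", post))   # live citation: True (allowed)
```

If any of the three prints the wrong value, the file has a typo in a User-agent or Disallow line.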

Why bloggers need this in 2026

📝

Stop free training on your archive

Five years of Saturday-morning writing can be ingested by a single training run. A targeted robots.txt is the standard opt-out signal that every major AI lab has publicly agreed to honor.

🔍

Don't lose Google traffic

Block Google-Extended (Gemini training) without touching Googlebot. Your posts keep ranking in Search, and AI Overviews still cite you with a link.

💸

Protect AdSense and affiliate revenue

If a reader gets the full answer from ChatGPT trained on your post, they never visit. Blocking training preserves the click that pays your hosting bill.

🤖

Stay quoted, not consumed

Allow OAI-SearchBot and PerplexityBot so AI answer engines still cite you with attribution — you get traffic, they don't get training data.

Set it once, ship in 30 seconds

Generate, copy, paste into your WordPress or Ghost robots.txt editor. A one-time config change that locks in your AI policy across every post you'll ever publish.

The AI training problem for indie bloggers

Between 2022 and 2025, virtually every major foundation model was trained partly on Common Crawl — a public dump of the open web that includes hundreds of millions of blog posts. Bloggers got nothing. The labs got the models. By most industry estimates, search-engine click-throughs to articles dropped by 15–35% in the same window, because Google's AI Overviews and ChatGPT started answering the question directly from your content.

Two years later, the answer for indie writers has settled: opt out of training, opt in to citation. Block the bots that take your work to train the next model. Keep allowed the bots that fetch your post to answer a live question, with a link back to you. Done well, the policy looks like the example above and takes one minute to deploy.

The four crawler categories every blogger should know

1. Training-only crawlers — block these

These bots exist only to harvest data for foundation-model training. They send no traffic back to your blog, so blocking them is pure upside: GPTBot, ClaudeBot, anthropic-ai, Google-Extended, Applebot-Extended, Bytespider, CCBot, and Meta-ExternalAgent.

2. Live-citation crawlers — usually keep allowed

These bots fetch a single page on demand when a user asks an AI a question, and the AI cites your post with a clickable link. They are how AI search drives traffic to indie blogs: OAI-SearchBot, ChatGPT-User, and PerplexityBot.

3. Search engines — always allowed

Googlebot, Bingbot, DuckDuckBot, Slurp (Yahoo), YandexBot — these are your traffic. Never block them.

4. Generic spam bots — out of scope here

Scrapers that disregard crawler etiquette entirely (content thieves, SEO data brokers) cannot be stopped by any robots.txt. For those you need server-side protection — a Cloudflare WAF rule or a plugin like Wordfence on WordPress.

The Google-Extended trick every blogger gets wrong

This is the single biggest mistake in blogger robots.txt files. Googlebot is the search crawler that decides whether your post ranks for "best espresso machine 2026". Google-Extended is a separate user agent Google introduced in late 2023 specifically so publishers could opt out of Gemini training without losing Search visibility. The two are completely independent.

If you set User-agent: * Disallow: /, you have just deindexed your blog from Google entirely. If you set nothing at all, Google is using your posts to train Gemini for free. The right answer is the one in the example above: Googlebot: Allow and Google-Extended: Disallow. The same logic applies to Apple: keep Applebot allowed, block Applebot-Extended.
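The difference is easy to verify mechanically. A small sketch using Python's stdlib robots.txt parser, comparing the catastrophic wildcard block against the correct Googlebot/Google-Extended split (the URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, agent: str) -> bool:
    """True if `agent` may fetch a page under this robots.txt."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, "https://yourblog.com/any-post/")

WRONG = "User-agent: *\nDisallow: /\n"  # blocks search crawlers too!
RIGHT = ("User-agent: Googlebot\nAllow: /\n\n"
         "User-agent: Google-Extended\nDisallow: /\n")

print(allowed(WRONG, "Googlebot"))         # False -- deindexed from Search
print(allowed(RIGHT, "Googlebot"))         # True  -- still ranking
print(allowed(RIGHT, "Google-Extended"))   # False -- no Gemini training
```

The wildcard rule applies to Googlebot just like any other agent, which is exactly why the per-agent split matters.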

How to deploy on the most common blogging platforms

WordPress

WordPress generates a virtual robots.txt at /robots.txt by default, but it's not what you want. Two ways to override it: install an SEO plugin with a robots.txt editor (Yoast SEO or Rank Math both have one in the dashboard), or upload a static robots.txt to your site root via FTP or your host's file manager — the static file takes precedence over the virtual one.

Ghost (self-hosted)

Drop a robots.txt in the root of your active theme, or use a routes.yaml override that maps /robots.txt to your file. Ghost(Pro) doesn't allow this — you're stuck with Ghost's defaults, which have blocked GPTBot since 2024 but not the full list.

Static sites (Jekyll, Hugo, Eleventy, Astro, Next.js)

Drop robots.txt in your static output folder (public/, _site/, dist/, depending on the generator). Commit, deploy, done. Vercel, Netlify and Cloudflare Pages serve it automatically from the project root.
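After deploying on any of these platforms, it's worth confirming the live file actually contains the rules you committed. A small sketch of such a check — the required-agents list and helper are illustrative, and the domain in the comment is a placeholder:

```python
REQUIRED_BLOCKS = ("GPTBot", "Google-Extended", "Bytespider")

def missing_blocks(robots_txt: str, agents=REQUIRED_BLOCKS):
    """Return the agents that have no User-agent line in the file."""
    text = robots_txt.lower()
    return [a for a in agents if f"user-agent: {a.lower()}" not in text]

# Live check against your deployed site (placeholder domain):
#   from urllib.request import urlopen
#   text = urlopen("https://yourblog.com/robots.txt").read().decode()
#   print(missing_blocks(text))

# Offline demo on a file that forgot Bytespider:
sample = ("User-agent: GPTBot\nDisallow: /\n\n"
          "User-agent: Google-Extended\nDisallow: /\n")
print(missing_blocks(sample))   # ['Bytespider']
```

An empty list means every training crawler you intended to block has a matching group in the served file.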

Substack, Medium, and other hosted platforms

You don't control robots.txt on substack.com or medium.com. Substack added a platform-wide AI opt-out toggle in 2024 — turn it on in your publication's settings. Medium has a similar opt-out in account settings. If full control matters to you, the long-term fix is moving to your own domain with WordPress, Ghost self-hosted, or a static site.

Belt and suspenders: enforce at the edge

robots.txt is a polite request. Major AI labs (OpenAI, Anthropic, Google, Apple, Perplexity, Microsoft) have publicly committed to honoring it for their crawlers. A small fringe of scrapers ignores it. For belt-and-suspenders enforcement, Cloudflare's free tier ships a one-click "Block AI Bots" rule that enforces the same list at the network edge — even bots that ignore robots.txt cannot get past it. Layer the two and you've covered both the polite and the impolite paths.

Frequently asked questions

Will blocking AI crawlers hurt my Google or Bing rankings?

No. Googlebot (search) and Google-Extended (Gemini training) are completely separate user agents. The same is true for Applebot vs Applebot-Extended. You can block every AI training crawler in your robots.txt and your blog will rank in Google and Bing exactly as before.

Should I block AI crawlers or let them in for traffic?

The mainstream indie answer in 2026 is: block training crawlers (GPTBot, ClaudeBot, Google-Extended, Bytespider) but keep live-citation crawlers (OAI-SearchBot, PerplexityBot, ChatGPT-User) allowed. You stop free training, you keep the AI-search traffic.

How do I add robots.txt on WordPress?

Two options: install Yoast SEO or Rank Math (both have a robots.txt editor in the dashboard) or upload a static robots.txt to your site root via FTP or your host's file manager. The static file overrides WordPress's default virtual one.

Can I add robots.txt on Substack, Medium or Ghost(Pro)?

On self-hosted Ghost: yes. On Substack, Medium, and Ghost(Pro): no — you don't control the platform's robots.txt. Use the platform's built-in AI opt-out toggle (Substack and Medium added one in 2024). For full control, host on your own domain.

Will AI crawlers actually respect my robots.txt?

OpenAI, Anthropic, Google, Apple, Perplexity and Microsoft publicly commit to honoring it, and CCBot (Common Crawl) does as well. Bytespider's compliance record is patchier, and a small number of fringe scrapers ignore the file outright — for those, pair it with Cloudflare's free "Block AI Bots" edge rule. For the vast majority of the training-data flow, robots.txt is the working opt-out.

Generate your blog's robots.txt in 30 seconds

Free, no signup. Pick which bots to block, copy the file, paste into Yoast or upload to your site root. One config change, lifetime AI policy.

Open the AI Robots.txt Generator →