🚀 Built for B2B & indie SaaS · 25+ AI crawlers · Free

AI Robots.txt Generator for SaaS Founders

Stop OpenAI, Anthropic, Google and ByteDance from training their next model on your marketing site, product docs, changelog and pricing pages — without losing a single signup from Google, Bing, ChatGPT search or Perplexity. Built for Next.js, Astro, Vercel, Netlify and Cloudflare Pages.

What a SaaS robots.txt looks like

# robots.txt for a B2B SaaS marketing site
# Goal: block AI training, keep search + AI citation signups

# --- Search engines: fully allowed ---
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: DuckDuckBot
Allow: /

# --- Live-citation crawlers: keep allowed (these drive signups) ---
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# --- AI training crawlers: blocked across the marketing site ---
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Diffbot
Disallow: /

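# --- Belt-and-suspenders: keep app routes off-limits to every bot not named above ---
# (A crawler that matches its own group ignores this one; repeat the Disallow lines there if needed.)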
User-agent: *
Disallow: /app/
Disallow: /api/
Disallow: /admin/

Sitemap: https://yourstartup.com/sitemap.xml
# llms.txt is not a sitemap: serve it at https://yourstartup.com/llms.txt, with no Sitemap line

Why SaaS founders need this in 2026

📚

Protect your docs from being memorized

If a competitor's bundled AI assistant is trained on your full SDK and changelog, the developer never visits your docs again — they just ask the assistant. Block training, keep live citation, keep the signup path.

💰

Don't leak pricing strategy into training data

Your pricing, plan tiers and packaging are months of GTM iteration. Letting them flow into Gemini and GPT training data hands competitors a free benchmark. Block Google-Extended and GPTBot on /pricing.

🥊

Keep comparison pages out of training

Your "[Us] vs [Competitor]" pages are pure positioning. Let AI Overviews cite them with a link (good — drives clicks); don't let the next foundation model memorize them and answer the question without you (bad).

🔍

Don't lose Google traffic

Block Google-Extended (Gemini training) without touching Googlebot. Your marketing site keeps ranking and AI Overviews still cite you with a clickable link.

🛡️

Pair with llms.txt for full control

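Robots.txt blocks the bots you don't want. llms.txt, a proposed convention for a Markdown file at the site root, gives the bots you do want a curated summary of your product. Support varies by crawler, so treat it as shaping rather than a guarantee. Many SaaS sites in 2026 ship both: defense and shaping in one deploy.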

Drop into /public, ship in 30 seconds

Generate, copy, paste into public/robots.txt in your Next.js or Astro project, then deploy to Vercel, Netlify or Cloudflare Pages. The framework serves the file at the root of your domain. No code, no plugin, no infra change.

The AI training problem for SaaS in 2026

If you ship a B2B SaaS in 2026, your moat lives in three places: the product itself (auth-walled, safe), the marketing site (public, exposed), and the documentation (public, exposed). The marketing site and the docs are exactly what foundation models eat. A single Common Crawl snapshot can capture your homepage, feature pages, blog posts, API reference and changelog. Some 18 months later, a competitor's coding agent answers "show me how [your product category] handles webhook retries" without ever sending a developer to your domain.

This is not theoretical. Anthropic, OpenAI, Google, Apple, Meta and ByteDance all operate named training crawlers or training opt-out tokens on the public web in 2026. Most of them document robots.txt as the opt-out signal, though compliance is voluntary and not every crawler has a clean record of honoring it. The default, if you ship nothing, is that any of them may ingest your site quietly. The right SaaS posture in 2026 is the same one indie writers settled on: opt out of training, opt in to citation. Block the bots that take your work to train the next model. Keep allowed the bots that fetch your page on demand to answer a live question, with a link back to you.

What to block, what to allow — the SaaS playbook

1. Training-only crawlers — block these

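These bots exist primarily to harvest data for foundation-model training, and they send no traffic back to your site. Blocking them is pure upside. In the example above they are GPTBot, ClaudeBot, anthropic-ai, Google-Extended, Applebot-Extended, Bytespider, CCBot, Meta-ExternalAgent, cohere-ai and Diffbot.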

2. Live-citation crawlers — keep allowed (these drive signups)

These bots fetch a single page on demand when a user asks an AI a question, and the AI cites your page with a clickable link. They are the new top-of-funnel for SaaS. In the example above they are OAI-SearchBot, ChatGPT-User, PerplexityBot and Perplexity-User.

3. Search engines — always allowed

Googlebot, Bingbot, DuckDuckBot, Slurp (Yahoo) and YandexBot are your free traffic. Never block them.

4. Auth-walled app routes — already private

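Your /app, /api, /dashboard and /admin routes should sit behind authentication anyway. Add a belt-and-suspenders Disallow for those paths under User-agent: * so an accidental misconfiguration doesn't expose them to crawlers. A crawler that matches its own named group ignores the * group, so repeat the rules in that group if you need them to apply there too. Robots.txt is itself a public file, so never list a path whose name reveals customer data. If a path is sensitive, the answer is auth, not robots directives.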
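
The four categories above can be sketched as a small generator. This is a minimal Python sketch under stated assumptions, not the actual implementation behind the generator: the function name `build_robots_txt`, the bot lists and the private paths are hypothetical and simply mirror the example file. Because a crawler that matches a named group does not also read the `User-agent: *` group, the sketch repeats the private-path rules inside every allowed group instead of relying on the wildcard.

```python
from urllib.robotparser import RobotFileParser

# Illustrative lists mirroring the example file; adjust to your own policy.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Bytespider", "CCBot"]
ALLOWED_BOTS = ["Googlebot", "Bingbot", "OAI-SearchBot", "PerplexityBot"]
PRIVATE_PATHS = ["/app/", "/api/", "/admin/"]


def build_robots_txt(sitemap_url):
    """Return robots.txt text: training bots blocked, private paths closed to everyone."""
    lines = []
    # Training crawlers: blocked site-wide.
    for bot in TRAINING_BOTS:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    # Search and live-citation crawlers: everything except the private paths.
    # The rules are repeated per group because a named group replaces "*".
    for bot in ALLOWED_BOTS:
        lines.append(f"User-agent: {bot}")
        lines += [f"Disallow: {path}" for path in PRIVATE_PATHS]
        lines.append("")
    # Every other crawler: private paths only.
    lines.append("User-agent: *")
    lines += [f"Disallow: {path}" for path in PRIVATE_PATHS]
    lines += ["", f"Sitemap: {sitemap_url}", ""]
    return "\n".join(lines)


def parse(robots_txt):
    """Load robots.txt text into the standard-library parser for local checks."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    print(build_robots_txt("https://yourstartup.com/sitemap.xml"))
```

Writing the returned string to public/robots.txt before a build is enough to ship it. The `parse` helper lets you confirm locally that, for instance, Googlebot is still refused under /app/ even though it has its own group.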

The Google-Extended trick every SaaS gets wrong

The single biggest mistake we see in SaaS robots.txt files: founders write User-agent: * followed by Disallow: / in a panic about AI training, and accidentally block Google from crawling their entire marketing site. The fix is to know that Googlebot and Google-Extended are separate robots.txt tokens. Googlebot is the search crawler that decides whether /blog/best-saas-onboarding-2026 ranks for that query. Google-Extended is the token Google introduced specifically so publishers could opt out of Gemini training without losing search visibility. It has no crawler of its own and only controls how content fetched by Google's existing crawlers may be used. Block one, allow the other. The same logic applies to Apple: keep Applebot allowed (Siri and Spotlight search), block Applebot-Extended (Apple Intelligence training).
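
A quick way to see the difference is to evaluate both policies locally. The sketch below uses Python's standard-library `urllib.robotparser`; the two policy strings, the `allowed` helper and the URL are illustrative assumptions. Note that `robotparser` applies the first matching rule within a group, whereas Google applies the most specific one, so treat this as a sanity check rather than an exact model of Googlebot.

```python
from urllib.robotparser import RobotFileParser

# The panic policy: blocks every crawler, search engines included.
PANIC_POLICY = """\
User-agent: *
Disallow: /
"""

# The targeted policy: blocks only the training tokens.
TARGETED_POLICY = """\
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
"""

URL = "https://yourstartup.com/blog/best-saas-onboarding-2026"


def allowed(policy, user_agent, url=URL):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(policy.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for name, policy in [("panic", PANIC_POLICY), ("targeted", TARGETED_POLICY)]:
        for agent in ["Googlebot", "Google-Extended", "Applebot", "Applebot-Extended"]:
            print(f"{name:9} {agent:18} allowed={allowed(policy, agent)}")
```

Under the panic policy every agent is refused, Googlebot included. Under the targeted policy only the two "-Extended" tokens are refused.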

Should the robots.txt for a SaaS look different from a blog?

Mostly no, but a few SaaS-specific touches matter:

How to deploy on a typical SaaS stack

Most SaaS sites in 2026 run on one of four stacks. Each one auto-serves /public/robots.txt at the root of your domain — no code change required:

After deploy, validate with the robots.txt report in Google Search Console (the older standalone robots.txt tester has been retired) and a quick curl https://yourdomain.com/robots.txt. The whole job takes about ten minutes and locks in your AI policy across every page you'll ever ship.
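
The post-deploy check can also be scripted so it runs on every release. This is a minimal sketch assuming Python 3 and only the standard library; `check_policy`, `fetch_robots_txt`, the `EXPECTED` table and the domain are hypothetical names chosen for illustration.

```python
import urllib.request
from urllib.robotparser import RobotFileParser

# Hypothetical expectations: user agent -> should it be able to fetch the homepage?
EXPECTED = {
    "Googlebot": True,
    "OAI-SearchBot": True,
    "PerplexityBot": True,
    "GPTBot": False,
    "ClaudeBot": False,
    "Google-Extended": False,
}


def check_policy(robots_txt, site, expected=EXPECTED):
    """Return a list of mismatches; an empty list means the policy holds."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for agent, should_allow in expected.items():
        actual = parser.can_fetch(agent, f"{site}/")
        if actual != should_allow:
            problems.append(f"{agent}: expected allowed={should_allow}, got {actual}")
    return problems


def fetch_robots_txt(site):
    """Download robots.txt from the live site (needs network access)."""
    with urllib.request.urlopen(f"{site}/robots.txt", timeout=10) as response:
        return response.read().decode("utf-8")


# Usage after a deploy (needs network access):
#   site = "https://yourdomain.com"
#   for problem in check_policy(fetch_robots_txt(site), site):
#       print("MISMATCH:", problem)
```

Keeping the fetch separate from the check means the policy logic can be exercised in CI against the generated file before anything is deployed.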

The bigger picture: an AI policy for your SaaS

Robots.txt is necessary but not sufficient. The full SaaS-grade AI policy in 2026 is roughly:

Ship the robots.txt today. It's the easiest 10 minutes in your week, and the only ones that will still be working for you in 2028.

FAQ

Does my SaaS legally need to block AI training crawlers?

No jurisdiction currently requires it. But the EU AI Act requires providers of general-purpose models to respect machine-readable rights reservations made under EU copyright law, and robots.txt is the most widely used way to express one. Policy proposals in the UK and several US states in 2025–2026 have discussed similar machine-readable opt-outs. Shipping a strict robots.txt strengthens any future legal position you'd want to take if your content shows up verbatim in a model.

Will this break my Google ranking?

No. Googlebot (search) and Google-Extended (Gemini training) are separate robots.txt tokens, and blocking Google-Extended has no effect on search crawling or ranking. The example above keeps Googlebot fully allowed.

What about my docs site?

Same playbook: block GPTBot, ClaudeBot, Google-Extended, Bytespider; allow OAI-SearchBot, PerplexityBot, Bingbot. You want users asking ChatGPT how to use your product to be sent to your docs, not to get an answer from a memorized model.

Will this stop competitors from scraping my site manually?

No — robots.txt only addresses well-behaved crawlers. For determined scrapers, layer a CDN-level WAF rule (Cloudflare's free "Block AI Bots" or AWS WAF managed rules) and rate-limiting on top.

Ship your SaaS robots.txt in 30 seconds

Generate the file, copy it into /public/robots.txt, deploy. Lock in your AI policy across every page you'll ever ship.

Try the AI Robots.txt Generator →