The AI training problem for SaaS in 2026
If you ship a B2B SaaS in 2026, your moat lives in three places: the product itself (auth-walled, safe), the marketing site (public, exposed), and the documentation (public, exposed). The marketing site and the docs are exactly what foundation models eat. A single Common Crawl dump pulls down your entire homepage, every feature page, every blog post, every API reference, every changelog entry — and 18 months later a competitor's coding agent answers "show me how [your product category] handles webhook retries" without ever sending a developer to your domain.
This is not theoretical. Anthropic, OpenAI, Google, Apple, Meta and ByteDance all run named training crawlers on the public web in 2026, and robots.txt is the opt-out signal the major vendors publicly commit to honoring, though compliance is voluntary. The default, if you ship nothing, is that all of them ingest your site quietly. The right SaaS posture in 2026 is the same one indie writers settled on: opt out of training, opt in to citation. Block the bots that take your work to train the next model. Keep allowing the bots that fetch your page on demand to answer a live question, with a link back to you.
What to block, what to allow — the SaaS playbook
1. Training-only crawlers — block these
These bots only exist to harvest data for foundation-model training. They send zero traffic back to your site. Blocking them is pure upside:
- GPTBot: OpenAI's training crawler
- ClaudeBot and anthropic-ai: Anthropic training
- Google-Extended: Gemini training (totally separate from Googlebot search)
- Applebot-Extended: Apple Intelligence training
- Bytespider: TikTok / ByteDance / Doubao training
- CCBot: Common Crawl, the training-data backbone of most open-source LLMs
- Meta-ExternalAgent, cohere-ai, Diffbot, Omgilibot: smaller training crawlers
2. Live-citation crawlers — keep allowed (these drive signups)
These bots fetch a single page on demand when a user asks an AI a question, and the AI cites your page with a clickable link. They are the new top-of-funnel for SaaS:
- OAI-SearchBot: ChatGPT search and live answers
- ChatGPT-User: when a ChatGPT user clicks "browse the web"
- PerplexityBot and Perplexity-User: Perplexity answer engine
- Bingbot: Microsoft search and Copilot grounding
3. Search engines — always allowed
Googlebot, Bingbot, DuckDuckBot, Slurp (Yahoo) and YandexBot are your free traffic. Never block them.
4. Auth-walled app routes — already private
Your /app, /api, /dashboard and /admin routes should sit behind authentication anyway. Add a belt-and-suspenders Disallow for those paths under User-agent: * so an accidental misconfiguration doesn't expose anything to crawlers. Remember that robots.txt is itself a public file, so never list customer-specific or secret paths in it. If a path is sensitive, the answer is auth, not robots directives.
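The four rules above can be sketched as a small generator. This is a minimal sketch, not a canonical file: the crawler names and paths come from the lists in this playbook, while `build_robots_txt` and the example sitemap URL are hypothetical names chosen for illustration. One detail is easy to miss: a crawler follows only the most specific group that names it, so the private-path rules are repeated in the live-citation group rather than inherited from `User-agent: *`.

```python
# Crawler tokens as listed in this playbook; review them before shipping.
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "anthropic-ai", "Google-Extended",
    "Applebot-Extended", "Bytespider", "CCBot", "Meta-ExternalAgent",
    "cohere-ai", "Diffbot", "Omgilibot",
]
CITATION_BOTS = [
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
    "Perplexity-User", "Bingbot",
]
# Auth-walled routes; the Disallow is insurance, not access control.
PRIVATE_PATHS = ["/app/", "/api/", "/admin/", "/account/", "/dashboard/"]


def build_robots_txt(sitemap_url: str) -> str:
    """Return robots.txt text: block training bots, allow everything else."""
    private = [f"Disallow: {path}" for path in PRIVATE_PATHS]
    groups = [
        # 1. Training-only crawlers: blocked everywhere.
        [f"User-agent: {bot}" for bot in TRAINING_BOTS] + ["Disallow: /"],
        # 2. Live-citation crawlers: allowed, minus the private routes.
        [f"User-agent: {bot}" for bot in CITATION_BOTS] + private + ["Allow: /"],
        # 3. Everyone else, including search engines: same private routes.
        ["User-agent: *"] + private,
    ]
    body = "\n\n".join("\n".join(group) for group in groups)
    return f"{body}\n\nSitemap: {sitemap_url}\n"


if __name__ == "__main__":
    # example.com is a placeholder domain.
    print(build_robots_txt("https://example.com/sitemap.xml"), end="")
```

Writing the output to your static-assets directory at build time keeps the policy in one reviewable place instead of a hand-edited file.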
The Google-Extended trick every SaaS gets wrong
The single biggest mistake we see in SaaS robots.txt files: founders write a blanket User-agent: * rule with Disallow: / in a panic about AI training, and accidentally deindex their entire marketing site from Google. The fix is to know that Googlebot and Google-Extended are completely different user-agent tokens. Googlebot is the search crawler that decides whether /blog/best-saas-onboarding-2026 ranks for that query. Google-Extended is the token Google introduced specifically so publishers could opt out of Gemini training without losing search visibility. Block one, allow the other. The same logic applies to Apple: keep Applebot allowed (Siri search), block Applebot-Extended (Apple Intelligence training).
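The difference between the panic rule and the targeted rule is easy to check locally with Python's standard-library `urllib.robotparser`. This is a sketch under assumptions: the two policy strings are illustrative, `allowed` is a hypothetical helper, and Python's parser uses simple first-match and substring rules, so it approximates how real crawlers interpret the file rather than reproducing it exactly.

```python
from urllib.robotparser import RobotFileParser

# The blanket rule written in a panic: it blocks search crawlers too.
PANIC = """\
User-agent: *
Disallow: /
"""

# The targeted rule: block only the training tokens.
TARGETED = """\
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Report whether user_agent may fetch path under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    page = "/blog/best-saas-onboarding-2026"
    agents = ("Googlebot", "Google-Extended", "Applebot", "Applebot-Extended")
    for name, policy in (("panic", PANIC), ("targeted", TARGETED)):
        for agent in agents:
            print(f"{name:9} {agent:18} allowed={allowed(policy, agent, page)}")
```

Under the panic policy every agent is refused, Googlebot included. Under the targeted policy only the two "-Extended" tokens are refused.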
Should the robots.txt for a SaaS look different from a blog?
Mostly no, but a few SaaS-specific touches matter:
- Two index files: publish /sitemap.xml for crawlers and /llms.txt for AI assistants. llms.txt is a proposed convention (not yet a formal standard) for handing live-citation bots a clean summary of your product, key docs and pricing so that when ChatGPT or Perplexity answer a user, they answer accurately and link to you.
- Block your auth routes: /app/, /api/, /admin/, /account/, /dashboard/. These sit behind auth so crawlers can't reach them anyway, but the explicit Disallow is cheap insurance against future misconfigurations.
- Don't block /docs: counterintuitively, you want developers asking ChatGPT "how do I authenticate with [your-product]" to get an accurate answer that links to your docs. That's a signup. Block training crawlers on /docs (so the model can't memorize the SDK), allow live-citation crawlers (so users still get pointed to you).
- Be careful with /pricing: your pricing, plans and packaging took months to land. Blocking training crawlers there protects your GTM strategy from being rolled into competitors' AI tools.
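For the llms.txt file mentioned in the list above, a minimal sketch of a generator might look like the following. It assumes the commonly described layout of the llms.txt proposal (a title heading, a one-line blockquote summary, then sections of markdown links). The function name, product name and URLs are all hypothetical placeholders.

```python
def build_llms_txt(name: str, summary: str, sections: dict) -> str:
    """Build llms.txt text.

    sections maps a heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder product and URLs for illustration only.
    print(build_llms_txt(
        "Acme Webhooks",
        "Webhook delivery and retry infrastructure for B2B apps.",
        {
            "Docs": [
                ("Authentication", "https://example.com/docs/auth",
                 "API keys and OAuth setup"),
            ],
            "Pricing": [
                ("Plans", "https://example.com/pricing",
                 "Current plans and limits"),
            ],
        },
    ))
```

Keeping the summary short and the links few is a deliberate choice here: the file is meant to be a pointer to your key pages, not a copy of them.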
How to deploy on a typical SaaS stack
Most SaaS sites in 2026 run on one of four stacks. Each one serves a static robots.txt from its static-assets directory at the root of your domain, with no code change required:
- Next.js (Vercel): drop the file at /public/robots.txt. Or, for a programmatic version that can branch on env, add an app/robots.ts metadata file whose default export returns the rules.
- Astro / Vite / SvelteKit (Netlify or Vercel): drop the file at /public/robots.txt (SvelteKit uses /static/robots.txt instead). Verify after deploy by curling https://yourdomain.com/robots.txt.
- Cloudflare Pages: place the file at the root of your build output directory. Combine with Cloudflare's free "Block AI Bots" managed WAF rule to also stop the minority of crawlers that ignore robots.txt.
- Marketing site separate from app: if your marketing site is a Webflow/Framer/WordPress instance and your app is on a different domain (app.yourdomain.com), put a strict robots.txt on the marketing domain and a near-empty one on the app domain (everything's behind auth there anyway).
After deploy, validate with the robots.txt report in Google Search Console (the standalone robots.txt tester has been retired) and a quick curl https://yourdomain.com/robots.txt. The whole job takes about ten minutes and locks in your AI policy across every page you'll ever ship.
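The curl check confirms the file is served, but not that it says what you intended. A small script can assert the policy itself. This is a sketch under assumptions: `EXPECTED`, `check_policy` and `fetch_robots_txt` are hypothetical names, the sample paths are placeholders, and the result reflects Python's `urllib.robotparser` interpretation of the file, which may differ in edge cases from a given crawler's.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

# (user agent, path) -> whether the fetch should be allowed.
# Adjust these to the policy you actually intend to ship.
EXPECTED = {
    ("GPTBot", "/docs/webhooks"): False,
    ("CCBot", "/pricing"): False,
    ("OAI-SearchBot", "/docs/webhooks"): True,
    ("Googlebot", "/pricing"): True,
    ("Googlebot", "/dashboard/"): False,
}


def check_policy(robots_txt: str) -> list[str]:
    """Return one message per expectation the robots.txt text violates."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    failures = []
    for (agent, path), want in EXPECTED.items():
        got = parser.can_fetch(agent, path)
        if got != want:
            failures.append(f"{agent} {path}: expected {want}, got {got}")
    return failures


def fetch_robots_txt(domain: str) -> str:
    """Download the live robots.txt for a domain over HTTPS."""
    with urlopen(f"https://{domain}/robots.txt", timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling `check_policy(fetch_robots_txt("yourdomain.com"))` after each deploy returns an empty list when the live file matches the intended policy, which makes it easy to wire into a CI step.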
The bigger picture: an AI policy for your SaaS
Robots.txt is necessary but not sufficient. The full SaaS-grade AI policy in 2026 is roughly:
- robots.txt at the root — block training crawlers, keep live-citation and search crawlers.
- llms.txt at the root — a markdown file telling allowed AI assistants what your product is, with links to key docs and pricing.
- Cloudflare / Vercel WAF rule for AI bots — handles the holdouts that ignore robots.txt.
- An AI Disclosure in your privacy policy — required or expected in a growing number of jurisdictions if you use AI in the product. (TinyTools also has a free AI Disclosure Generator if you don't have one yet.)
- Terms of Service language — explicit "no training" clause for content you publish. The robots.txt is the technical signal; ToS is the legal one. Both matter.
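For the WAF item in the list above, the same idea can be sketched at the application layer. This is an illustrative WSGI middleware, not Cloudflare's or Vercel's rule: `BLOCKED_TOKENS`, `is_training_bot` and `BlockTrainingBots` are hypothetical names, and a crawler that ignores robots.txt may also spoof its User-Agent, which this check cannot catch. Google-Extended and Applebot-Extended are left out on purpose, because they are robots.txt tokens and never appear in a request header.

```python
# Lowercase substrings of the training crawlers' User-Agent headers.
BLOCKED_TOKENS = (
    "gptbot", "claudebot", "anthropic-ai", "bytespider", "ccbot",
    "meta-externalagent", "cohere-ai", "diffbot", "omgilibot",
)


def is_training_bot(user_agent: str) -> bool:
    """True if the User-Agent header names a blocked training crawler."""
    lowered = user_agent.lower()
    return any(token in lowered for token in BLOCKED_TOKENS)


class BlockTrainingBots:
    """WSGI middleware that answers 403 to blocked crawlers."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if is_training_bot(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else passes through to the wrapped application.
        return self.app(environ, start_response)
```

Wrapping an existing WSGI application as `BlockTrainingBots(app)` is enough to apply it. A managed WAF rule is still the better fit where available, since it is maintained for you and runs before requests reach your servers.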
Ship the robots.txt today. It's the easiest 10 minutes in your week and the only one that's still working for you in 2028.