🧠 Built for ML engineers · 25+ AI crawlers · Free

AI Robots.txt Generator for ML Engineers

Your model demos, technical write-ups and inference endpoints don't belong inside a competitor's training corpus. Generate a robots.txt tuned for ML portfolios, paper mirrors and Hugging Face Spaces — in under a minute.

What an ML engineer's robots.txt looks like

# robots.txt for an ML engineer / researcher
# Goal: protect model demos & research, keep search + AI citations

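# --- Search engines: allowed, except inference endpoints ---
# A crawler that has its own group ignores the * group, so the
# endpoint rules are repeated here.
User-agent: Googlebot
User-agent: Bingbot
Disallow: /api/
Disallow: /predict/
Disallow: /infer/
Allow: /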

# --- AI training crawlers: blocked ---
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: cohere-ai
Disallow: /

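# --- Live citation crawlers: allowed for recruiter discoverability ---
# Endpoint rules are repeated because a named group overrides the * group.
User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
Disallow: /api/
Disallow: /predict/
Disallow: /infer/
Allow: /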

# --- Inference endpoints: default for crawlers with no group of their own ---
User-agent: *
Disallow: /api/
Disallow: /predict/
Disallow: /infer/

Sitemap: https://yourdomain.dev/sitemap.xml
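Under the Robots Exclusion Protocol (RFC 9309), a crawler follows only the group that names it most specifically and falls back to the * group only when no named group matches. It is worth testing a draft file for that before deploying it. The sketch below is one way to do so with Python's standard-library parser; the helper name can_crawl, the endpoint URL and the two sample files are illustrative assumptions, not output from the generator.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical endpoint URL, used only for illustration.
ENDPOINT = "https://yourdomain.dev/api/predict"

# Pitfall: Googlebot has its own group, so it never reads the * group.
LEAKY = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /api/
"""

# Fix: repeat the endpoint rule inside the named group.
# Disallow is listed first because some parsers apply the first matching rule.
FIXED = """\
User-agent: Googlebot
Disallow: /api/
Allow: /

User-agent: *
Disallow: /api/
"""

if __name__ == "__main__":
    for label, text in (("leaky", LEAKY), ("fixed", FIXED)):
        for agent in ("Googlebot", "SomeUnlistedBot"):
            print(label, agent, can_crawl(text, agent, ENDPOINT))
```

Running it shows whether each agent may reach the endpoint under each file. The same helper can be pointed at your real robots.txt text and the crawler names you care about.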

Why ML engineers need this

📓

Protect novel write-ups

You spent two months distilling a paper into a clear blog post. There is no reason to hand that distillation to a model that will reproduce it tomorrow with no link to you.

🧪

Stop endpoint scraping

Crawlers iterating through /predict burn GPU minutes and eat into rate limits. Disallowing /api/, /predict/ and /infer/ keeps compliant crawlers away from the cheapest source of bill shock.

🔍

Stay findable by recruiters

Block training crawlers, keep OAI-SearchBot and PerplexityBot allowed. When a hiring manager asks ChatGPT "who's shipped sub-100ms retrieval?" your portfolio still surfaces.

📄

Defend your arXiv mirror

If you host preprints on your own domain, a clean robots.txt opt-out preserves your authorship signal before the next Common Crawl snapshot vacuums it up.

🛠️

Set a professional standard

If you also build crawlers for evals or RAG pipelines, publishing a clean robots.txt is the credibility signal that you understand both sides of the contract.

The dual-sided problem for ML engineers

Most niches face the AI training question from one direction only: how do I keep crawlers out? ML engineers sit on both sides of the table. You publish to the open web — demos, model cards, ablation notes, write-ups, retrospectives — and you also build crawlers, evaluation harnesses, retrieval pipelines and dataset collectors that fetch other people's content. The same robots.txt file that protects your own work is the contract you are expected to honor when you build the next one.

This page is about the first half — protecting your own surface area. The default for an ML engineer's personal site, blog, demo Space or research project page should be: training crawlers blocked, citation crawlers allowed, search engines untouched, inference endpoints disallowed for everyone. The generator on this site presets exactly that combination — toggle individual bots if you want to deviate.
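To make that default concrete, here is a minimal sketch of how such a preset could be rendered into a file. It is not the generator's actual code: the function name build_robots_txt and the grouping constants are assumptions, and the bot names are simply the ones used in the example file on this page.

```python
# Bot names come from the example file on this page; the grouping and
# the function itself are illustrative assumptions.
SEARCH = ["Googlebot", "Bingbot"]
TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]
CITATION = ["OAI-SearchBot", "PerplexityBot", "ChatGPT-User"]
ENDPOINTS = ["/api/", "/predict/", "/infer/"]


def build_robots_txt(search, training, citation, endpoints, sitemap=None):
    """Return robots.txt text: training blocked, search and citation
    crawlers allowed, endpoint paths disallowed in every group."""
    endpoint_rules = [f"Disallow: {path}" for path in endpoints]
    lines = []

    # Allowed crawlers share one group. Endpoint rules are repeated here
    # because a crawler with a named group does not read the * group.
    allowed = [*search, *citation]
    if allowed:
        lines += [f"User-agent: {agent}" for agent in allowed]
        lines += endpoint_rules + ["Allow: /", ""]

    # Training crawlers share a second group that blocks the whole site.
    if training:
        lines += [f"User-agent: {agent}" for agent in training]
        lines += ["Disallow: /", ""]

    # Every other crawler falls through to the wildcard group.
    lines += ["User-agent: *"] + endpoint_rules
    if sitemap:
        lines += ["", f"Sitemap: {sitemap}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(SEARCH, TRAINING, CITATION, ENDPOINTS,
                           sitemap="https://yourdomain.dev/sitemap.xml"))
```

Deviating from the preset then means moving a name from one list to another, for example shifting a citation crawler into the training list to block it.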

What goes on an ML engineer's domain (and why each surface matters)

Most ML engineer personal sites contain four kinds of pages, and each one has a different relationship to AI crawlers:

The Google-Extended question, for the people building Gemini

Many ML engineers ask whether blocking Google-Extended is hypocritical when their day job involves training models on web data. It isn't. Google-Extended is an explicit opt-out signal that Google itself shipped so site owners could express a preference. Honoring (or expressing) that preference is exactly how the protocol is supposed to work. Blocking it on your personal domain has no effect on Google Search rankings: Google-Extended is a standalone robots.txt token rather than a separate crawler, and Googlebot keeps crawling under its own rules. It also sets a coherent professional norm: if you build crawlers that obey robots.txt, you also write one that uses it.

Inference endpoints: the silent budget killer

The single most common preventable bill for an indie ML engineer in 2026 is a crawler iterating across a public /api/predict endpoint. CCBot, Bytespider and a long tail of unbranded scrapers will happily fire thousands of requests at any URL they discover, and serverless GPU bills do not forgive. Two lines of robots.txt, a User-agent: * line followed by Disallow: /api/, keep the well-behaved crawlers out, provided the rule is repeated in any group that names a crawler explicitly.

For the badly-behaved scrapers, layer in a per-IP rate limit and require a header token. Robots.txt is the polite first line; the firewall is the enforcement.
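As a rough illustration of that enforcement layer, the sketch below combines a header-token check with a per-IP sliding-window limit. It is framework-agnostic and deliberately small. The header name X-Api-Token, the limits and the function name check_request are all assumptions, and a production setup would more likely do this at the gateway or firewall.

```python
import hmac
import time
from collections import defaultdict, deque

# Hypothetical values; in practice load the token from a secret store.
API_TOKEN = "replace-me"
MAX_REQUESTS = 30        # accepted requests per window, per IP
WINDOW_SECONDS = 60.0

_recent = defaultdict(deque)  # ip -> timestamps of recently accepted requests


def check_request(ip, headers, now=None):
    """Return an HTTP status code: 401 bad token, 429 over limit, 200 ok."""
    now = time.monotonic() if now is None else now

    # Constant-time comparison avoids leaking token contents via timing.
    supplied = headers.get("X-Api-Token", "")
    if not hmac.compare_digest(supplied.encode(), API_TOKEN.encode()):
        return 401

    # Sliding window: drop timestamps older than the window, then count.
    window = _recent[ip]
    while window and now - window[0] >= WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return 429
    window.append(now)
    return 200
```

Call check_request before running the model and return the status code directly when it is not 200, so that rejected requests never reach the GPU.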

llms.txt for ML engineers: how to be cited correctly

Once your robots.txt is in place, consider publishing an llms.txt manifest. llms.txt is a proposed convention rather than a standard: a Markdown file at the site root that retrieval-time LLMs can read. For an ML engineer this file functions as a structured "press kit": it points at your strongest papers, your shipped models, your benchmark numbers and the canonical URL for each. When a recruiter or PM asks an LLM "who has worked on long-context attention?", a well-formed llms.txt improves the odds of being cited as the source rather than silently paraphrased, at least by tools that choose to read it.
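The llms.txt proposal describes a Markdown layout: a title heading, a short blockquote summary, then sections of annotated links. A minimal sketch of rendering one follows, assuming that layout. The function name render_llms_txt and every name and URL in the example are placeholders.

```python
def render_llms_txt(name, summary, sections):
    """Render an llms.txt-style Markdown manifest.

    sections maps a heading to a list of (title, url, note) tuples;
    an empty note omits the trailing description.
    """
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            suffix = f": {note}" if note else ""
            lines.append(f"- [{title}]({url}){suffix}")
        lines.append("")
    return "\n".join(lines)


# All names and URLs below are placeholders.
example = render_llms_txt(
    "Jane Doe",
    "ML engineer working on retrieval and long-context attention.",
    {
        "Papers": [
            ("Paper title", "https://yourdomain.dev/papers/example", "preprint"),
        ],
        "Demos": [
            ("Retrieval demo", "https://yourdomain.dev/demos/retrieval", ""),
        ],
    },
)

if __name__ == "__main__":
    print(example)
```

Save the result as /llms.txt at the root of your domain, next to robots.txt, and keep each link pointed at the canonical URL you want cited.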

The broader landscape: what frontier labs are doing

Practically every major lab now ships a documented training crawler and a separate live-retrieval crawler. The pattern, as of 2026: OpenAI ships GPTBot (training) and OAI-SearchBot + ChatGPT-User (retrieval). Anthropic ships ClaudeBot (training) and, per its current crawler documentation, Claude-SearchBot and Claude-User (retrieval); Claude-Web is an older name. Google offers Google-Extended as a training opt-out token that is separate from Googlebot. The Cloudflare AI crawler reference tracks the full list and updates it as new ones appear. The split is now standard practice, and it lets you allow citations without contributing training data, which is the right default for almost every ML engineer's public site.

Frequently asked questions

Should an ML engineer block AI crawlers from their portfolio site?

It depends on the surface. Recruiters reach you through search and increasingly through ChatGPT/Perplexity, so citation crawlers are usually worth allowing. Training-only crawlers like GPTBot, ClaudeBot, Google-Extended and Bytespider give you nothing back — they absorb your write-ups into a competing model. The standard playbook is block training, allow citations.

Does blocking Google-Extended hurt my visibility in Gemini?

It can. Google's crawler documentation describes Google-Extended as controlling whether your content is used to train Gemini models and to ground answers in Gemini Apps and Vertex AI, so disallowing it may reduce how often Gemini draws on your pages. It has zero effect on Googlebot: your rankings in normal search are unchanged.

What about my inference endpoint at /api/predict?

Disallow it for every crawler. Scrapers that trigger a model run on every request will burn GPU minutes, blow your rate-limit budget and pollute logs. Pair the robots.txt rule with rate-limiting and a header token, because robots.txt alone does not stop bad actors.

Does llms.txt help my ML portfolio get cited?

It can. llms.txt is a proposed convention: a structured Markdown manifest telling retrieval-time LLMs which sections of your site matter and how to describe them. For an ML engineer, a good llms.txt highlights papers, model cards, OSS repos and demos, so that when a recruiter asks an LLM about your specialty, tools that read the file can surface your name with a clickable citation.

If I build my own crawler, should I respect other people's robots.txt?

Yes, unambiguously. It is the baseline standard for any engineer running a crawler: evals, dataset collection, RAG, anything. Use a published User-Agent, fetch each origin's robots.txt before crawling it, cache it for no more than 24 hours, and respect Disallow. Several frontier labs treat robots.txt compliance as a hiring signal.
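A minimal sketch of that checklist, using only the Python standard library. The user-agent string, the class name RobotsCache and the helper fetch_robots are hypothetical. The error handling follows RFC 9309 as I read it: a 4xx response means no rules apply, while a 5xx or a network failure means the site should be treated as disallowed.

```python
import time
from urllib.error import HTTPError
from urllib.parse import urlsplit
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-eval-bot"   # hypothetical; publish your real one
CACHE_TTL = 24 * 60 * 60          # seconds
DENY_ALL = "User-agent: *\nDisallow: /\n"


def fetch_robots(origin):
    """Download robots.txt text for an origin such as https://example.org."""
    request = Request(origin + "/robots.txt",
                      headers={"User-Agent": USER_AGENT})
    try:
        with urlopen(request, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        # 4xx: no rules apply. 5xx: assume everything is disallowed.
        return "" if 400 <= err.code < 500 else DENY_ALL
    except OSError:
        return DENY_ALL  # unreachable: stay out until it can be fetched


class RobotsCache:
    """Caches one parsed robots.txt per origin for up to CACHE_TTL seconds."""

    def __init__(self, fetch=fetch_robots, clock=time.time):
        self._fetch = fetch
        self._clock = clock
        self._cache = {}  # origin -> (fetched_at, parser)

    def allowed(self, url):
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        entry = self._cache.get(origin)
        if entry is None or self._clock() - entry[0] >= CACHE_TTL:
            parser = RobotFileParser()
            parser.parse(self._fetch(origin).splitlines())
            entry = (self._clock(), parser)
            self._cache[origin] = entry
        return entry[1].can_fetch(USER_AGENT, url)
```

In a crawler loop, create one RobotsCache, call allowed(url) before every request and skip the URL when it returns False. The fetch and clock parameters exist so the cache can be tested without network access.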

Generate your ML portfolio's robots.txt now

Free, no signup. Pick your blocks, copy the file, drop it at the root of your domain. Done in under a minute.

Open the AI Robots.txt Generator →