The dual-sided problem for ML engineers
Most niches face the AI training question from one direction only: how do I keep crawlers out? ML engineers sit on both sides of the table. You publish to the open web — demos, model cards, ablation notes, write-ups, retrospectives — and you also build crawlers, evaluation harnesses, retrieval pipelines and dataset collectors that fetch other people's content. The same robots.txt file that protects your own work is the contract you are expected to honor when you build the next one.
This page is about the first half — protecting your own surface area. The default for an ML engineer's personal site, blog, demo Space or research project page should be: training crawlers blocked, citation crawlers allowed, search engines untouched, inference endpoints disallowed for everyone. The generator on this site presets exactly that combination — toggle individual bots if you want to deviate.
What goes on an ML engineer's domain (and why each surface matters)
Most ML engineer personal sites contain four kinds of pages, and each one has a different relationship to AI crawlers:
- Technical blog posts and notes. Original distillations of papers, reproduction reports, ablation studies. Highest training-data value, highest risk of being absorbed without attribution. Block training crawlers, allow citation crawlers.
- Model demos and inference endpoints. A Hugging Face Space embed, a self-hosted Gradio app, a /predict route on your domain. Crawlers fetching these cost real GPU money. Disallow under User-agent: * for every bot.
- Project pages and model cards. Public documentation of OSS work, often the page recruiters land on. You usually want these surfaced by AI citations, so allow OAI-SearchBot, PerplexityBot and ChatGPT-User. Optionally allow training crawlers too if the page already mirrors a public README.
- Resume, contact, and recruiter pages. Highest-value pages for hiring. Search engines and citation crawlers, all allowed. Training crawlers, blocked — there is no upside to having your CV reproduced inside a model's output.
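The per-surface policy above can be sketched as a single generated robots.txt and checked locally with Python's standard-library parser. This is a minimal sketch, not the output of the generator on this site: the helper names and the example.com URLs are hypothetical, the bot list is the training-only set named in the FAQ below, and the endpoint prefixes are written without a trailing slash on the assumption that a bare /predict route should match as well.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the training-only tokens named elsewhere on this page.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Bytespider"]
# Assumption: prefixes without a trailing slash, so "/predict" itself matches.
ENDPOINT_PREFIXES = ["/api/", "/predict", "/infer"]


def build_robots_txt(training_bots=TRAINING_BOTS, endpoints=ENDPOINT_PREFIXES):
    """Block training bots site-wide; block endpoints for everyone else."""
    lines = [f"User-agent: {bot}" for bot in training_bots]
    lines.append("Disallow: /")
    lines.append("")  # a blank line separates the two groups
    lines.append("User-agent: *")
    lines += [f"Disallow: {prefix}" for prefix in endpoints]
    return "\n".join(lines) + "\n"


def load_rules(robots_txt):
    """Parse robots.txt text in memory, with no network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    text = build_robots_txt()
    print(text)
    rules = load_rules(text)
    for bot in ["GPTBot", "OAI-SearchBot", "Googlebot"]:
        for path in ["/blog/post", "/api/predict"]:
            url = "https://example.com" + path
            print(bot, path, rules.can_fetch(bot, url))
```

Citation crawlers and search engines are not listed by name: they fall through to the User-agent: * group, which only closes off the inference routes. A bot that matches a named group ignores the * group entirely, which is why the training group uses a blanket Disallow: / rather than relying on the endpoint rules.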
The Google-Extended question, for the people building Gemini
Many ML engineers ask whether blocking Google-Extended is hypocritical when their day job involves training models on web data. It isn't. Google-Extended is an explicit opt-out signal that Google itself shipped specifically so site owners could express a preference. Honoring (or expressing) that preference is exactly how the protocol is supposed to work. Blocking it on your personal domain has no effect on Google Search rankings, because Googlebot and Google-Extended are separate robots.txt tokens (Google-Extended is a control token, not a crawler with its own user-agent string). It also sets a coherent professional norm: if you build crawlers that obey robots.txt, you also write one that uses it.
Inference endpoints: the silent budget killer
The single most common preventable bill for an indie ML engineer in 2026 is a crawler iterating across a public /api/predict endpoint. CCBot, Bytespider and a long tail of unbranded scrapers will happily fire thousands of requests at any URL they discover, and serverless GPU bills do not forgive. A four-line robots.txt group stops most of the well-behaved offenders:
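    User-agent: *
    Disallow: /api/
    Disallow: /predict
    Disallow: /infer

Rules match by path prefix, so /predict and /infer are written without a trailing slash to cover both the bare route and anything beneath it.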
For the badly-behaved scrapers, layer in a per-IP rate limit and require a header token. Robots.txt is the polite first line; the firewall is the enforcement.
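As one way to picture that enforcement layer, here is a minimal, framework-agnostic sketch of a header-token check combined with a per-IP sliding-window rate limit. The class name, the X-Demo-Token header and the limits are all hypothetical, and the state lives in process memory, so a multi-worker deployment would need a shared store or a limit applied at the proxy or firewall instead.

```python
import hmac
from collections import defaultdict, deque


class EndpointGuard:
    """Decide whether one request to an inference route may proceed."""

    def __init__(self, token, max_requests=5, window_seconds=60.0):
        self._token = token.encode()
        self._max = max_requests
        self._window = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def check(self, ip, headers, now):
        """Return an HTTP status code: 200 accept, 401 bad token, 429 too many.

        `headers` is assumed to be a plain dict keyed by the exact header name;
        real frameworks usually expose case-insensitive header mappings.
        """
        supplied = headers.get("X-Demo-Token", "").encode()
        # Constant-time comparison avoids leaking the token through timing.
        if not hmac.compare_digest(supplied, self._token):
            return 401
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._max:
            return 429
        hits.append(now)
        return 200


if __name__ == "__main__":
    guard = EndpointGuard("change-me", max_requests=2, window_seconds=10.0)
    ok = {"X-Demo-Token": "change-me"}
    for t in [0.0, 1.0, 2.0, 10.0]:
        print(t, guard.check("203.0.113.7", ok, now=t))
```

Checking the token before touching the rate-limit table means unauthenticated scrapers never consume a slot, and passing `now` explicitly keeps the logic easy to test without sleeping.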
llms.txt for ML engineers: how to be cited correctly
Once your robots.txt is in place, consider publishing an llms.txt manifest. llms.txt is a proposed convention rather than a standard, and how widely retrieval systems read it is still unsettled. For an ML engineer this file functions as a structured "press kit" for retrieval-time LLMs: it points at your strongest papers, your shipped models, your benchmark numbers and the canonical URL for each. When a recruiter or PM asks an LLM "who has worked on long-context attention?", a well-formed llms.txt improves the odds of being cited as the source rather than silently paraphrased.
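A small sketch of what generating such a manifest could look like, following the layout in the public llms.txt proposal: an H1 title, a blockquote summary, then H2 sections of Markdown links with short notes. The function name, the person and every URL below are hypothetical placeholders, not a required schema.

```python
def render_llms_txt(title, summary, sections):
    """Render an llms.txt-style Markdown manifest.

    `sections` maps a heading to a list of (name, url, note) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    manifest = render_llms_txt(
        title="Jane Doe",  # hypothetical author
        summary="ML engineer working on long-context attention.",
        sections={
            "Write-ups": [
                ("Long-context attention notes",
                 "https://example.com/notes/long-context",
                 "Reproduction report with ablations"),
            ],
            "Models": [
                ("Demo model card",
                 "https://example.com/models/demo",
                 "Canonical model card"),
            ],
        },
    )
    print(manifest)
```

Serving the result at /llms.txt keeps one canonical URL per artifact in a single place, which is the property the paragraph above relies on.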
The broader landscape: what frontier labs are doing
Practically every major lab now documents a training crawler and separate live-retrieval agents. The pattern, as of 2026: OpenAI ships GPTBot (training) and OAI-SearchBot + ChatGPT-User (retrieval). Anthropic ships ClaudeBot (training) and Claude-SearchBot + Claude-User (retrieval); Claude-Web is an older token you may still see in block lists. Google exposes Google-Extended as a robots.txt token for AI use, separate from Googlebot. The Cloudflare AI crawler reference tracks the full list and updates it as new ones appear. The split is now standard practice, and it lets you allow citations without contributing training data, which is the right default for almost every ML engineer's public site.
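To see how that split plays out on your own domain, a rough sketch like the following can tally access-log user agents by role. The table is deliberately partial and only uses tokens named on this page; the role labels and function names are assumptions. Google-Extended is omitted on purpose, since it is a robots.txt control token and is not expected to show up as a user-agent string in logs.

```python
from collections import Counter

# Assumption: a partial token -> role table built from names on this page.
AGENT_ROLES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Bytespider": "training",
    "OAI-SearchBot": "retrieval",
    "ChatGPT-User": "retrieval",
    "PerplexityBot": "retrieval",
    "Googlebot": "search",
}


def classify(user_agent):
    """Return the role of the first known token found in a User-Agent header."""
    ua = user_agent.lower()
    for token, role in AGENT_ROLES.items():
        if token.lower() in ua:
            return role
    return "unknown"


def tally(user_agents):
    """Count requests per role for an iterable of User-Agent strings."""
    return Counter(classify(ua) for ua in user_agents)


if __name__ == "__main__":
    # Illustrative strings only; real headers vary by vendor and version.
    sample = [
        "Mozilla/5.0 (compatible; GPTBot/1.0)",
        "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
        "Mozilla/5.0 (compatible; Googlebot/2.1)",
        "curl/8.0",
    ]
    print(tally(sample))
```

User-agent strings are trivially spoofed, so a tally like this is a traffic estimate, not an access-control mechanism.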
Frequently asked questions
Should an ML engineer block AI crawlers from their portfolio site?
It depends on the surface. Recruiters reach you through search and increasingly through ChatGPT/Perplexity, so citation crawlers are usually worth allowing. Training-only crawlers like GPTBot, ClaudeBot, Google-Extended and Bytespider give you nothing back — they absorb your write-ups into a competing model. The standard playbook is block training, allow citations.
Does blocking Google-Extended hurt my visibility in Gemini?
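It has no effect on Googlebot, so your indexing and rankings in normal search are unchanged. Within Gemini the token covers more than training: Google's crawler documentation describes Google-Extended as controlling use of your content both for training Gemini models and for grounding in Gemini Apps and Vertex AI. Disallowing it can therefore reduce how often Gemini itself draws on your pages, so weigh that against the training opt-out.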
What about my inference endpoint at /api/predict?
Disallow it for every crawler. A scraper that repeatedly hits an endpoint which runs a model on every request will burn GPU minutes, exhaust your rate-limit budget and pollute your logs. Pair the robots.txt rule with rate limiting and a header token, because robots.txt alone does not stop bad actors.
Does llms.txt help my ML portfolio get cited?
It can. llms.txt is a structured manifest telling retrieval-time LLMs which sections of your site matter and how to describe them. For an ML engineer, a good llms.txt highlights papers, model cards, OSS repos and demos, so that when a recruiter asks an LLM about your specialty, your name is more likely to surface with a clickable citation. Support varies between tools, so treat it as a low-cost addition, not a guarantee.
If I build my own crawler, should I respect other people's robots.txt?
Yes, unambiguously. It is the baseline standard for any engineer running a crawler: evals, dataset collection, RAG, anything. Send a published, identifiable User-Agent, fetch each origin's robots.txt before requesting any other URL from it, cache the result for no more than 24 hours, and respect every Disallow that applies to your agent. Several frontier labs treat robots.txt compliance as a hiring signal.
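Those steps can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the user-agent string and class name are hypothetical, the fetcher is injectable so the cache logic can be exercised offline, and HTTP error handling (for example, how to treat a missing or failing robots.txt) is left out and would need a deliberate policy in a real crawler.

```python
import time
import urllib.request
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-eval-crawler"  # hypothetical; publish your real one


def http_fetch(robots_url):
    """Download robots.txt text, identifying the crawler by User-Agent."""
    request = urllib.request.Request(robots_url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class RobotsCache:
    """Per-origin robots.txt rules, refetched once the TTL has elapsed."""

    def __init__(self, fetch=http_fetch, ttl_seconds=24 * 3600):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._cache = {}  # origin -> (fetched_at, parser)

    def allowed(self, url, now=None):
        """Return True if USER_AGENT may fetch `url` under the cached rules."""
        now = time.time() if now is None else now
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        entry = self._cache.get(origin)
        if entry is None or now - entry[0] >= self._ttl:
            parser = RobotFileParser()
            parser.parse(self._fetch(origin + "/robots.txt").splitlines())
            entry = (now, parser)
            self._cache[origin] = entry
        return entry[1].can_fetch(USER_AGENT, url)


if __name__ == "__main__":
    # Offline demo with a stub fetcher instead of a real network call.
    stub = lambda robots_url: "User-agent: *\nDisallow: /private/\n"
    cache = RobotsCache(fetch=stub)
    print(cache.allowed("https://example.com/private/data"))
    print(cache.allowed("https://example.com/blog/post"))
```

In a crawl loop, calling `allowed(url)` before every request and skipping the URL when it returns False is the whole contract; the cache only keeps that check from turning into an extra fetch per page.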