AI crawler control in 2026: robots.txt, llms.txt, and bots

In 2026, your robots.txt file stopped being a quiet instruction for Googlebot and became a frontline budget-and-bandwidth decision. AI crawlers now probe your origin server constantly, and they’re wildly uneven in value: roughly 89% of AI crawler traffic is for training or mixed purposes, only about 8% is search-related, and about 2% responds to an actual user query. One widely cited analysis found Anthropic’s training crawler fetching over 20,000 pages for every referral it sent back. The job in 2026 is surgical: allow the bots that drive citations and clicks, block or rate-limit the ones that only consume your infrastructure to train a model, and stop guessing about llms.txt.

The core distinction: training bots vs. retrieval bots

Most AI providers now run at least two kinds of crawler, and conflating them is the most common mistake. One kind collects pages offline to train the model; the other fetches pages live to answer a user’s question right now and can cite you. You almost always want to block the first and allow the second.

Provider	Block (training/scraping)	Allow (search / user-triggered)
OpenAI	GPTBot	OAI-SearchBot, ChatGPT-User
Anthropic	ClaudeBot	Claude-SearchBot, Claude-User
Perplexity	—	PerplexityBot
Google	Google-Extended	Googlebot (also feeds AI Overviews)
Common Crawl	CCBot	—
Meta	Meta-ExternalAgent	—
Apple	Applebot-Extended	Applebot
Amazon	Amazonbot	—
ByteDance	Bytespider	—

Anthropic, for example, split its agents: ClaudeBot is the training scraper (block it to save bandwidth), while Claude-SearchBot and Claude-User drive citation traffic from real queries (allow them). Note Google’s wrinkle: there is no separate AI Overviews bot. AI Overviews are served from the same Googlebot crawl, so you cannot use robots.txt to block AI Overview usage without blocking Google Search itself. Google-Extended only governs Gemini model training, not Overviews.
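The split in the table above can be sketched as a small lookup, which is handy when tagging requests in logs or edge rules. This is a minimal sketch, not a definitive implementation: the `POLICY` mapping and the `classify_user_agent` helper are hypothetical names, the tokens are copied from the table, and the sample user-agent strings in the usage lines are illustrative only.

```python
# Hypothetical policy table mirroring the provider table above.
# "block" = training/scraping column, "allow" = search/user-triggered column.
POLICY = {
    "GPTBot": "block",
    "ClaudeBot": "block",
    "Google-Extended": "block",
    "CCBot": "block",
    "Meta-ExternalAgent": "block",
    "Applebot-Extended": "block",
    "Amazonbot": "block",
    "Bytespider": "block",
    "OAI-SearchBot": "allow",
    "ChatGPT-User": "allow",
    "Claude-SearchBot": "allow",
    "Claude-User": "allow",
    "PerplexityBot": "allow",
    "Googlebot": "allow",
    "Applebot": "allow",
}

# Check longer tokens first so "Applebot-Extended" is not mistaken for "Applebot".
_TOKENS_LONGEST_FIRST = sorted(POLICY, key=len, reverse=True)


def classify_user_agent(user_agent: str) -> str:
    """Return "allow", "block", or "unknown" for a raw User-Agent string."""
    ua = user_agent.lower()
    for token in _TOKENS_LONGEST_FIRST:
        if token.lower() in ua:
            return POLICY[token]
    return "unknown"


# Illustrative strings only; real crawler user agents may differ.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.1)"))         # block
print(classify_user_agent("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))  # allow
```

User-agent strings are trivially spoofed, so a match here is a hint for reporting rather than proof of identity. Google-Extended and Applebot-Extended are generally documented as robots.txt control tokens rather than crawlers with their own request user agent, so they are unlikely to show up in logs.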

A practical robots.txt template

A defensible 2026 default for most businesses that want AI citations but not to subsidize model training:

# Search engines — full access
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /

# AI search & user-triggered retrieval — allow (these can cite you)
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /

# Training & bulk scrapers — block (no citation value, heavy load)
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Applebot-Extended
Disallow: /

Sitemap: https://www.example.com/sitemap-index.xml

Two caveats. First, robots.txt is voluntary — well-behaved bots honor it; aggressive scrapers don’t, so pair it with WAF/firewall rules or Cloudflare’s bot controls for the ones that ignore directives. Second, the allow/block call is a business decision: a publisher protecting paywalled IP blocks aggressively; a brand desperate for AI visibility may allow more. Decide deliberately.
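Before deploying a file like the template above, it can help to check that it parses the way you intend. The sketch below uses Python’s standard `urllib.robotparser` on a shortened copy of the template. The `check_access` helper and the example URL are hypothetical, and Python’s parser is only one interpretation of the rules, so real crawlers may resolve edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the template above, for illustration.
ROBOTS_TXT = """\
# Search engines
User-agent: Googlebot
Allow: /

# AI retrieval bots
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /

# Training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
"""


def check_access(robots_txt: str, agents: list[str],
                 url: str = "https://www.example.com/pricing") -> dict[str, bool]:
    """Return {agent: True if the parsed rules let that agent fetch url}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["Googlebot", "OAI-SearchBot", "Claude-SearchBot",
              "GPTBot", "ClaudeBot", "Google-Extended"]
    for agent, allowed in check_access(ROBOTS_TXT, agents).items():
        print(f"{agent:18} {'allowed' if allowed else 'blocked'}")
```

A check like this only confirms what the file says. It does nothing about bots that ignore the file, which is why the first caveat still applies.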

llms.txt: contested, not standard

llms.txt is a proposed file, a curated Markdown list of your key content for LLMs, that has had a lot of hype. Be clear-eyed: Google has publicly advised against building AI reference files like llms.txt, and stated that what drives AI visibility is the same fundamentals as search (crawlable content and authority), not a special file. Multiple analyses found that no major AI engine has confirmed it consumes llms.txt for ranking.

So where does that leave you?

Don’t expect llms.txt to move rankings. No engine has confirmed it as a ranking input. Treat claims otherwise as speculative.
It is cheap and low-risk to add as a curated index for the smaller tools that do experiment with it, but don’t prioritize it over real content and authority work.
Spend the effort on fundamentals instead: crawlable server-rendered HTML, E-E-A-T signals, third-party citations, and answer engine optimization. That’s what actually earns AI citations.

The rendering trap that blocks AI bots entirely

The most damaging technical issue isn’t your robots.txt — it’s client-side rendering. AI crawlers do not execute JavaScript the way a browser does; most read only the HTML your server returns. If your important content is rendered client-side after page load, AI engines often see an empty shell and can’t cite what they can’t read. Server-render or pre-render anything you want surfaced. This is the same discipline that protects Core Web Vitals and is why static and server-rendered sites have a quiet advantage in the AI-search era.
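One rough way to test for this trap is to look at the raw HTML your server returns, with no JavaScript executed, and check whether a key phrase appears in the visible text. The sketch below is a simplified approximation of what a non-rendering crawler reads. The `phrase_in_server_html` helper and both sample pages are hypothetical.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    _SKIPPED = ("script", "style")

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def phrase_in_server_html(html: str, phrase: str) -> bool:
    """True if phrase appears in the text of the raw HTML (no JS executed)."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    # Normalize whitespace and case before a plain substring search.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return " ".join(phrase.split()).lower() in text


# Hypothetical pages: one server-rendered, one client-side shell.
SERVER_RENDERED = "<html><body><h1>Pricing</h1><p>Plans start at $10 per month.</p></body></html>"
CLIENT_SHELL = ('<html><body><div id="root"></div>'
                '<script>render("Plans start at $10 per month.")</script></body></html>')

print(phrase_in_server_html(SERVER_RENDERED, "Plans start at $10"))  # True
print(phrase_in_server_html(CLIENT_SHELL, "Plans start at $10"))     # False
```

In practice you would feed this the body of a plain HTTP GET for each important URL. A phrase split across inline tags can produce a false negative, so treat a miss as a prompt to inspect the page rather than a verdict.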

A crawl-budget and bot-management checklist

Audit your logs. Identify which AI bots are hitting you and how hard. You can’t manage what you don’t measure.
Block the training crawlers that send no referrals (GPTBot, ClaudeBot, CCBot, Bytespider, Google-Extended, Meta-ExternalAgent, Amazonbot, Applebot-Extended).
Explicitly allow the retrieval bots that can cite you (OAI-SearchBot, ChatGPT-User, PerplexityBot, Claude-SearchBot, Claude-User).
Rate-limit at the edge for bots that ignore robots.txt — Cloudflare bot rules or WAF.
Serve real HTML for anything you want cited; don’t hide content behind client-side rendering.
Keep your sitemap clean and current — it’s still how every crawler, AI or not, discovers your pages efficiently.
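The first item in the checklist, auditing your logs, can start with a script as small as the one below. It is a sketch under stated assumptions: the access log is in the common "combined" format with the user agent as the last quoted field, and the token list and the `count_ai_bot_hits` helper are hypothetical choices for illustration.

```python
import re
from collections import Counter

# Tokens to look for in the User-Agent field (taken from the lists above).
AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "CCBot", "Meta-ExternalAgent",
    "Bytespider", "Amazonbot",
]

# Assumes combined log format: the last double-quoted field is the user agent.
_UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_bot_hits(log_lines) -> Counter:
    """Count requests per AI bot token across an iterable of log lines."""
    hits: Counter = Counter()
    for line in log_lines:
        match = _UA_FIELD.search(line)
        if not match:
            continue  # line is not in the assumed format
        user_agent = match.group(1).lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits


if __name__ == "__main__":
    import sys
    # Usage: python audit_bots.py /path/to/access.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for bot, count in count_ai_bot_hits(handle).most_common():
            print(f"{count:8d}  {bot}")
```

The counts show which bots are hitting you and how hard. Comparing them against referral traffic from the same providers gives a rough crawl-to-referral picture for deciding which bots to block or rate-limit.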

FAQ: AI crawler control

Will blocking GPTBot hurt my ChatGPT visibility? No — that’s the key distinction. GPTBot is for training. ChatGPT’s live answers that can cite you come from OAI-SearchBot and ChatGPT-User. Block GPTBot and allow those two and you keep citation eligibility while not feeding the training corpus.

Can I block AI Overviews specifically? Not with robots.txt, unless you also block Google Search. AI Overviews are served from the standard Googlebot crawl. Google-Extended only controls Gemini model training, not Overviews.

Should I add an llms.txt file? It won’t hurt and it’s cheap, but no major engine has confirmed it as a ranking input, and Google advises against relying on it. Prioritize crawlable content and authority over a special file.

Do aggressive scrapers obey robots.txt? Often not. Well-behaved bots honor it; many aggressive scrapers ignore it. Back robots.txt with edge rate-limiting (Cloudflare, WAF) for bots that ignore directives.

Why does crawl budget suddenly matter? Because AI training crawlers can hit thousands of pages per referral, driving up egress and server load while returning little to no traffic. Managing them is now a real cost-control issue, not just an SEO nicety.

The honest take

AI crawler control in 2026 is a deliberate two-part decision: which bots earn access to your content, and whether your content is even readable to them. Block the training scrapers that consume bandwidth and return nothing; explicitly allow the retrieval bots that can cite you; and make sure your important content is server-rendered HTML, not a client-side shell. Skip the llms.txt hype — Google itself says it’s not the lever. The fundamentals that win AI citations are the same ones that have always won search: accessible content and real authority. The robots.txt file just decides who gets to use them.