llms.txt and AI Crawler Configuration

Make your content discoverable, crawlable, and citable by AI engines: a guide to the llms.txt standard and robots.txt configuration.

Machine translation

This page was machine-translated from the Chinese original. Please report inaccuracies to mail@geo.fan.

TL;DR: To be cited by AI engines, two things must be in place first: your robots.txt must not block AI crawlers, and you should publish llms.txt / llms-full.txt so AI can quickly understand your site structure. Both are technical configurations that take about half a day each, and together they form the "foundation" of GEO.

1. AI crawler allowlist (robots.txt)

Many sites use a blanket User-agent: * rule to block crawlers, accidentally catching AI crawlers in the net. If AI engines can't crawl your content, they can't cite it.

Major AI crawler User-Agent list

Crawler	User-Agent	Operator	Purpose
GPTBot	`GPTBot`	OpenAI	Trains GPT models
OAI-SearchBot	`OAI-SearchBot`	OpenAI	Provides real-time index for ChatGPT search / citations
ChatGPT-User	`ChatGPT-User`	OpenAI	Used when a user actively accesses a page from within ChatGPT
ClaudeBot	`ClaudeBot`	Anthropic	Trains Claude models
Claude-SearchBot	`Claude-SearchBot`	Anthropic	Claude real-time search
Claude-User	`Claude-User`	Anthropic	Claude user-initiated access
PerplexityBot	`PerplexityBot`	Perplexity	Perplexity search index
Perplexity-User	`Perplexity-User`	Perplexity	Perplexity user access
Google-Extended	`Google-Extended`	Google	Controls whether content is used to train Gemini
Googlebot	`Googlebot`	Google	Google search (including AI Overviews input)
Bingbot	`Bingbot`	Microsoft	Bing search (including Copilot input)
Applebot-Extended	`Applebot-Extended`	Apple	Controls whether content is used to train Apple Intelligence
Bytespider	`Bytespider`	ByteDance	Doubao and other ByteDance AI products
Amazonbot	`Amazonbot`	Amazon	Alexa and other Amazon AI products
Meta-ExternalAgent	`Meta-ExternalAgent`	Meta	Meta AI

Recommended robots.txt template

# Default for all crawlers: block private paths, allow everything else
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /

# Explicitly list major AI crawlers (even if covered above, listing is safer).
# A crawler that matches a named group ignores the * group, so the private
# paths are repeated here (example paths).
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /admin/
Disallow: /api/
Allow: /

Sitemap: https://example.com/sitemap.xml
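
One way to sanity-check a robots.txt before deploying it is to parse it with Python's standard `urllib.robotparser` and ask whether each AI User-Agent may fetch a given URL. The sketch below is only an illustration: `audit_robots` and `AI_AGENTS` are names introduced here, not part of any tool. Note that Python's parser applies rules in file order (first match wins), which can differ from how an individual crawler resolves overlapping Allow / Disallow rules.

```python
from urllib.robotparser import RobotFileParser

# User-Agent tokens taken from the table above (assumed to be current).
AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
             "Claude-SearchBot", "PerplexityBot"]


def audit_robots(robots_txt, url="https://example.com/", agents=AI_AGENTS):
    """Return {agent: bool} saying whether each agent may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# A blanket block catches every AI crawler that has no group of its own.
blanket_block = "User-agent: *\nDisallow: /\n"
for agent, allowed in audit_robots(blanket_block).items():
    print(f"{agent:18} {'allowed' if allowed else 'BLOCKED'}")
```

Running the same function against your real robots.txt text, and against a private path such as /admin/, shows at a glance which bots are caught by a rule you did not intend for them.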

Distinguishing "training" from "real-time retrieval"

Different bots serve different purposes — handle them separately:

Training bots (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended): these decide whether your content becomes part of model weights. Google-Extended and Applebot-Extended are robots.txt control tokens rather than separate crawlers; the fetching itself is done by Googlebot and Applebot. If copyright is a concern, you can selectively Disallow them.
Real-time retrieval bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, ChatGPT-User, Bingbot): these index or fetch your pages so they can be used when AI answers are generated. You should almost never block these, or your content will not appear in AI citations.
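
The split above can be turned into a small generator, sketched here under stated assumptions: the function name `build_robots`, its `block_training` flag, and the sitemap URL are illustrative, and the two bot lists simply mirror the groupings in the text. Consecutive `User-agent` lines share the rules that follow them, so each category needs only one group.

```python
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended"]
RETRIEVAL_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
                  "ChatGPT-User", "Bingbot"]


def build_robots(block_training, sitemap_url="https://example.com/sitemap.xml"):
    """Render a robots.txt that handles training and retrieval bots separately."""
    lines = []
    # Retrieval bots are always allowed so pages remain citable.
    lines += [f"User-agent: {bot}" for bot in RETRIEVAL_BOTS]
    lines += ["Allow: /", ""]
    # Training bots are allowed or blocked depending on the site's policy.
    lines += [f"User-agent: {bot}" for bot in TRAINING_BOTS]
    lines += ["Disallow: /" if block_training else "Allow: /", ""]
    lines += [f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


print(build_robots(block_training=True))
```

Keeping the policy in one flag makes it easy to change your mind later without hand-editing a long list of groups.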

2. The llms.txt standard

llms.txt is a proposal from Jeremy Howard (September 2024): a markdown file placed at the root of your site, serving as an "AI-specific sitemap + content index."

As of 2026, developer-tool sites including Anthropic, Vercel, Cursor, Mintlify, and LangGraph have deployed it. Major LLM providers have not officially committed to consuming this file, but the IDE / agent ecosystem (e.g. Claude Code, Cursor) does use it in practice.

llms.txt vs llms-full.txt

The mainstream practice is to provide both:

File	Content	Purpose
`/llms.txt`	An index: site summary + title, URL, and one-line description for each important page	Helps AI quickly judge which page is relevant before fetching the full content
`/llms-full.txt`	Full text concatenation: all important pages' markdown body packed together	Allows AI to consume the complete knowledge in one go, saving multiple fetches
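
As a rough sketch of the "full text concatenation" idea, the function below joins page bodies into a single markdown document. There is no fixed layout for llms-full.txt that this guide specifies, so the per-page `##` heading and the `Source:` line are assumptions made for illustration, as are the `build_llms_full` name and the shape of the `pages` list.

```python
def build_llms_full(site_name, pages):
    """Concatenate page bodies into one markdown document.

    pages: list of dicts with 'title', 'url', and 'body' (markdown text).
    """
    parts = [f"# {site_name}", ""]
    for page in pages:
        parts += [
            f"## {page['title']}",
            f"Source: {page['url']}",  # lets a reader trace text back to its page
            "",
            page["body"].strip(),
            "",
        ]
    return "\n".join(parts).rstrip() + "\n"


example_pages = [
    {"title": "Intro", "url": "https://example.com/intro/", "body": "Hello.\n"},
    {"title": "Setup", "url": "https://example.com/setup/", "body": "Install it."},
]
print(build_llms_full("Example", example_pages))
```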

llms.txt format specification

Minimum viable structure (per the llmstxt.org spec):

# Site name

> A one-paragraph description of what this site does and its positioning.

Optional additional background.

## Documentation

- [Page title](https://example.com/path/): one-line description of what this page covers
- [Another page](https://example.com/another/): description

## Tools

- [Tool page](https://example.com/tools/x/): description

## Optional

- [Lower-priority page](https://example.com/extra/): description

Level-1 heading (#): site name
Blockquote (>): site positioning / one-line summary
Level-2 headings (##): content groups
List items: each in the form [title](URL): description
## Optional: a conventional special group whose links AI may skip when a shorter context is needed
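
The structure listed above is regular enough to render from data. The following is a minimal sketch, assuming page metadata is already available as tuples; `render_llms_txt` and the shape of `sections` are hypothetical names chosen for this example, not part of the llmstxt.org spec.

```python
def render_llms_txt(site_name, summary, sections):
    """Render an llms.txt index.

    sections: dict mapping a group name to a list of
              (title, url, description) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for group, links in sections.items():
        lines.append(f"## {group}")
        lines.append("")
        for title, url, description in links:
            lines.append(f"- [{title}]({url}): {description}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


print(render_llms_txt(
    "Example",
    "Docs for Example.",
    {
        "Documentation": [("Intro", "https://example.com/intro/", "Getting started")],
        "Optional": [("Changelog", "https://example.com/changelog/", "Release notes")],
    },
))
```

In a real build, the tuples would typically come from each page's frontmatter, so the index is regenerated whenever content is published.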

Deployment tips

Hand-write or script-generate: small volumes can be hand-written; otherwise generate from frontmatter at build time
Keep it fresh: update synchronously every time you publish new content
Coexists with sitemap.xml: doesn't replace it, complements it
Fixed path: serve it at /llms.txt in the site root, which is where tools look for it

This site's setup

geo.fan has both files deployed and can serve as a reference.

3. `/.well-known/ai.txt` (parallel standard)

ai.txt is another proposal that emerged in parallel with llms.txt, placed at /.well-known/ai.txt (following the RFC 8615 well-known URI spec). Its design leans more toward a structured "AI policy declaration" — telling AI crawlers "which data is allowed for which purpose."

The format is more granular than robots.txt:

# /.well-known/ai.txt
User-agent: *
Disallow-Training: /private/
Allow-Search: /
Allow-Summarization: /
Attribution-Required: true
Contact: mail@example.com

As of 2026, this standard is still not officially adopted by mainstream LLM vendors, but some sites (Anthropic, several media outlets) have deployed it. Recommendations:

No conflicts, safe to deploy: cost is very low; can coexist with llms.txt
Don't replace robots.txt: compliance still goes through robots.txt + each vendor's own policies
Don't include sensitive information: this is a public file

4. `sameAs` — getting AI to identify you as a known entity

Not at the robots/llms.txt layer, but a sibling "technical foundation" setting: add sameAs to your site's Organization / Person JSON-LD, linking to external authoritative entity databases.

The most worth adding (the Wikidata link is the most important; JSON does not allow inline comments, so keep notes out of the markup):

"sameAs": [
  "https://www.wikidata.org/wiki/Q...",
  "https://www.linkedin.com/company/...",
  "https://www.crunchbase.com/organization/...",
  "https://github.com/...",
  "https://www.g2.com/products/..."
]

When AI crawlers read sameAs, they can match "the entity that calls itself X on this site" to "the Wikidata entry Q12345" — a critical step in entity resolution. See Content Strategy → sameAs entity linking.
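
For context, a complete Organization block can be produced with Python's `json` module, which guarantees the output is valid JSON. This is a sketch only: the helper names `organization_jsonld` and `as_script_tag` are invented for illustration, and every URL in the usage example (including the Wikidata ID Q12345 borrowed from the sentence above) is a placeholder to replace with your own.

```python
import json


def organization_jsonld(name, url, same_as):
    """Return an Organization JSON-LD document as a JSON string."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": list(same_as),
    }
    return json.dumps(data, indent=2)


def as_script_tag(jsonld):
    """Wrap a JSON-LD string in the script tag used to embed it in HTML."""
    return f'<script type="application/ld+json">\n{jsonld}\n</script>'


# Placeholder values for illustration only.
print(as_script_tag(organization_jsonld(
    "Example Inc.",
    "https://example.com/",
    ["https://www.wikidata.org/wiki/Q12345"],
)))
```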

5. Other technical foundations

SSR is essential

AI crawlers generally don't execute JavaScript, or executing it is far more expensive than fetching static HTML. If key content only appears after JS rendering (a typical CSR SPA), AI sees a blank page.

Doc sites, content sites: go straight to SSG / SSR for stability
Heavy interactive apps: SSR the "content body," enhance the "interaction layer" with JS
Detection: curl -A "GPTBot" https://your-site/page — check whether the returned HTML contains the body
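
The curl check can also be scripted. The sketch below fetches a page with a given User-Agent and tests whether a phrase you expect in the body is present in the raw HTML once `<script>` and `<style>` contents are ignored. All function names are hypothetical, and sending the bare string "GPTBot" is a simplification that mirrors the curl command; real crawlers send a longer User-Agent string.

```python
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text with whitespace collapsed."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.chunks).split())


def fetch_as(url, user_agent="GPTBot", timeout=10):
    """Download the raw HTML, with no JavaScript execution."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def body_is_server_rendered(html, expected_phrase):
    """True if the phrase appears in the HTML without running any scripts."""
    return expected_phrase in visible_text(html)
```

A typical use would be `body_is_server_rendered(fetch_as("https://your-site/page"), "a sentence from the article")`; a False result suggests the content only appears after client-side rendering.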

Content that requires paid access or login can't be fetched by AI engines and naturally won't be cited. If you must have a paywall, at least provide a "summary + key conclusions" free version as a teaser.

Don't accidentally return 4xx/5xx to AI bots

Some CDNs (Cloudflare, Akamai) have default bot-protection rules that flag AI crawlers as malicious traffic and return 403. Check and explicitly allow the User-Agents listed above.
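
A quick way to spot this is to request the same URL with several AI User-Agents and compare status codes. The sketch below uses only the standard library; `probe_status` and `blocked_agents` are names made up for this example. One caveat to keep in mind: some bot-protection products also check the source IP, so a request from your own machine with a borrowed User-Agent may be treated differently from the real crawler.

```python
import urllib.error
import urllib.request


def probe_status(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx / 5xx; the status code is on the exception.
        return error.code


def blocked_agents(statuses):
    """Given {agent: status}, return the agents that received a 4xx or 5xx."""
    return sorted(agent for agent, code in statuses.items() if code >= 400)
```

For example, building `{agent: probe_status("https://your-site/page", agent) for agent in ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]}` and passing it to `blocked_agents` lists the bots your CDN or WAF is rejecting.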

Checklist

Before publishing:

/robots.txt explicitly allows OAI-SearchBot, PerplexityBot, Claude-SearchBot, Bingbot
/llms.txt exists and contains site summary + main page index
/llms-full.txt exists and is current (optional but recommended)
Key pages are SSR / SSG; curl -A "GPTBot" <URL> returns complete HTML
CDN / WAF isn't treating AI bots as attack traffic
sitemap.xml contains all pages that should be discoverable
