llms.txt and AI Crawler Configuration
Make your content discoverable, crawlable, and citable by AI engines — guide to the llms.txt standard and robots.txt configuration.
TL;DR: To be cited by AI engines, two things must be in place first — AI crawlers must not be blocked by your robots.txt, and you must publish llms.txt / llms-full.txt so AI can quickly understand your site structure. Both are technical configurations, each about half a day's work, and they form the "foundation" of GEO.
1. AI crawler allowlist (robots.txt)
Many sites use a blanket User-agent: * rule to block crawlers, accidentally catching AI crawlers in the net. If AI engines can't crawl your content, they can't cite it.
Major AI crawler User-Agent list
| Crawler | User-Agent | Operator | Purpose |
|---|---|---|---|
| GPTBot | GPTBot | OpenAI | Trains GPT models |
| OAI-SearchBot | OAI-SearchBot | OpenAI | Provides real-time index for ChatGPT search / citations |
| ChatGPT-User | ChatGPT-User | OpenAI | Used when a user actively accesses a page from within ChatGPT |
| ClaudeBot | ClaudeBot | Anthropic | Trains Claude models |
| Claude-SearchBot | Claude-SearchBot | Anthropic | Claude real-time search |
| Claude-User | Claude-User | Anthropic | Claude user-initiated access |
| PerplexityBot | PerplexityBot | Perplexity | Perplexity search index |
| Perplexity-User | Perplexity-User | Perplexity | Perplexity user access |
| Google-Extended | Google-Extended | Google | Controls whether content is used to train Gemini |
| Googlebot | Googlebot | Google | Google search (including AI Overviews input) |
| Bingbot | Bingbot | Microsoft | Bing search (including Copilot input) |
| Applebot-Extended | Applebot-Extended | Apple | Controls whether content is used to train Apple Intelligence |
| Bytespider | Bytespider | ByteDance | Doubao / ByteDance-system AI |
| Amazonbot | Amazonbot | Amazon | Alexa / Amazon-system AI |
| Meta-ExternalAgent | Meta-ExternalAgent | Meta | Meta AI |
Recommended robots.txt template
# Public content allowed for all crawlers by default
User-agent: *
Allow: /
# Explicitly allow major AI crawlers (even if covered above, listing is safer)
User-agent: GPTBot
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
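# Block paths that should not be crawled (example)
# Note: a crawler that matches one of the named groups above follows only that group,
# so repeat these Disallow lines there if they should also apply to the AI crawlers.
User-agent: *
Disallow: /admin/
Disallow: /api/
Sitemap: https://example.com/sitemap.xml
Distinguishing "training" from "real-time retrieval"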
Different bots serve different purposes — handle them separately:
- Training bots (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended): these decide whether your content becomes part of model weights. If copyright is a concern, you can selectively Disallow them.
- Real-time retrieval bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, ChatGPT-User, Bingbot): these crawl your pages when AI answers are generated. You should almost never block them, or your content will never appear in AI citations.
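The split above can be checked mechanically. The following is a minimal sketch using Python's standard-library robots.txt parser; the function names and the sample policy are illustrative assumptions, and the parser's matching rules may differ in detail from what each crawler actually implements, so treat the result as a first-pass check only.

```python
from urllib.robotparser import RobotFileParser

# Grouping taken from the list above; adjust to your own policy.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended"]
RETRIEVAL_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "ChatGPT-User", "Bingbot"]


def audit_robots(robots_txt: str, url: str) -> dict[str, bool]:
    """Return {user_agent: allowed} for every bot in both groups."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in TRAINING_BOTS + RETRIEVAL_BOTS}


def blocked_retrieval_bots(robots_txt: str, url: str) -> list[str]:
    """Return the real-time retrieval bots that may not fetch the URL."""
    result = audit_robots(robots_txt, url)
    return [bot for bot in RETRIEVAL_BOTS if not result[bot]]


# Hypothetical policy: opt out of GPTBot training, keep retrieval open.
SAMPLE = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /
"""

if __name__ == "__main__":
    for bot, allowed in audit_robots(SAMPLE, "https://example.com/docs/").items():
        print(f"{bot:20} {'allowed' if allowed else 'BLOCKED'}")
```

An empty list from blocked_retrieval_bots for your key URLs suggests the retrieval bots are not caught by a blanket rule.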
2. The llms.txt standard
llms.txt is a proposal from Jeremy Howard (September 2024): a markdown file placed at the root of your site, serving as an "AI-specific sitemap + content index."
By 2026, developer-tool sites including Anthropic, Vercel, Cursor, Mintlify, and LangGraph have all deployed it. Major LLM providers haven't officially committed to consuming this file yet, but the IDE / agent ecosystem (e.g. Claude Code, Cursor) actually uses it in practice.
llms.txt vs llms-full.txt
The mainstream practice is to provide both:
| File | Content | Purpose |
|---|---|---|
| /llms.txt | An index: site summary + title, URL, and one-line description for each important page | Helps AI quickly judge which page is relevant before fetching the full content |
| /llms-full.txt | Full text concatenation: all important pages' markdown body packed together | Allows AI to consume the complete knowledge in one go, saving multiple fetches |
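One way to produce the concatenated file is sketched below. The Page type, the per-page "##" heading, and the "Source:" line are assumptions made for illustration; the source does not prescribe a layout for llms-full.txt.

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    url: str
    markdown: str  # page body, already converted to markdown


def build_llms_full(site_name: str, pages: list[Page]) -> str:
    """Concatenate page bodies into one document, one section per page."""
    parts = [f"# {site_name}"]
    for page in pages:
        # Assumed layout: heading, source URL, then the body.
        parts.append(f"## {page.title}\n\nSource: {page.url}\n\n{page.markdown.strip()}")
    return "\n\n".join(parts) + "\n"
```

Keeping the source URL next to each body lets a reader of the concatenated file trace a passage back to its page.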
llms.txt format specification
Minimum viable structure (per the llmstxt.org spec):
# Site name
> A one-paragraph description of what this site does and its positioning.
Optional additional background.
## Documentation
- [Page title](https://example.com/path/): one-line description of what this page covers
- [Another page](https://example.com/another/): description
## Tools
- [Tool page](https://example.com/tools/x/): description
## Optional
- [Lower-priority page](https://example.com/extra/): description

In this structure:
- Level-1 heading (#): site name
- Blockquote (>): site positioning / one-line summary
- Level-2 headings (##): content groups
- List items: each in the form [title](URL): description
- ## Optional is a conventional special group that AI may skip
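The structure listed above lends itself to a quick lint step. The sketch below is an assumption-laden checker, not an official validator: it mirrors only the elements named in this section (which may be stricter than the spec itself), and the function name and messages are made up for illustration.

```python
import re

# "- [title](URL)" with an optional ": description" suffix.
LINK_ITEM = re.compile(r"^- \[[^\]]+\]\([^)\s]+\)(: .+)?$")


def check_llms_txt(text: str) -> list[str]:
    """Return a list of structural problems; an empty list means none were found."""
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    problems = []
    if not lines or not lines[0].startswith("# "):
        problems.append("first non-empty line should be a level-1 heading")
    if not any(line.startswith("> ") for line in lines):
        problems.append("missing blockquote summary")
    if not any(line.startswith("## ") for line in lines):
        problems.append("no level-2 section headings")
    in_section = False
    for line in lines:
        if line.startswith("## "):
            in_section = True
        elif in_section and line.startswith("- ") and not LINK_ITEM.match(line):
            problems.append(f"malformed list item: {line}")
    return problems
```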
Deployment tips
- Hand-write or script-generate: small volumes can be hand-written; otherwise generate from frontmatter at build time
- Keep it fresh: update synchronously every time you publish new content
- Coexists with sitemap.xml: doesn't replace it, complements it
- Fixed path: must be at /llms.txt, not elsewhere
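For the script-generated route, a build step could render the index from page metadata. This is a minimal sketch: the function name and the (title, url, description) tuple shape are assumptions, and how you extract them from frontmatter depends on your site generator.

```python
def build_llms_txt(
    site_name: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render an llms.txt index; each section maps to (title, url, description) entries."""
    lines = [f"# {site_name}", "", f"> {summary}"]
    for heading, pages in sections.items():
        lines += ["", f"## {heading}", ""]
        lines += [f"- [{title}]({url}): {description}" for title, url, description in pages]
    return "\n".join(lines) + "\n"
```

Writing the returned string to llms.txt in the site's output root on every build keeps the index in step with newly published pages.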
This site's setup
geo.fan has both files deployed, at /llms.txt and /llms-full.txt, and can serve as a reference.
3. /.well-known/ai.txt (parallel standard)
ai.txt is another proposal that emerged in parallel with llms.txt, placed at /.well-known/ai.txt (following the RFC 8615 well-known URI spec). Its design leans more toward a structured "AI policy declaration" — telling AI crawlers "which data is allowed for which purpose."
The format is more granular than robots.txt:
# /.well-known/ai.txt
User-agent: *
Disallow-Training: /private/
Allow-Search: /
Allow-Summarization: /
Attribution-Required: true
Contact: mail@example.com
As of 2026, this standard is still not officially adopted by mainstream LLM vendors, but some sites (Anthropic, several media outlets) have deployed it. Recommendations:
- No conflicts, safe to deploy: cost is very low; can coexist with llms.txt
- Don't replace robots.txt: compliance still goes through robots.txt + each vendor's own policies
- Don't include sensitive information: this is a public file
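To sanity-check a deployed file, the example format above can be read with a few lines of code. This is only a sketch: the directive names come from the example in this section, not from a ratified spec, and the grouping-by-User-agent behavior is an assumption borrowed from robots.txt.

```python
def parse_ai_txt(text: str) -> dict[str, list[tuple[str, str]]]:
    """Group 'Key: value' directives under the User-agent line that precedes them."""
    groups: dict[str, list[tuple[str, str]]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower() == "user-agent":
            current = value
            groups.setdefault(current, [])
        elif current is not None:
            groups[current].append((key, value))
    return groups
```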
4. sameAs — getting AI to identify you as a known entity
This is not part of the robots.txt / llms.txt layer, but it is a sibling "technical foundation" setting: add sameAs to your site's Organization / Person JSON-LD, linking to external authoritative entity databases.
The links most worth adding:
"sameAs": [
"https://www.wikidata.org/wiki/Q...", // most important
"https://www.linkedin.com/company/...",
"https://www.crunchbase.com/organization/...",
"https://github.com/...",
"https://www.g2.com/products/..."
]
When AI crawlers read sameAs, they can match "the entity that calls itself X on this site" to "the Wikidata entry Q12345", a critical step in entity resolution. See Content Strategy → sameAs entity linking.
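If the JSON-LD is generated at build time, the sameAs list can be injected from configuration. The sketch below assumes a schema.org Organization with only name, url, and sameAs; the function name is hypothetical and the URLs you pass in are your own.

```python
import json


def organization_jsonld(name: str, url: str, same_as: list[str]) -> str:
    """Render a schema.org Organization as a JSON-LD script tag."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": same_as,
    }
    return '<script type="application/ld+json">' + json.dumps(data, indent=2) + "</script>"


if __name__ == "__main__":
    # Placeholder entity URL for illustration only; substitute your real profiles.
    print(organization_jsonld("Example Co", "https://example.com/", ["https://www.wikidata.org/wiki/Q0"]))
```

Going through json.dumps rather than hand-writing the block also avoids invalid JSON such as inline comments or trailing commas.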
5. Other technical foundations
SSR is essential
Most AI crawlers don't execute JavaScript, and where they can, rendering is far more expensive than fetching static HTML. If key content only appears after client-side rendering (a typical CSR SPA), the crawler sees an essentially blank page.
- Doc sites, content sites: go straight to SSG / SSR for stability
- Heavy interactive apps: SSR the "content body," enhance the "interaction layer" with JS
- Detection: run curl -A "GPTBot" https://your-site/page and check whether the returned HTML contains the body
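The curl check can be scripted across many pages. The following sketch fetches raw HTML without running JavaScript and reports which expected phrases are absent. The function names are hypothetical, and the short "GPTBot" string simply mirrors the curl command above; real crawlers send a longer User-Agent.

```python
import urllib.request


def fetch_as_bot(url: str, user_agent: str = "GPTBot", timeout: float = 10.0) -> str:
    """Fetch raw HTML without running JavaScript, sending the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def missing_phrases(html: str, phrases: list[str]) -> list[str]:
    """Return the phrases that do not appear in the raw HTML (case-insensitive)."""
    lowered = html.lower()
    return [phrase for phrase in phrases if phrase.lower() not in lowered]
```

For example, missing_phrases(fetch_as_bot(url), ["a sentence from the article body"]) returning a non-empty list suggests the body is only rendered client-side.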
Don't hide content behind a login wall
Content that requires paid access or login can't be fetched by AI engines and naturally won't be cited. If you must have a paywall, at least provide a "summary + key conclusions" free version as a teaser.
Don't accidentally return 4xx/5xx to AI bots
Some CDNs (Cloudflare, Akamai) have default bot-protection rules that flag AI crawlers as malicious traffic and return 403. Check and explicitly allow the User-Agents listed above.
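A rough way to spot such blocks is to request the same URL under several User-Agents and compare status codes. This is a sketch under assumptions: the agent list is a subset of the table above, the function names are hypothetical, and because bot protection may also look at the source IP, a request from your own machine is only an approximation of what the real crawler sees.

```python
import urllib.error
import urllib.request
from typing import Callable

AI_USER_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]


def http_status(url: str, user_agent: str) -> int:
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx / 5xx responses arrive as exceptions


def blocked_agents(
    url: str,
    agents: list[str],
    get_status: Callable[[str, str], int] = http_status,
) -> dict[str, int]:
    """Map each User-Agent that received a 4xx/5xx status to that status."""
    statuses = {agent: get_status(url, agent) for agent in agents}
    return {agent: code for agent, code in statuses.items() if code >= 400}
```

The get_status parameter exists so the logic can be exercised without network access; in normal use the default is enough.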
Checklist
Before publishing:
- /robots.txt explicitly allows OAI-SearchBot, PerplexityBot, Claude-SearchBot, Bingbot
- /llms.txt exists and contains site summary + main page index
- /llms-full.txt exists and is current (optional but recommended)
- Key pages are SSR / SSG; curl -A "GPTBot" <URL> returns complete HTML
- CDN / WAF isn't treating AI bots as attack traffic
- sitemap.xml contains all pages that should be discoverable
Related reading
- GEO Checker
- Content Strategy Best Practices
- External spec: llmstxt.org