GEO.Fan
Technical Setup

llms.txt and AI Crawler Configuration

Make your content discoverable, crawlable, and citable by AI engines: a guide to the llms.txt standard and robots.txt configuration.

Machine translation

This page was machine-translated from the Chinese original. Please report inaccuracies to mail@geo.fan.

TL;DR: To be cited by AI engines, two things must be in place first — AI crawlers must not be blocked by your robots.txt, and you must publish llms.txt / llms-full.txt so AI can quickly understand your site structure. Both are technical configurations, each about half a day's work, and they form the "foundation" of GEO.

1. AI crawler allowlist (robots.txt)

Many sites block crawlers with a blanket User-agent: * / Disallow: / rule, which catches AI crawlers as well. If AI engines can't crawl your content, they can't cite it.

Major AI crawler User-Agent list

| Crawler (User-Agent) | Operator | Purpose |
|---|---|---|
| GPTBot | OpenAI | Trains GPT models |
| OAI-SearchBot | OpenAI | Provides real-time index for ChatGPT search / citations |
| ChatGPT-User | OpenAI | Used when a user actively accesses a page from within ChatGPT |
| ClaudeBot | Anthropic | Trains Claude models |
| Claude-SearchBot | Anthropic | Claude real-time search |
| Claude-User | Anthropic | Claude user-initiated access |
| PerplexityBot | Perplexity | Perplexity search index |
| Perplexity-User | Perplexity | Perplexity user access |
| Google-Extended | Google | Controls whether content is used to train Gemini |
| Googlebot | Google | Google search (including AI Overviews input) |
| Bingbot | Microsoft | Bing search (including Copilot input) |
| Applebot-Extended | Apple | Controls whether content is used to train Apple Intelligence |
| Bytespider | ByteDance | Doubao / ByteDance-system AI |
| Amazonbot | Amazon | Alexa / Amazon-system AI |
| Meta-ExternalAgent | Meta | Meta AI |

Example robots.txt that allows these crawlers:

# Public content is open to all crawlers by default; only private paths are blocked (example paths)
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/

# Explicitly list major AI crawlers (even if covered above, listing is safer).
# A crawler that matches a named group ignores the * group, so the
# private-path rules are repeated here.
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: Applebot-Extended
Allow: /
Disallow: /admin/
Disallow: /api/

Sitemap: https://example.com/sitemap.xml
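
One way to confirm that a robots.txt does not block the crawlers you care about is to run it through a parser. The sketch below uses Python's standard urllib.robotparser; the function name blocked_agents, the agent list, and the sample rules are illustrative assumptions, not part of any crawler's official tooling.

```python
from urllib.robotparser import RobotFileParser

# Retrieval-oriented crawlers from the table; adjust to the bots you care about.
AI_AGENTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot", "PerplexityBot", "Bingbot"]


def blocked_agents(robots_txt: str, url: str, agents=AI_AGENTS) -> list[str]:
    """Return the agents that this robots.txt forbids from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # Hypothetical robots.txt: a blanket block that exempts only GPTBot.
    sample = "User-agent: *\nDisallow: /\n\nUser-agent: GPTBot\nAllow: /\n"
    print(blocked_agents(sample, "https://example.com/docs/"))
```

Python's parser applies rules in file order, so its verdict may differ from a crawler's own evaluation when Allow and Disallow rules overlap. Treat the result as a quick sanity check rather than proof of how a given bot behaves.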

Distinguishing "training" from "real-time retrieval"

Different bots serve different purposes — handle them separately:

  • Training bots (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended): these decide whether your content becomes part of model weights. If copyright is a concern, you can selectively Disallow them.
  • Real-time retrieval bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, ChatGPT-User, Bingbot): these index or fetch your pages so that AI answers can draw on and cite them. You should almost never block these; if you do, your content is unlikely to appear in AI citations.
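
The split above can be turned into a small generator that always allows retrieval bots and allows or blocks training bots according to your policy. This is a minimal sketch: the function names and the allow_training flag are hypothetical, and the bot lists simply mirror the two bullets.

```python
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended"]
RETRIEVAL_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "ChatGPT-User", "Bingbot"]


def group(agents, rule):
    """Render one robots.txt group: several User-agent lines sharing one rule."""
    lines = [f"User-agent: {agent}" for agent in agents]
    return "\n".join(lines + [rule])


def build_robots_txt(allow_training: bool) -> str:
    """Always allow retrieval bots; allow or block training bots by policy."""
    training_rule = "Allow: /" if allow_training else "Disallow: /"
    return "\n\n".join([
        group(RETRIEVAL_BOTS, "Allow: /"),
        group(TRAINING_BOTS, training_rule),
    ]) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(allow_training=False))
```

Keeping the two lists in one place makes the policy decision a single flag instead of a set of hand-edited groups.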

2. The llms.txt standard

llms.txt is a proposal from Jeremy Howard (September 2024): a markdown file placed at the root of your site, serving as an "AI-specific sitemap + content index."

By 2026, developer-tool sites including Anthropic, Vercel, Cursor, Mintlify, and LangGraph have all deployed it. Major LLM providers haven't officially committed to consuming this file yet, but the IDE / agent ecosystem (e.g. Claude Code, Cursor) actually uses it in practice.

llms.txt vs llms-full.txt

The mainstream practice is to provide both:

| File | Content | Purpose |
|---|---|---|
| /llms.txt | An index: site summary + title, URL, and one-line description for each important page | Helps AI quickly judge which page is relevant before fetching the full content |
| /llms-full.txt | Full text concatenation: all important pages' markdown body packed together | Allows AI to consume the complete knowledge in one go, saving multiple fetches |

llms.txt format specification

Typical structure (per the llmstxt.org spec, only the level-1 heading is strictly required):

# Site name

> A one-paragraph description of what this site does and its positioning.

Optional additional background.

## Documentation

- [Page title](https://example.com/path/): one-line description of what this page covers
- [Another page](https://example.com/another/): description

## Tools

- [Tool page](https://example.com/tools/x/): description

## Optional

- [Lower-priority page](https://example.com/extra/): description

Key elements:

  • Level-1 heading (#): site name
  • Blockquote (>): site positioning / one-line summary
  • Level-2 headings (##): content groups
  • List items: each in the form [title](URL): description
  • ## Optional: a special group whose links AI may skip when it needs a shorter context
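
To check that a file follows this structure, a small parser can help. The sketch below is an assumption-laden reading of the format described here (one H1, one blockquote, H2 groups of link items), not a reference implementation of the spec; parse_llms_txt and LINK_ITEM are hypothetical names.

```python
import re

# One list item: "- [title](url)" with an optional ": description" suffix.
LINK_ITEM = re.compile(
    r"^- \[(?P<title>[^\]]+)\]\((?P<url>https?://[^)\s]+)\)(?::\s*(?P<desc>.+))?$"
)


def parse_llms_txt(text: str) -> dict:
    """Parse the H1, blockquote summary, and H2-grouped link lists."""
    result = {"title": None, "summary": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and result["title"] is None:
            result["title"] = line[2:].strip()
        elif line.startswith("> ") and result["summary"] is None:
            result["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            result["sections"][current] = []
        elif current is not None and (m := LINK_ITEM.match(line)):
            result["sections"][current].append((m["title"], m["url"], m["desc"]))
    if result["title"] is None:
        raise ValueError("llms.txt needs a level-1 heading with the site name")
    return result
```

Running this in CI against the generated file catches a missing heading or a malformed link line before the file ships.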

Deployment tips

  1. Hand-write or script-generate: small volumes can be hand-written; otherwise generate from frontmatter at build time
  2. Keep it fresh: update synchronously every time you publish new content
  3. Coexists with sitemap.xml: doesn't replace it, complements it
  4. Fixed path: serve the file at /llms.txt in the site root, which is where consumers look for it
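
For the script-generated route, a build step might look like the sketch below. It assumes page metadata has already been read from frontmatter into Page objects; the Page fields, both function names, and the "Source:" line in llms-full.txt are illustrative choices, since llms-full.txt has no fixed layout.

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    url: str
    description: str
    section: str  # H2 group in llms.txt, e.g. "Documentation"
    body: str     # Markdown body, used only for llms-full.txt


def build_llms_txt(site: str, summary: str, pages: list[Page]) -> str:
    """Render the index: H1, blockquote summary, then one H2 list per section."""
    sections: dict[str, list[Page]] = {}
    for page in pages:
        sections.setdefault(page.section, []).append(page)
    parts = [f"# {site}", f"> {summary}"]
    for name, group in sections.items():
        items = "\n".join(f"- [{p.title}]({p.url}): {p.description}" for p in group)
        parts.append(f"## {name}\n\n{items}")
    return "\n\n".join(parts) + "\n"


def build_llms_full_txt(site: str, pages: list[Page]) -> str:
    """Concatenate every page body under its own heading."""
    parts = [f"# {site}"]
    parts += [f"## {p.title}\n\nSource: {p.url}\n\n{p.body.strip()}" for p in pages]
    return "\n\n".join(parts) + "\n"
```

Generating both files from the same page list at build time keeps them in step with each publish, which covers the "keep it fresh" tip without a manual update.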

This site's setup

geo.fan has both files deployed, at /llms.txt and /llms-full.txt, and can serve as a reference.

3. /.well-known/ai.txt (parallel standard)

ai.txt is another proposal that emerged in parallel with llms.txt, placed at /.well-known/ai.txt (following the RFC 8615 well-known URI spec). Its design leans more toward a structured "AI policy declaration" — telling AI crawlers "which data is allowed for which purpose."

The format is more granular than robots.txt:

# /.well-known/ai.txt
User-agent: *
Disallow-Training: /private/
Allow-Search: /
Allow-Summarization: /
Attribution-Required: true
Contact: mail@example.com

As of 2026, this standard is still not officially adopted by mainstream LLM vendors, but some sites (Anthropic, several media outlets) have deployed it. Recommendations:

  • No conflicts, safe to deploy: cost is very low; can coexist with llms.txt
  • Don't replace robots.txt: compliance still goes through robots.txt + each vendor's own policies
  • Don't include sensitive information: this is a public file

4. sameAs — getting AI to identify you as a known entity

This setting does not live in robots.txt or llms.txt, but it belongs to the same technical foundation: add sameAs to your site's Organization / Person JSON-LD, linking to external authoritative entity databases.

The most valuable links to add, with Wikidata the most important:

"sameAs": [
  "https://www.wikidata.org/wiki/Q...",
  "https://www.linkedin.com/company/...",
  "https://www.crunchbase.com/organization/...",
  "https://github.com/...",
  "https://www.g2.com/products/..."
]

When AI crawlers read sameAs, they can match "the entity that calls itself X on this site" to "the Wikidata entry Q12345" — a critical step in entity resolution. See Content Strategy → sameAs entity linking.
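
As a sketch of how the sameAs list fits into a complete Organization block, the helper below builds the JSON-LD and wraps it in a script tag for the page head. The function name and its arguments are hypothetical; only the schema.org keys (@context, @type, name, url, sameAs) are standard.

```python
import json


def organization_jsonld(name: str, url: str, same_as: list[str]) -> str:
    """Return a <script> tag holding schema.org Organization JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": same_as,
    }
    # Escape "</" so a stray "</script>" inside a value cannot close the tag;
    # "\/" is a valid JSON escape for "/".
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    # Placeholder entity; Q12345 stands in for your real Wikidata ID.
    print(organization_jsonld(
        "Example Inc.",
        "https://example.com",
        ["https://www.wikidata.org/wiki/Q12345"],
    ))
```

Building the block with json.dumps rather than by hand avoids invalid JSON, such as trailing commas or inline comments, which parsers reject.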

5. Other technical foundations

SSR is essential

Most AI crawlers don't execute JavaScript, and those that can render it do so at far higher cost than fetching static HTML. If key content only appears after JS rendering (a typical CSR SPA), the crawler sees an empty shell.

  • Doc sites, content sites: go straight to SSG / SSR for stability
  • Heavy interactive apps: SSR the "content body," enhance the "interaction layer" with JS
  • Detection: curl -A "GPTBot" https://your-site/page — check whether the returned HTML contains the body
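
The curl check can be automated. The sketch below fetches a page with a bot-style User-Agent and reports which expected phrases are missing from the text outside script and style tags; all function names are hypothetical, and sending the bare token "GPTBot" is a simplification of the longer User-Agent strings real crawlers send.

```python
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text that appears outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return whitespace-normalized text from the raw, unrendered HTML."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def missing_phrases(html: str, phrases: list[str]) -> list[str]:
    """List the expected phrases that do not appear in the static HTML."""
    text = visible_text(html)
    return [phrase for phrase in phrases if phrase not in text]


def fetch_as(url: str, user_agent: str = "GPTBot") -> str:
    """Fetch a page without executing JavaScript, as a simple crawler would."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

A call such as missing_phrases(fetch_as(url), ["a sentence from the article body"]) returning a non-empty list suggests that the content is only produced client-side.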

Don't hide content behind a login wall

Content that requires paid access or login can't be fetched by AI engines and naturally won't be cited. If you must have a paywall, at least provide a "summary + key conclusions" free version as a teaser.

Don't accidentally return 4xx/5xx to AI bots

Some CDNs (Cloudflare, Akamai) have default bot-protection rules that flag AI crawlers as malicious traffic and return 403. Check and explicitly allow the User-Agents listed above.
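
A rough way to spot this is to request the same URL with several AI User-Agents and compare status codes. The sketch below is only an approximation: the names are hypothetical, and because the requests come from your own IP with a spoofed User-Agent, a CDN that verifies crawler IP ranges may answer differently than it would to the real bot.

```python
import urllib.error
import urllib.request

AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot", "PerplexityBot"]


def http_status(url: str, user_agent: str) -> int:
    """Return the HTTP status a plain GET receives with this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as exceptions; report their code instead.
        return error.code


def rejected_agents(url, agents=AI_AGENTS, get_status=http_status):
    """Map each agent that received a 4xx/5xx response to its status code."""
    statuses = {agent: get_status(url, agent) for agent in agents}
    return {agent: code for agent, code in statuses.items() if code >= 400}
```

The get_status parameter exists so the check can be exercised with a stub in tests; in normal use, rejected_agents("https://your-site/page") returning an empty dict means none of the listed agents was refused.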

Checklist

Before publishing:

  • /robots.txt explicitly allows OAI-SearchBot, PerplexityBot, Claude-SearchBot, Bingbot
  • /llms.txt exists and contains site summary + main page index
  • /llms-full.txt exists and is current (optional but recommended)
  • Key pages are SSR / SSG; curl -A "GPTBot" <URL> returns complete HTML
  • CDN / WAF isn't treating AI bots as attack traffic
  • sitemap.xml contains all pages that should be discoverable

© 2026 GEO.Fan