Most Publishers Are Invisible to AI Assistants
We audited 103 publisher sites against the Scaletific Agent-Readability Index (SARI). Five fixes lift a typical site by roughly forty points in a single sprint.
Every AI crawler the audit recognises
Each row is sourced from the operator's own documentation. Training crawlers feed model weights. Citation crawlers feed live answers.
| User-Agent | Operator | Type | Function |
|---|---|---|---|
| GPTBot | OpenAI | Training | Builds dataset for future GPT models |
| OAI-SearchBot | OpenAI | Citation | Surfaces sites in ChatGPT search answers |
| ChatGPT-User | OpenAI | User-triggered | Live fetch when a ChatGPT user asks for a URL |
| OAI-AdsBot | OpenAI | Ad validation | Validates ad landing pages submitted to ChatGPT |
| ClaudeBot | Anthropic | Training | Builds dataset for Claude models |
| Claude-User | Anthropic | User-triggered | Live fetch when a Claude user asks a question |
| Claude-SearchBot | Anthropic | Citation | Improves Claude's search result quality |
| PerplexityBot | Perplexity | Citation | Surfaces sites in Perplexity answers |
| Perplexity-User | Perplexity | User-triggered | Live fetch on user request |
| Googlebot | Google | Search index | Powers Google Search; output feeds AI Overviews |
| Google-Extended | Google | Training | Opts the site out of Gemini training and grounding |
| Google-CloudVertexBot | Google | Site-owner-requested | Crawls for site-owner-built Vertex AI Agents |
| Applebot | Apple | Search + AI | Powers Spotlight, Siri, Safari; data may train AI |
| Applebot-Extended | Apple | Training opt-out signal | Metadata-only; does not crawl |
| CCBot | Common Crawl | Open data | Open repository many AI labs use as training input |
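As an illustration of how the taxonomy above could be applied to server logs, here is a minimal Python sketch that maps a User-Agent header to the crawler type listed in the table. The function name `classify_user_agent` and the substring-matching approach are assumptions made for this example, not the audit's own tooling, and the sample header in the usage note is illustrative rather than copied from any operator's documentation.

```python
from typing import Optional

# Crawler tokens and types copied from the table above.
CRAWLER_TYPES = {
    "GPTBot": "Training",
    "OAI-SearchBot": "Citation",
    "ChatGPT-User": "User-triggered",
    "OAI-AdsBot": "Ad validation",
    "ClaudeBot": "Training",
    "Claude-User": "User-triggered",
    "Claude-SearchBot": "Citation",
    "PerplexityBot": "Citation",
    "Perplexity-User": "User-triggered",
    "Googlebot": "Search index",
    "Google-Extended": "Training",
    "Google-CloudVertexBot": "Site-owner-requested",
    "Applebot": "Search + AI",
    "Applebot-Extended": "Training opt-out signal",
    "CCBot": "Open data",
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the crawler type for a User-Agent header, or None if unknown.

    Hypothetical helper: matches tokens as case-insensitive substrings.
    """
    ua = user_agent.lower()
    # Check longer tokens first so "Applebot-Extended" wins over "Applebot".
    for token in sorted(CRAWLER_TYPES, key=len, reverse=True):
        if token.lower() in ua:
            return CRAWLER_TYPES[token]
    return None
```

For example, `classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)")` returns `"Training"`, while an ordinary browser header returns `None`. Tokens such as Google-Extended and Applebot-Extended are robots.txt controls and are unlikely to appear in request headers; they are kept in the map only for completeness.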
What 103 publishers actually publish
Sitemaps are universal. AI bot directives are common. The new agent-discovery files (llms.txt, MCP well-known) are not.
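A check for these files can be sketched in a few lines of Python. The paths `/llms.txt` and `/.well-known/mcp.json` come from the chart; `/robots.txt` is added as the conventional home of bot directives. The helper names `probe_discovery_files` and `http_status`, and the choice to treat only HTTP 200 as "present", are assumptions for this sketch rather than the audit's published method.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict
from urllib.parse import urljoin

# Paths to probe; the last two are the agent-discovery files named in the chart.
DISCOVERY_PATHS = ["/robots.txt", "/llms.txt", "/.well-known/mcp.json"]


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for a HEAD request, or 0 on a network error."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except urllib.error.URLError:
        return 0


def probe_discovery_files(
    base_url: str, fetch_status: Callable[[str], int] = http_status
) -> Dict[str, bool]:
    """Map each discovery path to True if it answers with HTTP 200."""
    results = {}
    for path in DISCOVERY_PATHS:
        # A leading slash makes urljoin resolve against the site root.
        url = urljoin(base_url, path)
        results[path] = fetch_status(url) == 200
    return results
```

Passing the fetcher as a parameter keeps the function testable without network access. Some servers reject HEAD requests, so a real audit would probably fall back to GET and inspect the response body as well.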
Chart labels: /llms.txt, /.well-known/mcp.json
AI policy is performative, not differentiated
Of 103 audited publishers, only the highlighted 8 apply non-identical rules to training and citation crawlers. Most publishers are turning away citation traffic by accident.
8 of 103 differentiate
The same rule that blocks GPTBot also blocks OAI-SearchBot. Publishers with a deliberate split keep citation visibility while opting out of training.
Each square = one publisher. n = 103.
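To make the distinction concrete, the following Python sketch uses the standard library's `urllib.robotparser` to test whether a robots.txt file treats a training crawler differently from the same operator's citation crawler. The two sample files, the `differentiates` helper, and the decision to test only the root path are assumptions for illustration; the audit's actual scoring rules may differ.

```python
from urllib.robotparser import RobotFileParser

# Training/citation pairs from the same operator, taken from the bot table.
CRAWLER_PAIRS = [
    ("GPTBot", "OAI-SearchBot"),
    ("ClaudeBot", "Claude-SearchBot"),
]

# Hypothetical robots.txt that blocks both OpenAI crawlers with one rule.
BLANKET = """\
User-agent: GPTBot
User-agent: OAI-SearchBot
Disallow: /
"""

# Hypothetical robots.txt that opts out of training but keeps citation access.
SPLIT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""


def is_allowed(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


def differentiates(robots_txt: str) -> bool:
    """True if any training/citation pair gets a different verdict at the root."""
    return any(
        is_allowed(robots_txt, training) != is_allowed(robots_txt, citation)
        for training, citation in CRAWLER_PAIRS
    )
```

Under these assumptions, `differentiates(BLANKET)` is `False` and `differentiates(SPLIT)` is `True`. Checking only `/` is a simplification: a publisher could differentiate on specific sections, which this sketch would miss.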
The publishing-group story
Median SARI score by cohort. The culture-entertainment cohort leads at 69; the platform cohort sits at 20. A higher score means a more agent-legible site.
Top 10 and Bottom 10
Range: 15 to 81. Median: 57. Five of the top ten are Vox Media properties; the bottom mixes platform-constrained publishers, paywalled titans, and design-led indies that crowd out structured data.
↑ Top performers
↓ Underperformers
+42 points, one sprint
Median SARI is 57 of 100 across the audited cohort. Five fixes lift a typical site to 99. Ranked by point recovery.
Infographic Sources
The bot taxonomy is sourced verbatim from each operator's official documentation. The audit dataset (103 publisher sites, 270 sampled articles) is published in full alongside the companion article.
Bot taxonomy: GPTBot, OAI-SearchBot, ChatGPT-User, OAI-AdsBot.
Bot taxonomy: ClaudeBot, Claude-User, Claude-SearchBot. Confirms Claude-Web is no longer documented.
Bot taxonomy: PerplexityBot, Perplexity-User.
Bot taxonomy: Googlebot, Google-Extended, Google-CloudVertexBot.
Bot taxonomy: Applebot, Applebot-Extended.
Bot taxonomy: CCBot.
EU AI Act dates: 2 August 2025 GPAI obligations, 2 August 2026 Article 50 transparency rules.