AUTOMATIONSWITCH
Original Research · 103 Publisher Sites

Most Publishers Are Invisible to AI Assistants

We audited 103 publisher sites against the Scaletific Agent-Readability Index. Five fixes lift any site by roughly forty points in a single sprint.

6 of 103 publishers have an llms.txt file
49.9 of 100: mean SARI score across the cohort
8 of 75 AI bot directives distinguish training from citation

Every AI crawler the audit recognises

Each row is sourced from the operator's own documentation. Training crawlers feed model weights. Citation crawlers feed live answers.

User-Agent | Operator | Type | Function
GPTBot | OpenAI | Training | Builds dataset for future GPT models
OAI-SearchBot | OpenAI | Citation | Surfaces sites in ChatGPT search answers
ChatGPT-User | OpenAI | User-triggered | Live fetch when a ChatGPT user asks for a URL
OAI-AdsBot | OpenAI | Ad validation | Validates ad landing pages submitted to ChatGPT
ClaudeBot | Anthropic | Training | Builds dataset for Claude models
Claude-User | Anthropic | User-triggered | Live fetch when a Claude user asks a question
Claude-SearchBot | Anthropic | Citation | Improves Claude's search result quality
PerplexityBot | Perplexity | Citation | Surfaces sites in Perplexity answers
Perplexity-User | Perplexity | User-triggered | Live fetch on user request
Googlebot | Google | Search index | Powers Google Search; output feeds AI Overviews
Google-Extended | Google | Training | Opts the site out of Gemini training and grounding
Google-CloudVertexBot | Google | Site-owner-requested | Crawls for site-owner-built Vertex AI Agents
Applebot | Apple | Search + AI | Powers Spotlight, Siri, Safari; data may train AI
Applebot-Extended | Apple | Training opt-out signal | Metadata-only; does not crawl
CCBot | Common Crawl | Open data | Open repository many AI labs use as training input

What 103 publishers actually publish

Sitemaps are universal. AI bot directives are common. The new agent-discovery files (llms.txt, MCP well-known) are not.

Sitemap: 97.1%
Any AI bot directive: 72.8%
/llms.txt: 5.8%
/.well-known/mcp.json: 4.9%
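A presence check like the one behind these percentages can be sketched in a few lines. This is a minimal illustration, not the audit's own tooling: the function names, the `discovery-probe-sketch` User-Agent string, and the `/sitemap.xml` default path are assumptions (a sitemap can also be declared through a `Sitemap:` line in robots.txt, which this sketch does not read).

```python
import urllib.error
import urllib.request
from typing import Callable, Dict
from urllib.parse import urljoin

# Paths named in the section above. /sitemap.xml is an assumed default location.
DISCOVERY_PATHS = {
    "sitemap": "/sitemap.xml",
    "llms_txt": "/llms.txt",
    "mcp_well_known": "/.well-known/mcp.json",
}


def http_status(url: str, timeout: float = 10.0) -> int:
    """Send a HEAD request and return the status code (0 on network failure)."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "discovery-probe-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except (urllib.error.URLError, TimeoutError):
        return 0


def probe_discovery_files(
    origin: str, fetch_status: Callable[[str], int] = http_status
) -> Dict[str, bool]:
    """Report which discovery files answer with HTTP 200 for one origin."""
    return {
        name: fetch_status(urljoin(origin, path)) == 200
        for name, path in DISCOVERY_PATHS.items()
    }
```

Passing the fetcher as a parameter keeps the check testable offline. Calling `probe_discovery_files("https://example.com")` with the default fetcher performs three live HEAD requests.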

AI policy is performative, not differentiated

Of 103 audited publishers, only the highlighted 8 apply non-identical rules to training and citation crawlers. Most publishers are turning away citation traffic by accident.

8 of 103 differentiate

The same rule that blocks GPTBot also blocks OAI-SearchBot. Publishers with a deliberate split keep citation visibility while opting out of training.

Differentiated (8): Training and citation crawlers receive non-identical rules
Blanket / partial (67): Same directive applied across every AI crawler addressed
Silent (28): No AI bot directive in robots.txt

Each square = one publisher. n = 103.
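The three categories can be approximated with Python's standard `urllib.robotparser`. This is a rough sketch under stated assumptions, not the audit's scoring code: the bot lists are a subset of the taxonomy above, only the root path is probed, and "differentiated" is simplified to mean that at least one named training bot and one named citation bot receive different allow/deny verdicts.

```python
import re
from urllib.robotparser import RobotFileParser

# Subsets of the taxonomy above, chosen for illustration.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]
CITATION_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]

USER_AGENT_LINE = re.compile(r"\s*user-agent\s*:\s*([^#\s]+)", re.IGNORECASE)


def named_agents(robots_txt: str) -> set:
    """Collect the user-agent tokens a robots.txt names explicitly, lowercased."""
    agents = set()
    for line in robots_txt.splitlines():
        match = USER_AGENT_LINE.match(line)
        if match:
            agents.add(match.group(1).lower())
    return agents


def classify_policy(robots_txt: str, probe_path: str = "/") -> str:
    """Return 'silent', 'blanket_or_partial', or 'differentiated'."""
    agents = named_agents(robots_txt)
    training = [bot for bot in TRAINING_BOTS if bot.lower() in agents]
    citation = [bot for bot in CITATION_BOTS if bot.lower() in agents]
    if not training and not citation:
        return "silent"

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    training_verdicts = {parser.can_fetch(bot, probe_path) for bot in training}
    citation_verdicts = {parser.can_fetch(bot, probe_path) for bot in citation}

    # Simplification: both groups must be addressed and must be treated differently.
    if training and citation and training_verdicts != citation_verdicts:
        return "differentiated"
    return "blanket_or_partial"


# Hypothetical robots.txt bodies for the two non-silent cases.
SPLIT_POLICY = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
Allow: /
"""

BLANKET_POLICY = """\
User-agent: GPTBot
User-agent: OAI-SearchBot
Disallow: /
"""
```

`SPLIT_POLICY` shows the deliberate split described above: training crawlers are refused while citation crawlers stay welcome. `BLANKET_POLICY` is the accidental pattern, where one rule turns away both.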

The publishing-group story

Median SARI score by cohort. culture-entertainment leads at 69; platform sits at 20. Higher is more agent-legible.

Cohort | n | Median SARI
culture-entertainment | 5 | 69.0
vertical-travel | 3 | 67.0
industry-trade | 2 | 62.5
vertical-sports | 1 | 62.0
top-tier-news | 19 | 60.5
tech | 14 | 60.5
business-finance | 9 | 57.3
vertical-marketing | 4 | 47.5
newsletter-hybrid | 2 | 44.9
reviews-service | 9 | 40.0
vertical-food | 5 | 39.3
vertical-health | 4 | 37.6
vertical-science | 5 | 32.7
indie-longform | 8 | 30.5
platform | 3 | 20.0

Top 10 and Bottom 10

Range: 15 to 81. Median: 57. Four of the top ten are Vox Media properties; the bottom mixes platform-constrained publishers, paywalled titans, and design-led indies that crowd out structured data.

↑ Top performers

Rank | Publisher | Cohort | SARI
01 | Polygon | culture-entertainment | 81.0
02 | Pocket-lint | reviews-service | 79.7
03 | Seeking Alpha | business-finance | 75.0
04 | The Verge | tech | 75.0
05 | Eater | vertical-food | 74.0
06 | Bloomberg | top-tier-news | 73.0
07 | Vox | top-tier-news | 73.0
08 | Marketing Brew | vertical-marketing | 71.0
09 | Morning Brew | newsletter-hybrid | 71.0
10 | ZDNet | tech | 71.0

↓ Underperformers

Rank | Publisher | Cohort | SARI
94 | Smithsonian Magazine | vertical-science | 15.0
95 | Rtings | reviews-service | 15.0
96 | Quartz | business-finance | 15.0
97 | Substack | platform | 16.7
98 | Longreads | indie-longform | 18.0
99 | The Hustle | newsletter-hybrid | 18.7
100 | The Wall Street Journal | top-tier-news | 20.0
101 | Quanta Magazine | vertical-science | 20.0
102 | The Pudding | indie-longform | 20.0
103 | Mayo Clinic | vertical-health | 20.0

+42 points, one sprint

Median SARI is 57 of 100 across the audited cohort. Five fixes lift a typical site to 99. Ranked by point recovery.

+10  Add /llms.txt at the domain root
    1 file, half a day · missing in 94% of audited publishers
+10  Differentiate AI bot policy in robots.txt
    1 paragraph of explicit rules · missing in 92% of audited publishers
+12  Article JSON-LD: structured author, dual dates, publisher logo
    CMS template change · missing in 31% of articles
+5  Add id attributes to H2 and H3 headings
    Renderer or build step · missing in 89% of articles
+5  Connect publisher Organization to its sameAs
    Schema field, 2+ entries · missing in 91% of audited publishers
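The two schema fixes in this list (Article JSON-LD and the publisher's sameAs) can be illustrated together. The sketch below builds one JSON-LD object with standard schema.org property names; the function names and every value passed in are hypothetical placeholders, and the exact fields the SARI scoring checks are not specified here, so treat this as one plausible shape.

```python
import json
from typing import Dict, List


def build_article_jsonld(
    headline: str,
    author_name: str,
    author_url: str,
    date_published: str,
    date_modified: str,
    publisher_name: str,
    publisher_logo_url: str,
    publisher_same_as: List[str],
) -> Dict:
    """Build Article JSON-LD with a structured author, two dates, and a publisher."""
    if len(publisher_same_as) < 2:
        # The fix list above asks for two or more sameAs entries.
        raise ValueError("expected at least two sameAs entries for the publisher")
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        # Structured author: an object, not a bare name string.
        "author": {"@type": "Person", "name": author_name, "url": author_url},
        # Dual dates: first publication and latest revision, ISO 8601.
        "datePublished": date_published,
        "dateModified": date_modified,
        "publisher": {
            "@type": "Organization",
            "name": publisher_name,
            "logo": {"@type": "ImageObject", "url": publisher_logo_url},
            "sameAs": list(publisher_same_as),
        },
    }


def to_script_tag(data: Dict) -> str:
    """Serialise the object into a script tag for a CMS article template."""
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">' + body + "</script>"
```

In a CMS template, the function arguments would come from the article and site records, with the resulting tag emitted once per article page.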
+42  Total points recoverable in a single sprint
57 → 99 of 100
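The heading-id fix is the one most naturally handled as a build step. A minimal sketch follows, assuming the rendered article HTML is available as a string; the function names are hypothetical, and the regular expression is a deliberate simplification that a production pipeline would likely replace with a real HTML parser. Ids that already exist are left untouched and are not checked for collisions with generated ones.

```python
import re

# Matches <h2> and <h3> elements, capturing tag name, attributes, and inner HTML.
HEADING = re.compile(r"<(h[23])((?:\s[^>]*)?)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)
HAS_ID = re.compile(r"(?<![\w-])id\s*=", re.IGNORECASE)


def slugify(inner_html: str) -> str:
    """Turn heading content into a lowercase, hyphen-separated anchor."""
    text = re.sub(r"<[^>]+>", "", inner_html)
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug or "section"


def add_heading_ids(html: str) -> str:
    """Add an id attribute to every H2 and H3 that lacks one."""
    seen = {}

    def replace(match):
        tag, attrs, inner = match.group(1), match.group(2), match.group(3)
        if HAS_ID.search(attrs):
            return match.group(0)  # keep author-supplied ids as they are
        base = slugify(inner)
        seen[base] = seen.get(base, 0) + 1
        # Repeated headings get a numeric suffix so each id stays unique.
        slug = base if seen[base] == 1 else f"{base}-{seen[base]}"
        return f'<{tag}{attrs} id="{slug}">{inner}</{tag}>'

    return HEADING.sub(replace, html)
```

Stable ids give an assistant something to cite below the page level, for example a URL ending in `#the-publishing-group-story` instead of the article as a whole.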
Infographic Sources (7 references)

The bot taxonomy is sourced verbatim from each operator's official documentation. The audit dataset (103 publisher sites, 270 sampled articles) is published in full alongside the companion article.

  1. Overview of OpenAI Crawlers
    OpenAI · Operator Documentation

    Bot taxonomy: GPTBot, OAI-SearchBot, ChatGPT-User, OAI-AdsBot.

  2. Does Anthropic crawl data from the web, and how can site owners block the crawler?
    Anthropic · Operator Documentation

    Bot taxonomy: ClaudeBot, Claude-User, Claude-SearchBot. Confirms Claude-Web is no longer documented.

  3. Perplexity Bots Documentation
    Perplexity · Operator Documentation

    Bot taxonomy: PerplexityBot, Perplexity-User.

  4. Google's Common Crawlers
    Google · Operator Documentation

    Bot taxonomy: Googlebot, Google-Extended, Google-CloudVertexBot.

  5. About Applebot
    Apple · Operator Documentation

    Bot taxonomy: Applebot, Applebot-Extended.

  6. About CCBot
    Common Crawl · Operator Documentation

    Bot taxonomy: CCBot.

  7. AI Act
    European Commission · Regulatory

    EU AI Act dates: 2 August 2025 GPAI obligations, 2 August 2026 Article 50 transparency rules.
