Original Research · 103 Publisher Sites

Most Publishers Are Invisible to AI Assistants

We audited 103 publisher sites against the Scaletific Agent-Readability Index. Five fixes lift any site by roughly forty points in a single sprint.

6 of 103 publishers have an llms.txt file

49.9 of 100 mean SARI score across the cohort

8 of 75 AI bot directives that distinguish training from citation

Section 01 · The Crawler Map

Every AI crawler the audit recognises

Each row is sourced from the operator's own documentation. Training crawlers feed model weights. Citation crawlers feed live answers.

User-Agent	Operator	Type	Function
`GPTBot`	OpenAI	Training	Builds dataset for future GPT models
`OAI-SearchBot`	OpenAI	Citation	Surfaces sites in ChatGPT search answers
`ChatGPT-User`	OpenAI	User-triggered	Live fetch when a ChatGPT user asks for a URL
`OAI-AdsBot`	OpenAI	Ad validation	Validates ad landing pages submitted to ChatGPT
`ClaudeBot`	Anthropic	Training	Builds dataset for Claude models
`Claude-User`	Anthropic	User-triggered	Live fetch when a Claude user asks a question
`Claude-SearchBot`	Anthropic	Citation	Improves Claude's search result quality
`PerplexityBot`	Perplexity	Citation	Surfaces sites in Perplexity answers
`Perplexity-User`	Perplexity	User-triggered	Live fetch on user request
`Googlebot`	Google	Search index	Powers Google Search; output feeds AI Overviews
`Google-Extended`	Google	Training	robots.txt control token; disallowing it opts the site out of Gemini training and grounding
`Google-CloudVertexBot`	Google	Site-owner-requested	Crawls for site-owner-built Vertex AI Agents
`Applebot`	Apple	Search + AI	Powers Spotlight, Siri, Safari; data may train AI
`Applebot-Extended`	Apple	Training opt-out signal	Metadata-only; does not crawl
`CCBot`	Common Crawl	Open data	Open repository many AI labs use as training input
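
As a rough illustration of how the table above could be used in log analysis, here is a minimal Python sketch that maps a raw User-Agent header to the table's Type column. The function name `classify_user_agent` and the substring-matching approach are assumptions made for this example, not the audit's own tooling. User-Agent strings can be spoofed, so this is suitable for reporting, not for access control.

```python
from typing import Optional

# Tokens and types transcribed from the crawler table above.
CRAWLER_TYPES = {
    "GPTBot": "Training",
    "OAI-SearchBot": "Citation",
    "ChatGPT-User": "User-triggered",
    "OAI-AdsBot": "Ad validation",
    "ClaudeBot": "Training",
    "Claude-User": "User-triggered",
    "Claude-SearchBot": "Citation",
    "PerplexityBot": "Citation",
    "Perplexity-User": "User-triggered",
    "Googlebot": "Search index",
    "Google-Extended": "Training",
    "Google-CloudVertexBot": "Site-owner-requested",
    "Applebot": "Search + AI",
    "Applebot-Extended": "Training opt-out signal",
    "CCBot": "Open data",
}


def classify_user_agent(user_agent_header: str) -> Optional[str]:
    """Return the table's Type for the first known token found in the header.

    Hypothetical helper: matching is a case-insensitive substring test.
    """
    lowered = user_agent_header.lower()
    # Longest tokens first, so "Applebot-Extended" is not shadowed by "Applebot".
    for token in sorted(CRAWLER_TYPES, key=len, reverse=True):
        if token.lower() in lowered:
            return CRAWLER_TYPES[token]
    return None
```

Calling `classify_user_agent` on a header that contains `OAI-SearchBot` returns `"Citation"`, while an ordinary browser header returns `None`.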

Section 02 · Discovery Signals

What 103 publishers actually publish

Sitemaps are universal. AI bot directives are common. The new agent-discovery files (llms.txt, MCP well-known) are not.

Signal	Share of publishers
Sitemap	97.1%
Any AI bot directive	72.8%
/llms.txt	5.8%
/.well-known/mcp.json	4.9%
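
A site owner can check these signals on their own domain with a few HTTP requests. The sketch below is a minimal Python example under stated assumptions: `/sitemap.xml` is an assumed conventional location (the audit's sitemap detection method is not described here), a 200 status is treated as "present", and the names `discovery_urls`, `probe`, and `http_status` are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict
from urllib.parse import urljoin

# Paths named in this section. "/sitemap.xml" is an assumption; many sites
# declare their sitemap location in robots.txt instead.
DISCOVERY_PATHS = {
    "sitemap": "/sitemap.xml",
    "llms_txt": "/llms.txt",
    "mcp_well_known": "/.well-known/mcp.json",
}


def discovery_urls(origin: str) -> Dict[str, str]:
    """Build the absolute URL of each discovery file for a site."""
    return {name: urljoin(origin, path) for name, path in DISCOVERY_PATHS.items()}


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a HEAD request, or 0 if the host is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except urllib.error.URLError:
        return 0


def probe(origin: str, fetch_status: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Report which discovery files answer with HTTP 200."""
    return {name: fetch_status(url) == 200 for name, url in discovery_urls(origin).items()}
```

Passing a custom `fetch_status` callable keeps the check testable without network access; `probe("https://example.com")` uses the live `http_status` helper by default.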

Section 03 · The Big Finding

AI policy is performative, not differentiated

Of 103 audited publishers, only 8 apply non-identical rules to training and citation crawlers. Most of the rest are turning away citation traffic by accident.

8 of 103 differentiate

The same rule that blocks GPTBot also blocks OAI-SearchBot. Publishers with a deliberate split keep citation visibility while opting out of training.

Differentiated: Training and citation crawlers receive non-identical rules

Blanket / partial: Same directive applied across every AI crawler addressed

Silent: No AI bot directive in robots.txt

Each square = one publisher. n = 103.
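
The three categories can be approximated with a short robots.txt check. The Python sketch below is an illustration only: the audit's actual classifier is not described here, the training and citation token lists are taken from the Type column of the crawler table, and the helper names `parse_groups` and `classify_policy` are hypothetical. It compares the Allow/Disallow rules that explicitly address each crawler and ignores wildcards and `*` fallbacks.

```python
from typing import Dict, List, Tuple

# Token lists taken from the crawler table's Type column (lowercased).
TRAINING = ("gptbot", "claudebot", "google-extended")
CITATION = ("oai-searchbot", "claude-searchbot", "perplexitybot")
# Assumption: any of these tokens counts as an "AI bot directive".
AI_TOKENS = TRAINING + CITATION + ("ccbot", "applebot-extended", "chatgpt-user",
                                   "claude-user", "perplexity-user")

Rule = Tuple[str, str]


def parse_groups(robots_txt: str) -> Dict[str, List[Rule]]:
    """Map each named user-agent (lowercased) to its Allow/Disallow rules."""
    groups: Dict[str, List[Rule]] = {}
    current_agents: List[str] = []
    seen_rule = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a rule line closed the previous group
                current_agents, seen_rule = [], False
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def classify_policy(robots_txt: str) -> str:
    """Return 'Differentiated', 'Blanket / partial', or 'Silent'."""
    groups = parse_groups(robots_txt)
    if not any(token in groups for token in AI_TOKENS):
        return "Silent"
    training = {tuple(sorted(groups[a])) for a in TRAINING if a in groups}
    citation = {tuple(sorted(groups[a])) for a in CITATION if a in groups}
    if training and citation and training != citation:
        return "Differentiated"
    return "Blanket / partial"


# Example of a deliberate split: opt out of training, stay visible for citation.
DIFFERENTIATED_EXAMPLE = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /
"""
```

`DIFFERENTIATED_EXAMPLE` shows the kind of explicit split the section describes; a file that lists GPTBot and OAI-SearchBot under the same `Disallow: /` falls into the blanket category instead.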

Section 04 · Cohort Story

The publishing-group story

Median SARI score by cohort. culture-entertainment leads at 69; platform sits at 20. Higher is more agent-legible.

Cohort	n	Median SARI
culture-entertainment	5	69.0
vertical-travel	3	67.0
industry-trade	2	62.5
vertical-sports	1	62.0
top-tier-news	19	60.5
tech	14	60.5
business-finance	9	57.3
vertical-marketing	4	47.5
newsletter-hybrid	2	44.9
reviews-service	9	40.0
vertical-food	5	39.3
vertical-health	4	37.6
vertical-science	5	32.7
indie-longform	8	30.5
platform	3	20.0

Section 05 · The Leaderboard

Top 10 and Bottom 10

Range: 15 to 81. Median: 57. Five of the top ten are Vox Media properties; the bottom mixes platform-constrained publishers, paywalled titans, and design-led indies that crowd out structured data.

↑ Top performers

Publisher	Cohort	SARI score
Polygon	culture-entertainment	81.0
Pocket-lint	reviews-service	79.7
Seeking Alpha	business-finance	75.0
The Verge	tech	75.0
Eater	vertical-food	74.0
Bloomberg	top-tier-news	73.0
Vox	top-tier-news	73.0
Marketing Brew	vertical-marketing	71.0
Morning Brew	newsletter-hybrid	71.0
ZDNet	tech	71.0

↓ Underperformers

Publisher	Cohort	SARI score
Smithsonian Magazine	vertical-science	15.0
Rtings	reviews-service	15.0
Quartz	business-finance	15.0
Substack	platform	16.7
Longreads	indie-longform	18.0
The Hustle	newsletter-hybrid	18.7
The Wall Street Journal	top-tier-news	20.0
Quanta Magazine	vertical-science	20.0
The Pudding	indie-longform	20.0
Mayo Clinic	vertical-health	20.0

Section 06 · The Action

+42 points, one sprint

Median SARI is 57 of 100 across the audited cohort. Five fixes lift a typical site to 99. Ranked by point recovery.

+10

Add /llms.txt at the domain root

1 file, half a day · missing in 94% of audited publishers

+10

Differentiate AI bot policy in robots.txt

1 paragraph of explicit rules · missing in 92% of audited publishers

+12

Article JSON-LD: structured author, dual dates, publisher logo

CMS template change · missing in 31% of articles

Add id attributes to H2 and H3 headings

Renderer or build step · missing in 89% of articles

Connect publisher Organization to its sameAs

Schema field, 2+ entries · missing in 91% of audited publishers

+42

Total points recoverable in a single sprint

57 → 99 of 100
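
To make the JSON-LD and sameAs fixes concrete, here is a minimal Python sketch that builds an Article object with a structured author, both dates, a publisher logo, and publisher `sameAs` links. The property names are standard schema.org vocabulary; the function names, argument names, and example values are hypothetical, and the exact fields the SARI scoring checks are not specified here.

```python
import json
from typing import Dict, List


def article_jsonld(headline: str, author_name: str, author_url: str,
                   published: str, modified: str, publisher_name: str,
                   logo_url: str, same_as: List[str]) -> Dict:
    """Build schema.org Article markup covering the fields named in the fix list."""
    if len(same_as) < 2:
        # The fix list above calls for 2+ sameAs entries.
        raise ValueError("provide at least two sameAs URLs")
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        # Structured author: an object, not a bare string.
        "author": {"@type": "Person", "name": author_name, "url": author_url},
        # Dual dates in ISO 8601.
        "datePublished": published,
        "dateModified": modified,
        "publisher": {
            "@type": "Organization",
            "name": publisher_name,
            "logo": {"@type": "ImageObject", "url": logo_url},
            "sameAs": list(same_as),
        },
    }


def as_script_tag(data: Dict) -> str:
    """Serialise to a script element for the page head."""
    payload = json.dumps(data, indent=2)
    # Escape "</" so text inside the JSON cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Hypothetical example values.
example = article_jsonld(
    headline="Example headline",
    author_name="Jane Doe",
    author_url="https://example.com/authors/jane-doe",
    published="2025-01-15T09:00:00Z",
    modified="2025-01-16T12:30:00Z",
    publisher_name="Example Publisher",
    logo_url="https://example.com/logo.png",
    same_as=["https://www.wikidata.org/wiki/Q0", "https://www.linkedin.com/company/example"],
)
```

In a CMS, the equivalent logic would usually live in the article template so every page emits the block automatically.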

Infographic Sources · 7 references

The bot taxonomy is sourced verbatim from each operator's official documentation. The audit dataset (103 publisher sites, 270 sampled articles) is published in full alongside the companion article.

Overview of OpenAI Crawlers
OpenAI · Operator Documentation
Bot taxonomy: GPTBot, OAI-SearchBot, ChatGPT-User, OAI-AdsBot.

Does Anthropic crawl data from the web, and how can site owners block the crawler?
Anthropic · Operator Documentation
Bot taxonomy: ClaudeBot, Claude-User, Claude-SearchBot. Confirms Claude-Web is no longer documented.

Perplexity Bots Documentation
Perplexity · Operator Documentation
Bot taxonomy: PerplexityBot, Perplexity-User.

Google's Common Crawlers
Google · Operator Documentation
Bot taxonomy: Googlebot, Google-Extended, Google-CloudVertexBot.

About Applebot
Apple · Operator Documentation
Bot taxonomy: Applebot, Applebot-Extended.

About CCBot
Common Crawl · Operator Documentation
Bot taxonomy: CCBot.

AI Act
European Commission · Regulatory
EU AI Act dates: 2 August 2025 GPAI obligations, 2 August 2026 Article 50 transparency rules.
