AI Crawler Access Policy: robots.txt, GPTBot, OAI-SearchBot, Google-Extended

AI crawler policy is now a real site-operations decision. A public site may want ordinary Google visibility, ChatGPT search inclusion, assistant-user fetching, and product discovery. The same site may not want every page used for model training, scraped by unknown bots, or fetched so aggressively that logs and analytics become useless.

The mistake is copying a giant robots.txt blocklist without naming the business goal. A crawler policy should answer five questions:

  1. Which public pages should be easy to discover?
  2. Which bots are useful for search, answers, product discovery, or user-triggered research?
  3. Which crawlers are only useful for training or broad data collection?
  4. Which pages need authentication, noindex, removal, or WAF rules instead of robots directives?
  5. How will the team measure whether the policy is working?

For most public product, publisher, and documentation sites, start with a split policy instead of an all-or-nothing policy.

| Bot or control | Primary job | Default posture for public decision pages | Watch out for |
| --- | --- | --- | --- |
| Googlebot and Google common crawlers | Search indexing and Google product crawls | Keep important public pages crawlable | Do not use robots.txt as a privacy boundary |
| Google-Extended | Control token for Gemini and Vertex AI training or grounding use, not a separate request user agent | Decide separately from Google Search crawling | Blocking Google-Extended does not remove pages from Google Search |
| OAI-SearchBot | OpenAI search crawler for ChatGPT search features | Usually allow for public pages that deserve discovery | Blocking it can reduce inclusion in ChatGPT search answers |
| GPTBot | OpenAI crawler for content that may be used in foundation-model training | Decide based on publisher rights, licensing, and content strategy | It is independent from OAI-SearchBot policy |
| ChatGPT-User | User-triggered visits from ChatGPT or Custom GPT actions | Treat as a user-style fetch, then protect sensitive data with auth and server controls | OpenAI says robots.txt rules may not apply because the action is user-initiated |
| CCBot | Common Crawl crawler for the public crawl corpus | Decide by licensing and public-data posture | Broad blocks may reduce downstream research reuse |
| Cloudflare AI Crawl Control and Content Signals | Visibility, policy expression, and enforcement support | Use for monitoring, crawler-specific rules, and explicit content-use preferences | Content signals express preferences; they are not technical enforcement by themselves |

The practical policy is:

  • allow search crawlers to reach public pages that should be discoverable;
  • separate AI search crawlers from training crawlers where providers allow it;
  • keep private, account, admin, pricing-experiment, staging, and customer pages behind authentication;
  • monitor logs before and after any policy change;
  • never use crawler policy as a substitute for useful page content.

| Site type | Strong default | Why |
| --- | --- | --- |
| B2B SaaS product site | Allow Googlebot and OAI-SearchBot on product, comparison, pricing, docs, and policy pages; review GPTBot and CCBot by company policy | Discovery and comparison pages need to be understood by both buyers and assistants |
| Publisher or analyst site | Keep high-value public pages crawlable for search; use Google-Extended, GPTBot, CCBot, licensing, or pay-per-crawl decisions based on content rights | Publishers need visibility, but may need stronger control over training and reuse |
| Ecommerce or marketplace site | Allow public product, category, policy, and feed-support pages; protect checkout, account, cart, and inventory admin paths | Product discovery benefits from clean public facts, while transactional surfaces need control |
| Technical documentation site | Allow stable public docs and changelog pages; block or authenticate internal, beta, and customer-specific docs | Assistants can help developers find docs, but stale or private docs create support risk |
| Internal knowledge base | Do not rely on robots.txt; require authentication and access control | Robots.txt is only a request instruction, not a security boundary |

Crawler policy becomes clearer when each rule is assigned to a layer.

| Layer | Question | Better control |
| --- | --- | --- |
| Discovery | Should this public page be found and linked? | Allow search and AI search crawlers, keep sitemap current |
| Training preference | May this content be used to train or improve models? | GPTBot, Google-Extended, CCBot, licensing, Content Signals |
| User-triggered access | Can a user ask an assistant to fetch this page? | Authentication, paywall, rate limits, server-side authorization |
| Sensitive data | Could this page expose private information if fetched? | Authentication, noindex, removal, access control, WAF |
| Enforcement | Are bots following the policy? | Logs, bot verification, edge rules, AI Crawl Control, incident review |
| Measurement | Did access produce useful reader behavior? | Server logs, analytics, CRM notes, conversion quality review |

This prevents the most common failure: using a training-crawler rule to solve a security problem, or using a search-crawler block to express a licensing preference.

Google’s Search Central documentation is blunt about the limitation: robots.txt can manage crawler access, but it should not be used to hide pages from Search or secure sensitive content. A disallowed URL can still appear if other pages link to it, and different crawlers may interpret or ignore rules differently.

Use stronger controls when a URL should not be public:

  • require login or signed access;
  • remove the URL;
  • return the right status code;
  • use noindex where appropriate and crawlable;
  • block abusive traffic at the edge;
  • avoid placing secrets, customer data, or private docs on public URLs.

For public decision pages, robots.txt should express crawl preference. It should not carry the burden of privacy.

OpenAI currently separates important crawler roles.

| OpenAI user agent | Practical policy question |
| --- | --- |
| OAI-SearchBot | Do we want public pages to be eligible for ChatGPT search answers? |
| GPTBot | Do we allow this content to be used for training OpenAI foundation models? |
| ChatGPT-User | Are user-triggered fetches safe because the page itself has the right authentication and public-data boundary? |
| OAI-AdsBot | Are ad landing pages safe, compliant, and aligned with submitted advertising use? |

For a public product or publisher site, the common split is to allow OAI-SearchBot on pages that should be found while making a separate GPTBot decision. That split matters because a site can want assistant search visibility without granting the same preference for model training.

Example pattern for a public product site that wants ChatGPT search inclusion but wants to reserve training rights:

User-agent: *
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: GPTBot
Disallow: /
Sitemap: https://example.com/sitemap-index.xml

Do not copy this blindly. If your business wants broad reuse of public educational material, the GPTBot decision may be different. The important thing is to make the split deliberately and monitor the outcome.
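One way to make the split testable is to parse the robots.txt text and assert the outcome per crawler before deploying it. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the `check_policy` helper and the `/pricing` URL are illustrative assumptions, and the standard-library parser only approximates how any individual crawler interprets rules.

```python
from urllib.robotparser import RobotFileParser

# Same policy as the example above, with blank lines between groups.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap-index.xml
"""


def check_policy(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return {agent: allowed} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    result = check_policy(
        ROBOTS_TXT,
        "https://example.com/pricing",  # hypothetical public page
        ["Googlebot", "OAI-SearchBot", "GPTBot"],
    )
    for agent, allowed in result.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Running a check like this in CI against the committed robots.txt catches accidental edits, such as a stray `Disallow: /` under the wildcard group, before they reach production.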

Google has two separate ideas that teams often mix together:

  • Googlebot and related crawlers for Search and Google products;
  • Google-Extended as a robots.txt control token for whether crawled content may be used for Gemini and Vertex AI model training or grounding.

Google’s crawler documentation says Google-Extended is not a separate HTTP request user agent. It is a robots.txt token used in a control capacity. Google also says Google-Extended does not affect inclusion in Google Search and is not a Google Search ranking signal.

That means a site can keep Google Search crawling open while expressing a separate Google-Extended preference.

Example pattern for a publisher that wants Search crawling but does not want Google-Extended uses for the full site:

User-agent: *
Allow: /
User-agent: Google-Extended
Disallow: /
Sitemap: https://example.com/sitemap-index.xml

For a product site, be careful with broad blocks. The stronger long-term move is to make public pages accurate, crawlable, and useful, while protecting private or low-quality surfaces through stronger controls.

Cloudflare’s Content Signals Policy adds a way to express preferences for uses such as search, AI input, and AI training inside robots.txt. Cloudflare also documents AI Crawl Control for seeing which AI services access content, setting crawler-specific policies, checking robots.txt compliance, and exploring pay-per-crawl.

Content Signals can be useful as a public rights and preference statement:

User-agent: *
Content-Signal: search=yes, ai-train=no
Allow: /

Treat this as a policy signal, not a complete technical barrier. Pair it with:

  • bot and request logs;
  • crawler-specific allow or block rules;
  • WAF or bot-management rules for abusive behavior;
  • authentication for private data;
  • a monthly review of which pages are being fetched.
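For the review step, it can help to read the published signals back out of robots.txt and compare them with what the team intended. This is a small sketch, not an official parser: it assumes the comma-separated `name=value` layout shown in the example above, and `parse_content_signals` is a hypothetical helper name.

```python
def parse_content_signals(robots_txt: str) -> list[dict[str, str]]:
    """Collect every Content-Signal line as a dict of signal name -> value."""
    found = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "content-signal":
            continue
        signals = {}
        for pair in value.split(","):
            name, eq, setting = pair.partition("=")
            if eq:  # skip fragments that are not name=value
                signals[name.strip().lower()] = setting.strip().lower()
        found.append(signals)
    return found


if __name__ == "__main__":
    text = "User-agent: *\nContent-Signal: search=yes, ai-train=no\nAllow: /\n"
    print(parse_content_signals(text))
```

The result only tells you what the site is declaring. Whether a given crawler honors the declaration still has to be checked in logs.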

List page groups instead of individual URLs first:

  • homepage and hubs;
  • product and pricing pages;
  • comparison and alternatives pages;
  • documentation and changelogs;
  • policies and support pages;
  • blog or market-signal pages;
  • account, checkout, admin, staging, and customer-specific paths.

Mark each group as public discovery, public but rights-sensitive, private, or low-value.
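The grouping step can be kept as a small, reviewable mapping from path prefix to posture. The sketch below is illustrative only: every prefix in `PAGE_GROUPS` is a hypothetical example, and the `unreviewed` default is an assumption chosen so that unmapped paths surface instead of silently inheriting a posture.

```python
from collections import Counter

# Hypothetical prefix -> posture mapping; replace with your own page groups.
PAGE_GROUPS = [
    ("/account/", "private"),
    ("/admin/", "private"),
    ("/checkout/", "private"),
    ("/staging/", "private"),
    ("/docs/", "public discovery"),
    ("/pricing", "public discovery"),
    ("/compare/", "public discovery"),
    ("/blog/", "public but rights-sensitive"),
    ("/search", "low-value"),
]


def classify_path(path: str, default: str = "unreviewed") -> str:
    """Return the posture of the first matching prefix, or the default."""
    for prefix, posture in PAGE_GROUPS:
        if path.startswith(prefix):
            return posture
    return default


def summarize(paths) -> Counter:
    """Count how many paths fall under each posture."""
    return Counter(classify_path(p) for p in paths)


if __name__ == "__main__":
    print(summarize(["/docs/setup", "/admin/users", "/blog/launch", "/careers"]))
```

Feeding the sitemap or a sample of logged paths through `summarize` shows how much of the site is still unreviewed before any robots.txt rule is written.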

For each group, decide:

| Question | Good answer |
| --- | --- |
| Should Google Search crawl this? | Yes for public source and product pages; no for private or low-value utility paths |
| Should ChatGPT search discover this? | Yes for public pages that can answer a real reader question |
| Should training crawlers access this? | Depends on publisher rights, company policy, and licensing posture |
| Should user-triggered assistant fetches be safe? | Yes only if the page is safe for any unauthenticated user to view |
| Should the page appear in sitemaps? | Yes for canonical public pages that deserve maintenance |

Make the smallest useful change. Avoid giant vendor lists if you cannot maintain them.

For a discovery-first product site, the policy may stay simple:

User-agent: *
Allow: /
Sitemap: https://example.com/sitemap-index.xml

For a rights-sensitive publisher, add targeted rules and content signals only after deciding what each rule means:

User-agent: *
Content-Signal: search=yes, ai-train=no
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
Sitemap: https://example.com/sitemap-index.xml
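To keep a targeted list maintainable, one option is to generate the file from a short register that records why each rule exists. The sketch below assumes a hypothetical `REGISTER` structure and `render_robots_txt` helper; it reproduces the publisher pattern above and writes each reason as a robots.txt comment.

```python
# Hypothetical register: one entry per crawler token, with the reason recorded.
REGISTER = [
    {"agent": "GPTBot", "allow": False, "reason": "training preference"},
    {"agent": "Google-Extended", "allow": False, "reason": "training and grounding preference"},
    {"agent": "CCBot", "allow": False, "reason": "public corpus reuse"},
]


def render_robots_txt(register, content_signal=None, sitemap=None) -> str:
    """Render a wildcard allow group plus one group per register entry."""
    lines = ["User-agent: *"]
    if content_signal:
        lines.append(f"Content-Signal: {content_signal}")
    lines += ["Allow: /", ""]
    for entry in register:
        lines.append(f"# {entry['reason']}")
        lines.append(f"User-agent: {entry['agent']}")
        lines.append("Allow: /" if entry["allow"] else "Disallow: /")
        lines.append("")
    if sitemap:
        lines.append(f"Sitemap: {sitemap}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_robots_txt(
        REGISTER,
        content_signal="search=yes, ai-train=no",
        sitemap="https://example.com/sitemap-index.xml",
    ))
```

Because every rule carries a reason, a rule nobody can justify at the next review is an obvious candidate for removal.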

For private paths, a Disallow rule only reduces crawl noise; the real control is authentication or server-side authorization:

User-agent: *
Disallow: /account/
Disallow: /admin/
Disallow: /checkout/

Then verify that those paths are also protected by authentication or server rules. The Disallow lines are not enough by themselves.
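That verification can be scripted as an anonymous request per private path. The sketch below uses only the Python standard library; the `LOGIN_HINTS` values, the choice to count 404 as protected, and the helper names are assumptions to adapt to how your application actually rejects unauthenticated requests.

```python
import urllib.error
import urllib.request
from typing import Optional

PROTECTED_STATUSES = {401, 403, 404}   # assumption: 404 is used to hide private paths
REDIRECT_STATUSES = {301, 302, 303, 307, 308}
LOGIN_HINTS = ("/login", "/signin")    # hypothetical login paths


def is_protected(status: int, location: Optional[str] = None) -> bool:
    """Judge whether an anonymous response indicates the path is protected."""
    if status in PROTECTED_STATUSES:
        return True
    if status in REDIRECT_STATUSES and location:
        return any(hint in location for hint in LOGIN_HINTS)
    return False


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow; the 3xx then surfaces as an HTTPError


def fetch_anonymous(url: str, timeout: float = 10.0) -> tuple[int, Optional[str]]:
    """Request a URL without cookies or credentials; return (status, Location)."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, resp.headers.get("Location")
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")


if __name__ == "__main__":
    for path in ("/account/", "/admin/", "/checkout/"):
        status, location = fetch_anonymous(f"https://example.com{path}")
        print(path, status, "protected" if is_protected(status, location) else "EXPOSED")
```

A path that returns 200 to this script is readable by any crawler or user-triggered fetch, regardless of what robots.txt says.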

Check:

  • status codes for important public pages;
  • requests from known search and AI user agents;
  • sitemap fetches;
  • blocked paths that are still requested often;
  • user-triggered fetch behavior;
  • differences between bot activity and real sessions.
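Several of these checks can start from a simple tally of bot tokens and status codes in the access log. This sketch assumes the common "combined" log format and matches on substrings of the User-Agent field; `count_bot_hits` is a hypothetical helper, and the sample lines are made up. Google-Extended is deliberately absent from the token list because it is a robots.txt control token and does not appear as a request user agent. User-agent strings can be spoofed, so treat the counts as a first pass and confirm important findings against the operator's published verification guidance.

```python
from collections import Counter

# Tokens expected inside request User-Agent headers.
KNOWN_TOKENS = ["Googlebot", "OAI-SearchBot", "GPTBot", "ChatGPT-User", "CCBot"]


def count_bot_hits(log_lines) -> Counter:
    """Count (token, status) pairs for lines in combined log format."""
    hits = Counter()
    for line in log_lines:
        # Splitting on quotes yields: prefix, request, " status bytes ",
        # referer, separator, user agent, remainder.
        parts = line.split('"')
        if len(parts) < 6:
            continue  # not combined format; skip rather than guess
        status_fields = parts[2].split()
        status = status_fields[0] if status_fields else "?"
        user_agent = parts[5].lower()
        for token in KNOWN_TOKENS:
            if token.lower() in user_agent:
                hits[(token, status)] += 1
                break
    return hits


if __name__ == "__main__":
    # Made-up sample lines for illustration only.
    sample = [
        '203.0.113.5 - - [01/Jan/2025:00:00:00 +0000] "GET /pricing HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
        '203.0.113.6 - - [01/Jan/2025:00:00:01 +0000] "GET /admin/ HTTP/1.1" 403 128 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    ]
    for (token, status), n in sorted(count_bot_hits(sample).items()):
        print(token, status, n)
```

Comparing the tally from the week before a policy change with the week after gives a concrete baseline for the review.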

This connects the policy to the AI crawler referral measurement layer.

Days 12-14: refresh pages that earn access

If a page is important enough to allow and monitor, it should be useful enough to maintain.

Prioritize pages that:

  • answer a specific buyer or operator question;
  • include source notes and reviewed dates;
  • have clear fit and poor-fit cases;
  • link to adjacent implementation pages;
  • avoid stale pricing, model, availability, or policy claims.

Crawler policy can make access clearer. It cannot make a weak page valuable.

| Failure | Why it hurts | Better move |
| --- | --- | --- |
| Blocking every AI crawler because it feels safer | The site may lose useful discovery paths without actually securing private data | Separate search, training, user-triggered fetching, and private access |
| Allowing every bot because visibility sounds good | Logs become noisy and rights-sensitive content may be reused in ways the team did not approve | Define a crawler register and review it monthly |
| Relying on robots.txt for secrets | Robots directives are not security controls | Use authentication, authorization, and removal |
| Blocking important resources | Crawlers may understand pages less accurately | Keep CSS, JS, images, and public assets accessible unless there is a clear reason |
| Updating robots.txt without measurement | The team cannot tell whether the change helped or harmed discovery | Compare logs, search coverage, assistant referrals, and qualified sessions before and after |
| Treating Content Signals as enforcement | Some crawlers may ignore preference signals | Pair signals with edge rules, legal policy, and bot controls where needed |

| Source | Signal used |
| --- | --- |
| OpenAI crawler documentation | OpenAI separates OAI-SearchBot, GPTBot, OAI-AdsBot, and ChatGPT-User roles; OAI-SearchBot and GPTBot can be managed independently in robots.txt. |
| Google common crawlers documentation | Google-Extended is a robots.txt token rather than a separate HTTP user agent and does not affect Google Search inclusion or ranking. |
| Google robots.txt guide | robots.txt manages crawler access but is not a secure way to hide pages; disallowed URLs may still appear if linked elsewhere. |
| Cloudflare AI Crawl Control | Cloudflare provides AI crawler visibility, crawler-specific controls, robots.txt compliance monitoring, and pay-per-crawl options. |
| Cloudflare Content Signals Policy | Content Signals can express preferences for search, AI input, and AI training in robots.txt, but should be paired with enforcement controls where needed. |
| Common Crawl CCBot documentation | CCBot identifies itself in its user agent and can be controlled with a CCBot robots.txt rule. |