AI Crawler Access Policy: robots.txt, GPTBot, OAI-SearchBot, Google-Extended
AI crawler policy is now a real site-operations decision. A public site may want ordinary Google visibility, ChatGPT search inclusion, assistant-user fetching, and product discovery. The same site may not want every page used for model training, scraped by unknown bots, or fetched so aggressively that logs and analytics become useless.
The mistake is copying a giant robots.txt blocklist without naming the business goal. A crawler policy should answer five questions:
- Which public pages should be easy to discover?
- Which bots are useful for search, answers, product discovery, or user-triggered research?
- Which crawlers are only useful for training or broad data collection?
- Which pages need authentication, noindex, removal, or WAF rules instead of robots directives?
- How will the team measure whether the policy is working?
Quick answer
For most public product, publisher, and documentation sites, start with a split policy instead of an all-or-nothing policy.
| Bot or control | Primary job | Default posture for public decision pages | Watch out for |
|---|---|---|---|
| Googlebot and Google common crawlers | Search indexing and Google product crawls | Keep important public pages crawlable | Do not use robots.txt as a privacy boundary |
| Google-Extended | Control token for Gemini and Vertex AI training or grounding use, not a separate request user agent | Decide separately from Google Search crawling | Blocking Google-Extended does not remove pages from Google Search |
| OAI-SearchBot | OpenAI search crawler for ChatGPT search features | Usually allow for public pages that deserve discovery | Blocking it can reduce inclusion in ChatGPT search answers |
| GPTBot | OpenAI crawler for content that may be used in foundation-model training | Decide based on publisher rights, licensing, and content strategy | It is independent from OAI-SearchBot policy |
| ChatGPT-User | User-triggered visits from ChatGPT or Custom GPT actions | Treat as a user-style fetch, then protect sensitive data with auth and server controls | OpenAI says robots.txt rules may not apply because the action is user-initiated |
| CCBot | Common Crawl crawler for the public crawl corpus | Decide by licensing and public-data posture | Broad blocks may reduce downstream research reuse |
| Cloudflare AI Crawl Control and Content Signals | Visibility, policy expression, and enforcement support | Use for monitoring, crawler-specific rules, and explicit content-use preferences | Content signals express preferences; they are not technical enforcement by themselves |
The practical policy is:
- allow search crawlers to reach public pages that should be discoverable;
- separate AI search crawlers from training crawlers where providers allow it;
- keep private, account, admin, pricing-experiment, staging, and customer pages behind authentication;
- monitor logs before and after any policy change;
- never use crawler policy as a substitute for useful page content.
Decision matrix by site type
| Site type | Strong default | Why |
|---|---|---|
| B2B SaaS product site | Allow Googlebot and OAI-SearchBot on product, comparison, pricing, docs, and policy pages; review GPTBot and CCBot by company policy | Discovery and comparison pages need to be understood by both buyers and assistants |
| Publisher or analyst site | Keep high-value public pages crawlable for search; use Google-Extended, GPTBot, CCBot, licensing, or pay-per-crawl decisions based on content rights | Publishers need visibility, but may need stronger control over training and reuse |
| Ecommerce or marketplace site | Allow public product, category, policy, and feed-support pages; protect checkout, account, cart, and inventory admin paths | Product discovery benefits from clean public facts, while transactional surfaces need control |
| Technical documentation site | Allow stable public docs and changelog pages; block or authenticate internal, beta, and customer-specific docs | Assistants can help developers find docs, but stale or private docs create support risk |
| Internal knowledge base | Do not rely on robots.txt; require authentication and access control | Robots.txt is only a request instruction, not a security boundary |
The layer model
Crawler policy becomes clearer when each rule is assigned to a layer.
| Layer | Question | Better control |
|---|---|---|
| Discovery | Should this public page be found and linked? | Allow search and AI search crawlers, keep sitemap current |
| Training preference | May this content be used to train or improve models? | GPTBot, Google-Extended, CCBot, licensing, Content Signals |
| User-triggered access | Can a user ask an assistant to fetch this page? | Authentication, paywall, rate limits, server-side authorization |
| Sensitive data | Could this page expose private information if fetched? | Authentication, noindex, removal, access control, WAF |
| Enforcement | Are bots following the policy? | Logs, bot verification, edge rules, AI Crawl Control, incident review |
| Measurement | Did access produce useful reader behavior? | Server logs, analytics, CRM notes, conversion quality review |
This prevents the most common failure: using a training-crawler rule to solve a security problem, or using a search-crawler block to express a licensing preference.
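The "bot verification" control in the enforcement layer matters because a User-Agent header can be set by anyone. A minimal sketch, assuming the crawler operator publishes the IP ranges it crawls from: the CIDR blocks below are placeholders from documentation-only address space, and `PUBLISHED_RANGES` and `is_verified` are hypothetical names, not part of any vendor tooling.

```python
import ipaddress

# Placeholder CIDR blocks taken from documentation-only address space.
# Real values must come from the lists each crawler operator publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "OAI-SearchBot": ["198.51.100.0/24", "2001:db8::/32"],
}


def is_verified(bot_token: str, remote_ip: str) -> bool:
    """Return True if remote_ip falls inside a range listed for bot_token."""
    address = ipaddress.ip_address(remote_ip)
    for cidr in PUBLISHED_RANGES.get(bot_token, []):
        network = ipaddress.ip_network(cidr)
        # An IPv4 address is never "in" an IPv6 network (and vice versa),
        # so mixed lists can be checked without special cases.
        if address in network:
            return True
    return False
```

A request that claims to be GPTBot but fails this check is better handled by edge or WAF rules than by robots.txt, since it is not following the declared identity in the first place.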
robots.txt is not a security boundary
Google’s Search Central documentation is blunt about the limitation: robots.txt can manage crawler access, but it should not be used to hide pages from Search or secure sensitive content. A disallowed URL can still appear if other pages link to it, and different crawlers may interpret or ignore rules differently.
Use stronger controls when a URL should not be public:
- require login or signed access;
- remove the URL;
- return the right status code;
- use noindex where appropriate, on pages that stay crawlable so the directive can actually be read;
- block abusive traffic at the edge;
- avoid placing secrets, customer data, or private docs on public URLs.
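Several of these controls live in the application rather than in robots.txt. The sketch below is a minimal WSGI app, under stated assumptions, that answers private paths with 401 and marks a low-value path with an `X-Robots-Tag: noindex` header. The path prefixes and the `probe` helper are hypothetical, and checking only for the presence of an Authorization header stands in for real credential validation.

```python
from wsgiref.util import setup_testing_defaults

PRIVATE_PREFIXES = ("/account/", "/admin/")  # hypothetical private paths
NOINDEX_PREFIXES = ("/search/",)             # hypothetical low-value paths


def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]

    # Placeholder check: a real app would validate the credential itself.
    if path.startswith(PRIVATE_PREFIXES) and "HTTP_AUTHORIZATION" not in environ:
        headers.append(("WWW-Authenticate", 'Basic realm="private"'))
        start_response("401 Unauthorized", headers)
        return [b"Login required\n"]

    # The page stays fetchable, but asks indexers not to list it.
    if path.startswith(NOINDEX_PREFIXES):
        headers.append(("X-Robots-Tag", "noindex"))

    start_response("200 OK", headers)
    return [b"Content\n"]


def probe(path, authorization=None):
    """Call the app in-process and return (status line, headers dict)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    if authorization:
        environ["HTTP_AUTHORIZATION"] = authorization
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    app(environ, start_response)
    return captured["status"], captured["headers"]
```

The point of the sketch is the division of labor: the status code and the header are enforced by the server on every request, whichever crawler or user sends it.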
For public decision pages, robots.txt should express crawl preference. It should not carry the burden of privacy.
OpenAI crawler policy
OpenAI currently documents separate user agents for distinct crawler roles.
| OpenAI user agent | Practical policy question |
|---|---|
| OAI-SearchBot | Do we want public pages to be eligible for ChatGPT search answers? |
| GPTBot | Do we allow this content to be used for training OpenAI foundation models? |
| ChatGPT-User | Are user-triggered fetches safe because the page itself has the right authentication and public-data boundary? |
| OAI-AdsBot | Are ad landing pages safe, compliant, and aligned with submitted advertising use? |
For a public product or publisher site, the common split is to allow OAI-SearchBot on pages that should be found while making a separate GPTBot decision. That split matters because a site can want assistant search visibility without granting the same preference for model training.
Example pattern for a public product site that wants ChatGPT search inclusion but wants to reserve training rights:

    User-agent: *
    Allow: /

    User-agent: OAI-SearchBot
    Allow: /

    User-agent: GPTBot
    Disallow: /

    Sitemap: https://example.com/sitemap-index.xml

Do not copy this blindly. If your business wants broad reuse of public educational material, the GPTBot decision may be different. The important thing is to make the split deliberately and monitor the outcome.
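Before deploying a split like this, it can help to confirm that the file says what the team thinks it says. A minimal sketch using Python's standard `urllib.robotparser`; the `crawl_decisions` helper and the example URL are illustrative. Python's parser applies the first matching rule, while individual crawlers may resolve conflicting rules differently, so treat the result as a sanity check rather than a guarantee of crawler behavior.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap-index.xml
"""


def crawl_decisions(robots_txt, user_agents, url):
    """Map each user agent to whether the parsed rules permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in user_agents}


decisions = crawl_decisions(
    ROBOTS_TXT,
    ["Googlebot", "OAI-SearchBot", "GPTBot"],
    "https://example.com/pricing/",
)
for agent, allowed in decisions.items():
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

Running the same check in CI against the deployed robots.txt is one way to catch an accidental change to the GPTBot or OAI-SearchBot groups.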
Google crawler policy
Google has two separate ideas that teams often mix together:
- Googlebot and related crawlers for Search and Google products;
- Google-Extended as a robots.txt control token for whether crawled content may be used for Gemini and Vertex AI model training or grounding.
Google’s crawler documentation says Google-Extended is not a separate HTTP request user agent. It is a robots.txt token used in a control capacity. Google also says Google-Extended does not affect inclusion in Google Search and is not a Google Search ranking signal.
That means a site can keep Google Search crawling open while expressing a separate Google-Extended preference.
Example pattern for a publisher that wants Search crawling but does not want Google-Extended uses for the full site:

    User-agent: *
    Allow: /

    User-agent: Google-Extended
    Disallow: /

    Sitemap: https://example.com/sitemap-index.xml

For a product site, be careful with broad blocks. The stronger long-term move is to make public pages accurate, crawlable, and useful, while protecting private or low-quality surfaces through stronger controls.
Content Signals and Cloudflare controls
Cloudflare’s Content Signals Policy adds a way to express preferences for uses such as search, AI input, and AI training inside robots.txt. Cloudflare also documents AI Crawl Control for seeing which AI services access content, setting crawler-specific policies, checking robots.txt compliance, and exploring pay-per-crawl.
Content Signals can be useful as a public rights and preference statement:

    User-agent: *
    Content-Signal: search=yes, ai-train=no
    Allow: /

Treat this as a policy signal, not a complete technical barrier. Pair it with:
- bot and request logs;
- crawler-specific allow or block rules;
- WAF or bot-management rules for abusive behavior;
- authentication for private data;
- a monthly review of which pages are being fetched.
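For the monthly review, it can be useful to read back which signals the live robots.txt actually declares. The sketch below assumes the comma-separated `key=value` syntax shown in the example above and nothing more; `parse_content_signals` is a hypothetical helper, not a Cloudflare API.

```python
def parse_content_signals(robots_txt: str) -> dict[str, dict[str, str]]:
    """Collect Content-Signal preferences per User-agent group."""
    signals: dict[str, dict[str, str]] = {}
    current_agents: list[str] = []
    in_rules = False  # True once the current group has seen a non-User-agent line

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            # A User-agent line after rules starts a new group.
            if in_rules:
                current_agents, in_rules = [], False
            current_agents.append(value)
            continue

        in_rules = True
        if field == "content-signal":
            for pair in value.split(","):
                if "=" not in pair:
                    continue
                key, _, setting = pair.partition("=")
                for agent in current_agents:
                    signals.setdefault(agent, {})[key.strip()] = setting.strip()
    return signals
```

Comparing this output with the documented policy makes drift visible, but it only reports what the file declares, not whether any crawler honors it.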
A practical 14-day implementation plan
Days 1-2: inventory public surfaces
List page groups instead of individual URLs first:
- homepage and hubs;
- product and pricing pages;
- comparison and alternatives pages;
- documentation and changelogs;
- policies and support pages;
- blog or market-signal pages;
- account, checkout, admin, staging, and customer-specific paths.
Mark each group as public discovery, public but rights-sensitive, private, or low-value.
Days 3-5: choose crawler posture by group
For each group, decide:
| Question | Good answer |
|---|---|
| Should Google Search crawl this? | Yes for public source and product pages; no for private or low-value utility paths |
| Should ChatGPT search discover this? | Yes for public pages that can answer a real reader question |
| Should training crawlers access this? | Depends on publisher rights, company policy, and licensing posture |
| Should user-triggered assistant fetches be safe? | Yes only if the page is safe for any unauthenticated user to view |
| Should the page appear in sitemaps? | Yes for canonical public pages that deserve maintenance |
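One way to keep these per-group decisions maintainable is to record them as data and render robots.txt from that record. A minimal sketch: `AgentPolicy` and `render_robots` are hypothetical names, and the paths are examples only. Disallow lines are written before Allow lines so that first-match parsers and longest-match parsers reach the same answer for the paths shown.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentPolicy:
    user_agent: str
    allow: list[str] = field(default_factory=list)
    disallow: list[str] = field(default_factory=list)


def render_robots(policies: list[AgentPolicy], sitemap: Optional[str] = None) -> str:
    """Render one robots.txt group per policy, separated by blank lines."""
    blocks = []
    for policy in policies:
        lines = [f"User-agent: {policy.user_agent}"]
        lines += [f"Disallow: {path}" for path in policy.disallow]
        lines += [f"Allow: {path}" for path in policy.allow]
        blocks.append("\n".join(lines))
    if sitemap:
        blocks.append(f"Sitemap: {sitemap}")
    return "\n\n".join(blocks) + "\n"


policies = [
    AgentPolicy("*", allow=["/"], disallow=["/account/", "/admin/", "/checkout/"]),
    AgentPolicy("GPTBot", disallow=["/"]),
]
print(render_robots(policies, "https://example.com/sitemap-index.xml"))
```

Keeping the policy list in version control gives each rule an owner and a reviewable history, which a hand-edited blocklist usually lacks.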
Days 6-8: update robots and edge rules
Make the smallest useful change. Avoid giant vendor lists if you cannot maintain them.
For a discovery-first product site, the policy may stay simple:

    User-agent: *
    Allow: /

    Sitemap: https://example.com/sitemap-index.xml

For a rights-sensitive publisher, add targeted rules and content signals only after deciding what each rule means:

    User-agent: *
    Content-Signal: search=yes, ai-train=no
    Allow: /

    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    Sitemap: https://example.com/sitemap-index.xml

For private paths, use real access control:

    User-agent: *
    Disallow: /account/
    Disallow: /admin/
    Disallow: /checkout/

Then verify that those paths are also protected by authentication or server rules. The Disallow lines are not enough by themselves.
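That verification step can be scripted. The sketch below takes an injected `fetch_status` callable, assumed to request a URL without credentials and without following redirects, so the logic can be tested without network access. `find_exposed_paths` and `PROTECTED_STATUSES` are hypothetical, and treating any redirect as "protected" assumes the redirect leads to a login page.

```python
from typing import Callable, Iterable

# Statuses treated as protected: a redirect (assumed to go to login),
# an auth challenge, forbidden, or not found.
PROTECTED_STATUSES = {301, 302, 303, 307, 308, 401, 403, 404}


def find_exposed_paths(
    base_url: str,
    paths: Iterable[str],
    fetch_status: Callable[[str], int],
) -> list[str]:
    """Return paths whose unauthenticated status is not in PROTECTED_STATUSES."""
    exposed = []
    for path in paths:
        status = fetch_status(base_url.rstrip("/") + path)
        if status not in PROTECTED_STATUSES:
            exposed.append(path)
    return exposed
```

Any path this returns is readable by an anonymous client, which means the Disallow line is the only thing standing between it and a crawler that ignores robots.txt.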
Days 9-11: monitor logs and coverage
Check:
- status codes for important public pages;
- requests from known search and AI user agents;
- sitemap fetches;
- blocked paths that are still requested often;
- user-triggered fetch behavior;
- differences between bot activity and real sessions.
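Several of these checks reduce to counting requests per crawler and status code. A minimal sketch, assuming access logs in the common "combined" format; the token list and the `count_bot_requests` helper are illustrative. Google-Extended is left out on purpose, because it is a robots.txt token rather than a request user agent and will not appear in logs.

```python
import re
from collections import Counter

# Tokens to look for in the User-Agent field of each log line.
BOT_TOKENS = ["OAI-SearchBot", "GPTBot", "ChatGPT-User", "CCBot", "Googlebot"]

# Combined log format: ... "METHOD path PROTO" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_bot_requests(log_lines):
    """Count (bot token, status) pairs for lines whose user agent names a listed bot."""
    counts = Counter()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        for token in BOT_TOKENS:
            if token.lower() in agent:
                counts[(token, int(match.group("status")))] += 1
                break
    return counts
```

Counts keyed this way make it easy to compare a week before and a week after a policy change. Because the match is on a self-reported header, pair the counts with IP verification before drawing conclusions about any one crawler.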
This connects the policy to the AI crawler referral measurement layer.
Days 12-14: refresh pages that earn access
If a page is important enough to allow and monitor, it should be useful enough to maintain.
Prioritize pages that:
- answer a specific buyer or operator question;
- include source notes and reviewed dates;
- have clear fit and poor-fit cases;
- link to adjacent implementation pages;
- avoid stale pricing, model, availability, or policy claims.
Crawler policy can make access clearer. It cannot make a weak page valuable.
Failure modes
| Failure | Why it hurts | Better move |
|---|---|---|
| Blocking every AI crawler because it feels safer | The site may lose useful discovery paths without actually securing private data | Separate search, training, user-triggered fetching, and private access |
| Allowing every bot because visibility sounds good | Logs become noisy and rights-sensitive content may be reused in ways the team did not approve | Define a crawler register and review it monthly |
| Relying on robots.txt for secrets | Robots directives are not security controls | Use authentication, authorization, and removal |
| Blocking important resources | Crawlers may understand pages less accurately | Keep CSS, JS, images, and public assets accessible unless there is a clear reason |
| Updating robots.txt without measurement | The team cannot tell whether the change helped or harmed discovery | Compare logs, search coverage, assistant referrals, and qualified sessions before and after |
| Treating Content Signals as enforcement | Some crawlers may ignore preference signals | Pair signals with edge rules, legal policy, and bot controls where needed |
Source notes checked July 2, 2026
| Source | Signal used |
|---|---|
| OpenAI crawler documentation | OpenAI separates OAI-SearchBot, GPTBot, OAI-AdsBot, and ChatGPT-User roles; OAI-SearchBot and GPTBot can be managed independently in robots.txt. |
| Google common crawlers documentation | Google-Extended is a robots.txt token rather than a separate HTTP user agent and does not affect Google Search inclusion or ranking. |
| Google robots.txt guide | robots.txt manages crawler access but is not a secure way to hide pages; disallowed URLs may still appear if linked elsewhere. |
| Cloudflare AI Crawl Control | Cloudflare provides AI crawler visibility, crawler-specific controls, robots.txt compliance monitoring, and pay-per-crawl options. |
| Cloudflare Content Signals Policy | Content Signals can express preferences for search, AI input, and AI training in robots.txt, but should be paired with enforcement controls where needed. |
| Common Crawl CCBot documentation | CCBot identifies itself in its user agent and can be controlled with a CCBot robots.txt rule. |