This article provides an overview of all bots recognised by AI Scrape Protect, divided into the four categories used in the plugin. Each entry explains what the bot does and whether you would want to block it. For more information about the plugin, visit the AI Scrape Protect plugin page.

Version: 5.0
Published: 23 May 2026

Search Engines

These bots are allowed by default. They are responsible for regular search engine indexing. You can disable them individually, but doing so will directly affect your visibility in the relevant search engine.

Googlebot

Google’s primary web crawler, responsible for indexing your site in Google Search. Blocking it will remove your site from Google search results.

Googlebot-Image

Crawls images for Google Images. Blocking it will remove your images from Google Images.

Googlebot-News

Indexes news articles for Google News.

Google-PageSpeed

Used by Google PageSpeed Insights for performance analysis of your site.

Google-Site-Verification

Used by Google Search Console to verify ownership of your site.

Lighthouse

Google’s automated tool for analysing performance, accessibility, and SEO.

Bingbot

Microsoft’s primary web crawler for Bing Search. Blocking it will remove your site from Bing search results.

AI Training

These bots collect web content to train AI models. They are blocked by default. Blocking via robots.txt is effective for most of these bots, though some may ignore robots.txt directives.
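
As a rough illustration of what blocking via robots.txt looks like, the sketch below builds one Disallow group per user agent. The helper name build_robots_rules is hypothetical, and the output is only an assumption about the general shape of such rules, not the plugin's actual output.

```python
# Hypothetical helper: this article does not show the plugin's real robots.txt output.
def build_robots_rules(blocked_agents):
    """Return robots.txt text with one 'Disallow: /' group per user agent."""
    groups = []
    for agent in blocked_agents:
        # Each group names one crawler and disallows the whole site for it.
        groups.append(f"User-agent: {agent}\nDisallow: /\n")
    # Groups are separated by a blank line.
    return "\n".join(groups)


rules = build_robots_rules(["GPTBot", "ClaudeBot", "CCBot"])
print(rules)
```

Rules like these only work for crawlers that choose to honour robots.txt, which is why the note above about bots ignoring directives matters.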

GPTBot

OpenAI’s primary training crawler. Collects web content to train GPT models such as ChatGPT. Respects robots.txt.

GPTBot-Preview

An experimental pre-release variant of GPTBot used by OpenAI to test new data collection methods for future models.

ClaudeBot

Anthropic’s primary training crawler. Collects web content to train Claude. Respects robots.txt.

ClaudeResearchBot

Deployed by Anthropic’s research team to collect datasets specifically for studying safe and responsible AI systems.

AnthropicBot

A crawler operated by Anthropic to collect data for training Claude. Some reports indicate this bot may ignore robots.txt directives.

CCBot

Common Crawl’s web crawler. Builds a publicly available open dataset used as training data by most major language models. Respects robots.txt.

Meta-ExternalAgent

Meta’s AI training and indexing crawler for products across Facebook, Instagram, and WhatsApp. Not to be confused with facebookexternalhit, Meta’s link preview fetcher, which should not be blocked.

Meta-ExternalFetcher

Used by Meta to fetch external content for AI indexing purposes. Distinct from facebookexternalhit, which handles link previews and should not be blocked.

cohere-ai

Cohere’s crawler, used to collect data for training natural language processing models.

cohere-training-data-crawler

A dedicated Cohere crawler focused specifically on collecting training datasets for their AI models.

Amazonbot

Amazon’s crawler used for indexing and AI model training, including Amazon Nova and Alexa. Respects robots.txt.

Amazon-AI

Amazon’s AI-specific crawler used to collect web content in support of services such as Alexa and recommendation systems.

Applebot-Extended

Apple’s opt-out control token for AI training. Blocking this prevents your content from being used to train Apple Intelligence and other Apple generative AI models. Activity has surged significantly since the rollout of Apple Intelligence in 2024.

AI2Bot

Developed by the Allen Institute for AI for academic research and AI model development.

Ai2Bot-Dolma

A variant of AI2Bot used to collect data for the Dolma open dataset, which is used to train open-source language models.

StableDiffusionBot

Crawls web content, particularly images, for training Stable Diffusion and other generative AI image models.

img2dataset

A tool used to download and process large image datasets for AI and machine learning training.

TurnitinBot

Used by Turnitin to collect content for plagiarism detection databases and AI training. Blocking it prevents your content from being included in their detection systems.

DataForSeoBot

Crawls websites to collect large datasets used for SEO analysis, data resale, and AI model training.

Diffbot

An AI-powered web scraper that structures web data for knowledge graph construction and various AI applications.

magpie-crawler

Focuses on content aggregation and building training datasets, primarily for social listening and AI purposes.

sentibot

Likely used for sentiment analysis data collection and AI model training.

Omgilibot / Omgili

Scrapes content from forums and discussion platforms, primarily for market research and AI training datasets.

Webzio-Extended / webzio

Crawlers from Webz.io used to gather web data for content analysis and AI training datasets.

ImagesiftBot

Specialises in crawling and collecting images, potentially for AI training or visual content analysis.

PanguBot

Linked to Huawei’s Pangu AI models. Collects web data for large language model training.

ErnieBot

Baidu’s AI model crawler, used to collect training data for Ernie Bot, Baidu’s large language model.

DeepseekBot

Crawler used by DeepSeek to collect web data for training their AI language models.

ChatGLM-Spider

Crawler used by Zhipu AI to collect training data for ChatGLM, their large language model.

AIMatrixCrawler

Crawls web content for AI matrix and machine learning training purposes.

FirecrawlAgent

A scraper-as-a-service platform used by many different clients and AI applications to extract structured web content. Respects robots.txt but is multi-tenant, meaning many different operators drive the same crawler.

Timpibot

Indexes and fetches data for search and AI applications.

YouBot

Crawler used by You.com to collect content for AI-powered search and model training.

KomoBot

Used by Komo Search to gather data for AI-enhanced search and training purposes.

iAskAI-Crawler

Collects web content to generate answers for the iAsk.ai search and question-answering platform.

PiplBot

Gathers information for Pipl’s people search and identity verification services.

AI Search & Answers

These bots retrieve content for AI-driven search results or direct answers to users. They are blocked by default, but allowing them may help your site appear in AI search engines such as ChatGPT Search or Perplexity.

PerplexityBot

Perplexity AI’s primary crawler used to build its retrieval index for AI-generated search answers. Known for high crawl frequency and a focus on news and high-authority content. Provides no referral traffic back to your site.

Perplexity-User

Fetches pages in real time when a user submits a query to Perplexity AI. Blocking this prevents your content from appearing in Perplexity answers.

OAI-SearchBot

OpenAI’s retrieval crawler for ChatGPT Search. Separate from GPTBot. Allowing this bot while blocking GPTBot lets your site appear in ChatGPT Search without contributing to model training.
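
The combination described here (GPTBot blocked, OAI-SearchBot allowed) can be checked with Python's standard urllib.robotparser module. The robots.txt text and the example.com URL below are illustrative assumptions, not rules generated by the plugin.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block the training crawler, allow the search crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(agent, url="https://example.com/article"):
    """Return True if the parsed rules let this user agent fetch the URL."""
    return parser.can_fetch(agent, url)


print(is_allowed("GPTBot"))         # training crawler
print(is_allowed("OAI-SearchBot"))  # search crawler
```

This only tells you what the rules say; it does not prove that a given crawler actually follows them.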

ChatGPT-User

Fetches pages in real time when a ChatGPT user requests up-to-date information. Blocking this prevents your content from being retrieved for live ChatGPT answers.

Claude-User

Fetches pages in real time when a Claude user requests current information. Blocking this prevents your content from being retrieved for live Claude answers. Does not affect search engine indexing or ranking.

Claude-SearchBot

Anthropic’s retrieval indexing crawler for Claude’s search capabilities. Separate from ClaudeBot, which is used for training.

Google-Extended

Google’s opt-out control token for AI training and Gemini products. Blocking this prevents your content from being used to train Google’s AI models and from appearing in AI-generated summaries, while leaving regular Google Search indexing unaffected.

GoogleOther

A Google crawler used for internal purposes such as AI research and training, not for regular search indexing. Blocking this has no effect on your Google Search rankings.

Google-Agent

Google’s user-triggered fetcher, used when a user asks Gemini or Google AI Overviews for current information. Note: this bot does not respect robots.txt. Blocking via robots.txt is therefore ineffective.
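
For a bot that does not honour robots.txt, the only option is to refuse the request on the server. The sketch below shows the general idea of matching the User-Agent header against a blocklist. The names BLOCKED_TOKENS and status_for_request are hypothetical, and the header values are made up for illustration rather than documented user-agent strings. This is not the plugin's own matching logic.

```python
# Hypothetical blocklist; the plugin's actual matching logic is not described here.
BLOCKED_TOKENS = ("Google-Agent",)


def status_for_request(user_agent_header):
    """Return 403 if the User-Agent header contains a blocked token, else 200."""
    ua = (user_agent_header or "").lower()
    for token in BLOCKED_TOKENS:
        # Case-insensitive substring match on the bot's product token.
        if token.lower() in ua:
            return 403
    return 200


# Illustrative header values only.
print(status_for_request("Mozilla/5.0 (compatible; Google-Agent)"))
print(status_for_request("Mozilla/5.0 (compatible; Googlebot/2.1)"))
```

User-Agent headers can be spoofed, so a check like this is a filter rather than a guarantee.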

DuckAssistBot

DuckDuckGo’s bot for DuckAssist, their AI-driven answer and content summary feature.

OpenAIContentCrawler

Gathers data specifically for OpenAI’s content-related tools and retrieval-augmented generation.

General Indexing

These bots are primarily used for search engine indexing but have known AI or data collection applications. They are allowed by default. Blocking them may affect your visibility in the relevant search engines.
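
The default behaviour of the four categories, together with the option to change individual bots, can be pictured as a small lookup. This is a minimal sketch under stated assumptions: the names CATEGORY_DEFAULTS, BOT_CATEGORY and effective_policy are hypothetical, only one bot per category is listed, and the plugin's internal data model may look quite different.

```python
# Defaults as stated in this article for each category.
CATEGORY_DEFAULTS = {
    "Search Engines": "allow",
    "AI Training": "block",
    "AI Search & Answers": "block",
    "General Indexing": "allow",
}

# One example bot per category; the full list is in the article text.
BOT_CATEGORY = {
    "Googlebot": "Search Engines",
    "GPTBot": "AI Training",
    "PerplexityBot": "AI Search & Answers",
    "YandexBot": "General Indexing",
}


def effective_policy(bot, overrides=None):
    """Return 'allow' or 'block': a per-bot override wins over the category default."""
    overrides = overrides or {}
    if bot in overrides:
        return overrides[bot]
    return CATEGORY_DEFAULTS[BOT_CATEGORY[bot]]


print(effective_policy("GPTBot"))
print(effective_policy("PerplexityBot", {"PerplexityBot": "allow"}))
```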

YandexBot

Yandex’s primary web crawler for search indexing. Also used for AI applications within Yandex services. Blocking it will remove your site from Yandex Search.

Baiduspider

Baidu’s primary web crawler for indexing in Baidu Search, the dominant search engine in China. Also feeds Baidu’s AI products. Blocking it will remove your site from Baidu Search.

Sogou

Sogou’s web crawler for Sogou Search, a major search engine in China now operated by Tencent. Also used for AI data collection.

360Spider

Web crawler used by Qihoo 360’s search engine in China. Also used for AI-related data collection.

PetalBot

Huawei’s web crawler associated with Petal Search. Used for search engine indexing with potential AI data applications.

FacebookBot

Used by Meta for indexing public content for social media search and discovery features. Not to be confused with facebookexternalhit, which handles link previews and should not be blocked.

Bytespider

Associated with ByteDance, TikTok’s parent company. May collect data for AI and content generation tools. ByteDance has no official documentation page for this crawler.

Grok / GrokAI / XAI / XBot

Several user agent variants associated with xAI, the AI company behind Grok. Used for data collection and AI research. xAI has no official documentation page for these crawlers.

For more information about how AI Scrape Protect works, visit the AI Scrape Protect plugin page.


Understanding the Bots Blocked by AI Scrape Protect