This article provides an overview of all bots recognised by AI Scrape Protect, divided into the four categories used in the plugin. Each entry explains what the bot does and whether you may want to block it. For more information about the plugin, visit the AI Scrape Protect plugin page.
Version: 5.0
Published: 23 May 2026
Search Engines
These bots are allowed by default. They are responsible for regular search engine indexing. You can disable them individually, but doing so will directly affect your visibility in the relevant search engine.
Googlebot
Google’s primary web crawler, responsible for indexing your site in Google Search. Blocking it will remove your site from Google search results.
Googlebot-Image
Crawls images for Google Images. Blocking it will remove your images from Google Images.
Googlebot-News
Indexes news articles for Google News.
Google-PageSpeed
Used by Google PageSpeed Insights for performance analysis of your site.
Google-Site-Verification
Used by Google Search Console to verify ownership of your site.
Lighthouse
Google’s automated tool for analysing performance, accessibility, and SEO.
Bingbot
Microsoft’s primary web crawler for Bing Search. Blocking it will remove your site from Bing search results.
AI Training
These bots collect web content to train AI models. They are blocked by default. Blocking via robots.txt is effective for most of these bots, though some may ignore robots.txt directives.
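As an illustration of what blocking via robots.txt looks like, the sketch below generates one "Disallow: /" group per crawler. The helper name build_robots_txt is hypothetical, and the four user agents are simply those this article describes as respecting robots.txt. This is not necessarily how AI Scrape Protect writes its rules.

```python
# Hypothetical helper: builds robots.txt rules that disallow a list of crawlers.
# The bot names come from this article; the function itself is only a sketch.
AI_TRAINING_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Amazonbot"]


def build_robots_txt(blocked_agents):
    """Return robots.txt text with one 'Disallow: /' group per user agent."""
    groups = []
    for agent in blocked_agents:
        # Each group names one crawler and denies it the whole site.
        groups.append(f"User-agent: {agent}\nDisallow: /")
    # Groups are separated by a blank line, as is conventional in robots.txt.
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots_txt(AI_TRAINING_BOTS)
print(robots_txt)
```

Bots that ignore robots.txt are unaffected by rules like these and need to be blocked at the server level instead.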
GPTBot
OpenAI’s primary training crawler. Collects web content to train GPT models such as ChatGPT. Respects robots.txt.
GPTBot-Preview
An experimental pre-release variant of GPTBot used by OpenAI to test new data collection methods for future models.
ClaudeBot
Anthropic’s primary training crawler. Collects web content to train Claude. Respects robots.txt.
ClaudeResearchBot
Deployed by Anthropic’s research team to collect datasets specifically for studying safe and responsible AI systems.
AnthropicBot
A crawler operated by Anthropic to collect data for training Claude. Some reports indicate this bot may ignore robots.txt directives.
CCBot
Common Crawl’s web crawler. Builds a publicly available open dataset used as training data by most major language models. Respects robots.txt.
Meta-ExternalAgent
Meta’s AI training and indexing crawler for products across Facebook, Instagram, and WhatsApp. Not to be confused with facebookexternalhit, Meta’s link preview fetcher, which should not be blocked.
Meta-ExternalFetcher
Used by Meta to fetch external content for AI indexing purposes. Distinct from facebookexternalhit, which handles link previews and should not be blocked.
cohere-ai
Cohere’s crawler, used to collect data for training natural language processing models.
cohere-training-data-crawler
A dedicated Cohere crawler focused specifically on collecting training datasets for their AI models.
Amazonbot
Amazon’s crawler used for indexing and AI model training, including Amazon Nova and Alexa. Respects robots.txt.
Amazon-AI
Amazon’s AI-specific crawler used to collect web content in support of services such as Alexa and recommendation systems.
Applebot-Extended
Apple’s opt-out control token for AI training. Blocking this prevents your content from being used to train Apple Intelligence and other Apple generative AI models. Activity has surged significantly since the rollout of Apple Intelligence began in 2024.
AI2Bot
Developed by the Allen Institute for AI for academic research and AI model development.
Ai2Bot-Dolma
A variant of AI2Bot used to collect data for the Dolma open dataset, which is used to train open-source language models.
StableDiffusionBot
Crawls web content, particularly images, for training Stable Diffusion and other generative AI image models.
img2dataset
A tool used to download and process large image datasets for AI and machine learning training.
TurnitinBot
Used by Turnitin to collect content for plagiarism detection databases and AI training. Blocking it prevents your texts from being included in their detection systems.
DataForSeoBot
Crawls websites to collect large datasets used for SEO analysis, data resale, and AI model training.
Diffbot
An AI-powered web scraper that structures web data for knowledge graph construction and various AI applications.
magpie-crawler
Focuses on content aggregation and building training datasets, primarily for social listening and AI purposes.
sentibot
Likely used for sentiment analysis data collection and AI model training.
Omgilibot / Omgili
Scrapes content from forums and discussion platforms, primarily for market research and AI training datasets.
Webzio-Extended / webzio
Crawlers from Webz.io used to gather web data for content analysis and AI training datasets.
ImagesiftBot
Specialises in crawling and collecting images, potentially for AI training or visual content analysis.
PanguBot
Linked to Huawei’s Pangu AI models. Collects web data for large language model training.
ErnieBot
Baidu’s AI model crawler, used to collect training data for Ernie Bot, Baidu’s large language model.
DeepseekBot
Crawler used by DeepSeek to collect web data for training their AI language models.
ChatGLM-Spider
Crawler used by Zhipu AI to collect training data for ChatGLM, their large language model.
AIMatrixCrawler
Crawls web content for AI matrix and machine learning training purposes.
FirecrawlAgent
A scraper-as-a-service platform used by many different clients and AI applications to extract structured web content. Respects robots.txt but is multi-tenant, meaning many different operators drive the same crawler.
Timpibot
Indexes and fetches data for search and AI applications.
YouBot
Crawler used by You.com to collect content for AI-powered search and model training.
KomoBot
Used by Komo Search to gather data for AI-enhanced search and training purposes.
iAskAI-Crawler
Collects web content to generate answers for the iAsk.ai search and question-answering platform.
PiplBot
Gathers information for Pipl’s people search and identity verification services.
AI Search & Answers
These bots retrieve content for AI-driven search results or direct answers to users. They are blocked by default, but allowing them may help your site appear in AI search engines such as ChatGPT Search or Perplexity.
PerplexityBot
Perplexity AI’s primary crawler used to build its retrieval index for AI-generated search answers. Known for high crawl frequency and a focus on news and high-authority content. Provides no referral traffic back to your site.
Perplexity-User
Fetches pages in real time when a user submits a query to Perplexity AI. Blocking this prevents your content from appearing in Perplexity answers.
OAI-SearchBot
OpenAI’s retrieval crawler for ChatGPT Search. Separate from GPTBot. Allowing this bot while blocking GPTBot lets your site appear in ChatGPT Search without contributing to model training.
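The split between GPTBot and OAI-SearchBot can be checked with Python's standard urllib.robotparser module. The sketch below is a minimal example, assuming a robots.txt with exactly these two groups; the example.com URL and the is_allowed helper are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content: block the training crawler, allow the search crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(agent, url="https://example.com/article"):
    """Return True if the given user agent may fetch the (placeholder) URL."""
    return parser.can_fetch(agent, url)


gptbot_allowed = is_allowed("GPTBot")
searchbot_allowed = is_allowed("OAI-SearchBot")
print("GPTBot allowed:", gptbot_allowed)
print("OAI-SearchBot allowed:", searchbot_allowed)
```

With these rules, the check reports GPTBot as blocked and OAI-SearchBot as allowed. It only confirms what the rules say; it cannot confirm that a crawler actually obeys them.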
ChatGPT-User
Fetches pages in real time when a ChatGPT user requests up-to-date information. Blocking this prevents your content from being retrieved for live ChatGPT answers.
Claude-User
Fetches pages in real time when a Claude user requests current information. Blocking this prevents your content from being retrieved for live Claude answers. Does not affect search engine indexing or ranking.
Claude-SearchBot
Anthropic’s retrieval indexing crawler for Claude’s search capabilities. Separate from ClaudeBot, which is used for training.
Google-Extended
Google’s opt-out control token for AI training and Gemini products. Blocking this prevents your content from being used to train Gemini models and to ground Gemini answers, while leaving regular Google Search indexing unaffected.
GoogleOther
A Google crawler used for internal purposes such as AI research and training, not for regular search indexing. Blocking this has no effect on your Google Search rankings.
Google-Agent
Google’s user-triggered fetcher, used when a user asks Gemini or Google AI Overviews for current information. Note: this bot does not respect robots.txt, so blocking it via robots.txt is ineffective.
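For a bot that ignores robots.txt, the usual alternative is to refuse the request on the server based on the User-Agent header. The sketch below shows that idea in its simplest form. It is an assumption about one possible approach, not a description of how AI Scrape Protect enforces blocks, and the function names and the first sample header are made up for illustration.

```python
# Tokens to refuse at the server level. Only Google-Agent is listed here
# because it is the bot this entry describes as ignoring robots.txt.
BLOCKED_TOKENS = ("Google-Agent",)


def is_blocked(user_agent_header, blocked_tokens=BLOCKED_TOKENS):
    """Return True if any blocked token occurs in the User-Agent header."""
    ua = user_agent_header.lower()
    # Case-insensitive substring match on the product token.
    return any(token.lower() in ua for token in blocked_tokens)


def status_for(user_agent_header):
    """Return the HTTP status a server might send: 403 if blocked, else 200."""
    return 403 if is_blocked(user_agent_header) else 200


# Illustrative header; the real Google-Agent string may differ.
agent_status = status_for("Mozilla/5.0 (compatible; Google-Agent)")
# Regular Googlebot must not be caught by the same rule.
googlebot_status = status_for(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
print(agent_status, googlebot_status)
```

Matching on the full token "Google-Agent" rather than on "Google" keeps Googlebot and the other Google crawlers listed in this article unaffected. User-Agent headers can be spoofed, so a check like this is a filter rather than a guarantee.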
DuckAssistBot
DuckDuckGo’s bot for DuckAssist, their AI-driven answer and content summary feature.
OpenAIContentCrawler
Gathers data specifically for OpenAI’s content-related tools and retrieval-augmented generation.
General Indexing
These bots are primarily used for search engine indexing but have known AI or data collection applications. They are allowed by default. Blocking them may affect your visibility in the relevant search engines.
YandexBot
Yandex’s primary web crawler for search indexing. Also used for AI applications within Yandex services. Blocking it will remove your site from Yandex Search.
Baiduspider
Baidu’s primary web crawler for indexing in Baidu Search, the dominant search engine in China. Also feeds Baidu’s AI products. Blocking it will remove your site from Baidu Search.
Sogou
Sogou’s web crawler for Sogou Search, a major search engine in China now operated by Tencent. Also used for AI data collection.
360Spider
Web crawler used by Qihoo 360’s search engine in China. Also used for AI-related data collection.
PetalBot
Huawei’s web crawler associated with Petal Search. Used for search engine indexing with potential AI data applications.
FacebookBot
Used by Meta for indexing public content for social media search and discovery features. Not to be confused with facebookexternalhit, which handles link previews and should not be blocked.
Bytespider
Associated with ByteDance, TikTok’s parent company. May collect data for AI and content generation tools. ByteDance has no official documentation page for this crawler.
Grok / GrokAI / XAI / XBot
Several user agent variants associated with xAI, the AI company behind Grok. Used for data collection and AI research. xAI has no official documentation page for these crawlers.
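The default behaviour of the four categories described in this article can be summarised as a small lookup. The sketch below is only a way to picture those defaults together with per-bot overrides. The names DEFAULT_POLICY, BOT_CATEGORY, and default_action are hypothetical, the mapping covers one example bot per category, and treating unknown bots as allowed is an assumption rather than documented plugin behaviour.

```python
# Category defaults as stated in this article.
DEFAULT_POLICY = {
    "Search Engines": "allow",
    "AI Training": "block",
    "AI Search & Answers": "block",
    "General Indexing": "allow",
}

# One example bot per category; the real list is much longer.
BOT_CATEGORY = {
    "Googlebot": "Search Engines",
    "GPTBot": "AI Training",
    "PerplexityBot": "AI Search & Answers",
    "YandexBot": "General Indexing",
}


def default_action(bot_name, overrides=None):
    """Return 'allow' or 'block' for a bot, honouring per-bot overrides."""
    overrides = overrides or {}
    if bot_name in overrides:
        # An individual setting takes precedence over the category default.
        return overrides[bot_name]
    category = BOT_CATEGORY.get(bot_name)
    # Assumption: bots not in the mapping are allowed.
    return DEFAULT_POLICY.get(category, "allow")


print(default_action("GPTBot"))
print(default_action("PerplexityBot", overrides={"PerplexityBot": "allow"}))
```

The override in the second call mirrors the case mentioned earlier, where an AI search bot is allowed so that a site can appear in AI search engines.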
For more information about how AI Scrape Protect works, visit the AI Scrape Protect plugin page.


