OpenAI and ChatGPT can use your site’s content to “learn” (and provide answers). If you want to (try to) block them, read this article.
I answer the main question first, then add some further comments.
ChatGPT (and OpenAI’s products, and by extension Bing) uses multiple data sources (datasets) to train its models. According to my research, there are many; for GPT-3, OpenAI’s paper lists at least these: Common Crawl, WebText2, Books1, Books2 and Wikipedia.
The only dataset you can try to act on is Common Crawl.
So, if you want to try to block ChatGPT’s access to your site, you must prohibit it from crawling with a directive in the robots.txt file. Of course, this will only have an effect going forward…
For Common Crawl, the User-agent name to use in the robots.txt file is CCBot.
To prohibit ChatGPT crawling the entire site, you must add these 2 lines:
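These are the standard robots.txt directives for blocking the CCBot crawler site-wide:

```
User-agent: CCBot
Disallow: /
```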
To explicitly allow ChatGPT to crawl the entire site, you must add these 2 lines:
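Using the same syntax, an explicit site-wide authorization for CCBot looks like this:

```
User-agent: CCBot
Allow: /
```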
Of course, adapt this to your situation. Read my guide to the robots.txt file to learn how to block crawling of a directory, a subdomain, or other more specific cases.
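For instance, here is a sketch of blocking CCBot from a single directory only (the directory name below is a placeholder; replace it with the actual path on your site):

```
User-agent: CCBot
Disallow: /private-directory/
```

Everything outside that directory remains crawlable by CCBot under this rule.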
According to Common Crawl documentation:
Read on: below, I explain why this is probably futile…
Since ChatGPT also supports plugins, other bots can crawl your site. This happens when a ChatGPT user asks it to use content located on your site.
In this case, the crawler identifies itself with the User-agent ChatGPT-User.
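To block these plugin crawls across the entire site, the corresponding robots.txt directives are:

```
User-agent: ChatGPT-User
Disallow: /
```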
To explicitly allow ChatGPT plugins to crawl the entire site, you must add these 2 lines:
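The explicit site-wide authorization mirrors the CCBot case, only the User-agent changes:

```
User-agent: ChatGPT-User
Allow: /
```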
No, it is not possible to ensure that your content is not exploited by ChatGPT and OpenAI.
First, your content may already have been used. There is currently no way to remove content from a dataset.
Next, it is almost certain that your content appears in datasets other than Common Crawl.
Finally, there are probably other technical reasons why you cannot guarantee that these AIs will not exploit your content…
Basically, I think it’s normal to want to control whether or not a third party has the right to exploit (for free) the content published on your site.
We have been operating under a sort of tacit agreement between search engines and site publishers: by default, publishers let search engines crawl and index their content in exchange for free visibility in the results pages, and therefore a flow of visitors.
In the case of AI-based tools, if none of their sources are indicated in the response provided to the user, then this type of tacit agreement no longer exists.
I have the impression that with ChatGPT plugins, your content is much more likely to be mentioned (if it has been crawled by these plugins).
I also note that Bing’s conversational search (which leverages ChatGPT) cites its sources (with links), but I get the impression that these are mostly pages found by Bingbot. If that is the case, blocking ChatGPT has no effect here. But is excluding your site from these tools really the best move? Isn’t this also the future of search? And if these tools do start citing their sources, not being there becomes a weakness in your search marketing strategy.