In August 2023, OpenAI introduced a web crawler called GPTBot. It has been touted as a ground-breaking technology to support AI models like GPT-4 and, eventually, successors such as GPT-5.
However, many major websites have blocked GPTBot from crawling their pages. The reason?
They're worried about their content being copied or used to train AI models without their permission.
The Power of Large Language Models
Large language models like ChatGPT have revolutionized the way we interact with technology. These models rely on massive amounts of data to train their systems, enabling them to respond to user queries in a manner that closely mimics human language patterns.
However, the sources of this data are rarely disclosed, raising concerns about copyright infringement.
The Role of Robots.txt Files
The decision to block GPTBot is evident in the robots.txt files of these websites.
Robots.txt files serve as directives to web crawlers, specifying which pages they are permitted to access.
Major websites, reportedly around a quarter of the top internet destinations, have opted to deny access to GPTBot as a precautionary measure to safeguard their copyrighted content. This move reflects their concern that OpenAI might use their data without compensation or proper attribution.
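Opting out is a two-line change. A site that wants to turn the crawler away adds an entry for the `GPTBot` user-agent string that OpenAI documents to its robots.txt (the comment is illustrative):

```
# Block GPTBot from the entire site
User-agent: GPTBot
Disallow: /
```

A site can also allow some sections and block others by listing `Allow` and `Disallow` paths under the same user-agent entry.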
OpenAI, acknowledging these concerns, has attempted to clarify the benefits of allowing GPTBot access to websites. In a blog post, the organization stated: "allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety."
The Dilemma for SEOs
For search engine optimization specialists (SEOs), the decision of whether to block or permit AI crawlers like GPTBot is far from straightforward.
These AI systems do not cite or link to their information sources, which has raised ethical and operational questions.
Allowing search engines to crawl content can drive traffic through direct links and citations, but it also exposes a site to AI-driven content scraping.
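The robots.txt directives that govern this choice are machine-readable, and compliant crawlers consult them before fetching a page. A minimal Python sketch using the standard library's `urllib.robotparser` shows how a single robots.txt can admit a search-engine bot while turning GPTBot away (the robots.txt content and URLs here are illustrative, not taken from any real site):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block GPTBot everywhere, allow all other crawlers.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler calls can_fetch() before requesting a URL.
print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True
```

Note that robots.txt is purely advisory: a well-behaved crawler like GPTBot honours it, but nothing technically prevents a non-compliant scraper from ignoring it.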
Websites That Block GPTBot
As of the latest data, 12 prominent websites have joined the ranks of those blocking GPTBot.
These websites, many of which are renowned for publishing news and information, include Pinterest, Indeed, The Guardian, ScienceDirect, USA Today, Stack Exchange, Alamy, WebMD, Dictionary.com, The Washington Post, NPR, and CBS News.
Favouring CCBot: Preferential Treatment?
Comparatively, Common Crawl's web crawler, CCBot, faces fewer restrictions, with just 130 websites blocking its access.
Common Crawl plays a pivotal role in providing training data for OpenAI, Google, and other entities involved in AI development. Interestingly, 109 of the top 1,000 websites have chosen to block both GPTBot and CCBot, reflecting a broad reluctance to grant access to web crawlers.
Limitations and Conclusion
It’s essential to acknowledge that this analysis has limitations. Sixty-seven of the 1,000 websites surveyed had robots.txt files that could not be identified or inspected, leaving uncertainty about the true extent of GPTBot blocking.
In conclusion, the blocking of GPTBot by major websites underscores the ongoing tension between the advancement of AI technology and the protection of intellectual property rights. While AI models like ChatGPT hold great potential for various applications, their lack of attribution and citation has raised concerns among content providers.
As AI continues to evolve, finding a balance between innovation and copyright protection remains a complex challenge that will require ongoing discussion and collaboration between technology developers and content creators.