OpenAI has launched a web crawler called GPTBot to collect data for improving future AI models.
GPTBot will reportedly respect paywalls: it will not collect content that requires payment, nor data that can be traced to an individual’s identity.
In addition, OpenAI leaves it to website operators to decide whether their site's data is available for GPTBot to crawl. They can modify their robots.txt file, or block the crawler's IP addresses, to prevent GPTBot from scraping their website.
Modifying robots.txt is one option, but the opt-out process could be more convenient and transparent, for example by telling site owners more about what the collected data will be used for.
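As a rough illustration of the robots.txt route, the sketch below shows what an opt-out rule might look like and checks it with Python's standard-library robots.txt parser. It assumes the crawler identifies itself with the user-agent token "GPTBot", taken here from the crawler's name. The example.com URL and the is_allowed helper are hypothetical and exist only for this example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt contents: deny the whole site to GPTBot, allow everyone else.
# Assumption: the crawler matches the user-agent token "GPTBot".
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"  # hypothetical page
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", url))
    print("Other crawler allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", url))
```

A check like this only shows how a crawler that honors robots.txt would read the rules. It does not enforce anything, which is why blocking by IP address remains a separate option.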
Previously, OpenAI’s practice of scraping publicly available data to train proprietary AI models was controversial. Sites like Reddit and Twitter have taken steps to crack down on AI companies’ free use of their users’ posts, while some authors and other creators have filed lawsuits for allegedly unauthorized use of their work.