Anthropic is accused of “aggressive” data scraping

Several companies have denounced the “aggressive” behavior of Anthropic’s web crawler, which visits websites up to millions of times a day to collect data.

To make AI models intelligent, large amounts of data are required. It is now an open secret that the data comes from the internet. AI companies like OpenAI and Anthropic have “web crawlers” that search the internet and collect publicly available information. In theory, this practice is not illegal, although Anthropic seems to go quite far in this regard.

Kyle Wiens, CEO of iFixit, called out Anthropic in a post on X. Anthropic’s web crawler reportedly visited the site a million times in 24 hours. It gets worse: the website Freelancer.com recorded 3.5 million visits from Anthropic’s crawler in just four hours.

Hey @AnthropicAI: I understand that you are hungry for data. Claude is really smart! But do you really need to reach our servers a million times in 24 hours?

Not only are you taking our content without paying, you’re also tying up our development resources. Not cool.

— Kyle Wiens (@kwiens) July 24, 2024

Rules of the Internet

Both iFixit and Freelancer.com condemn Anthropic’s “aggressive” way of crawling the internet. Besides the fact that Anthropic is taking their content, excessive activity from web crawlers can overload servers.

On Freelancer.com, things got so bad that the web administrators even had to blacklist Anthropic. “They’re breaking the rules of the internet,” CEO Matt Barrie told the Financial Times. Anthropic responded that it is investigating the complaints and has no intention of being intrusive.

Makers of large AI models have been under fire for some time for the way they handle public data on the internet. Industry players argue that whatever is public can be used to train models, but this reasoning is not entirely sound: content that is publicly accessible can still be protected by copyright.

Licensing agreements have now been signed between AI companies and news media or large internet platforms like Reddit, which manage and own a lot of content. In this way, AI companies hope to avoid future lawsuits. Anthropic has not yet entered into such agreements.

Robots.txt

As a web administrator, you can deny web crawlers access to your website. Placing a robots.txt file in your website’s root directory puts up a stop sign for web crawlers. However, the system is far from watertight. In fact, it’s quite easy for web crawlers to get around it by “masquerading” as legitimate website visitors.
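As a rough illustration of how such a stop sign works, and why it is easy to get around, the sketch below uses Python's standard `urllib.robotparser` module to evaluate a sample robots.txt. The user-agent token `ClaudeBot`, the example URL and the helper `is_allowed` are assumptions made for this illustration and do not come from the companies mentioned above; check the crawler operator's documentation for the token it actually uses.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content. "ClaudeBot" is an assumed user-agent token
# used for illustration only.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/guide"  # hypothetical page
    # A crawler that announces itself under the blocked token is refused.
    print(is_allowed(ROBOTS_TXT, "ClaudeBot", url))
    # The same request sent under a browser-like user agent matches only
    # the wildcard rule and is permitted.
    print(is_allowed(ROBOTS_TXT, "Mozilla/5.0", url))
```

The rules match only on the name a crawler chooses to announce, and nothing forces a crawler to read the file at all, so robots.txt depends on voluntary compliance.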

Source: IT Daily

Mary
