Imagine being a diligent open-source community admin, quietly maintaining servers that support developers day in and day out. Then one day your traffic logs swarm with uninvited guests: not hackers, not hobbyist scrapers, but AI training bots. These digital locusts devour your bandwidth and grind page loads to a halt. This isn't a sci-fi dystopia; it's the harsh reality facing small and medium internet services today.
The New DDoS Threat: AI Crawlers
This Monday, SourceHut, an open-source Git hosting platform, posted a desperate plea on its status page: "We're under relentless assault by aggressive LLM (Large Language Model) crawlers, repeatedly disrupting services." To fight back, it deployed Nepenthes, a tarpit designed to snare AI scrapers, and outright banned IPs from cloud providers such as Google Cloud and Microsoft Azure, the epicenters of its crawler traffic.
SourceHut admits these measures provide temporary relief but risk collateral damage to legitimate users, degrading their experience.
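SourceHut hasn't published the exact mechanics of its ban, but the underlying technique is simple: major clouds publish their IP ranges as machine-readable feeds, and a site can turn such a feed into web-server deny rules. Here is a minimal Python sketch, assuming Google's documented cloud.json range feed (the output filename is illustrative):

import json
import urllib.request

# Google Cloud's published IP ranges; Azure ships similar data as
# downloadable "Service Tags" JSON.
CLOUD_RANGES_URL = "https://www.gstatic.com/ipranges/cloud.json"

def fetch_cloud_prefixes(url=CLOUD_RANGES_URL):
    """Return every IPv4/IPv6 prefix listed in the range feed."""
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [e.get("ipv4Prefix") or e.get("ipv6Prefix") for e in data["prefixes"]]

def to_nginx_denylist(prefixes):
    """Emit an nginx snippet; include it from a server block to drop the ranges."""
    return "\n".join(f"deny {p};" for p in prefixes)

if __name__ == "__main__":
    with open("cloud_denylist.conf", "w") as f:
        f.write(to_nginx_denylist(fetch_cloud_prefixes()) + "\n")

Regenerating the file on a schedule keeps the blocklist current as providers add ranges; the blunt-instrument downside is exactly the collateral damage SourceHut describes.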
This isn't SourceHut's first rodeo with "crawler DDoS." In 2022, they publicly slammed Google's Go Module Mirror for its unchecked requests, likening it to a "denial-of-service attack." Now, as generative AI sweeps the globe, similar tales echo across the internet.
Repair guide iFixit complained last July that Anthropic's ClaudeBot was over-crawling its site. Cloud host Vercel revealed in December 2024 that OpenAI's GPTBot made 569 million requests in a single month while Anthropic's ClaudeBot clocked 370 million; combined, that's about 20% of Googlebot's traffic over the same period. Diaspora developer Dennis Schubert even disclosed that 70% of his server's traffic over 60 days came from LLM training bots.
Why are AI crawlers so aggressive? Simple: data hunger. The rise of generative AI has OpenAI, Anthropic, Google, and others scrambling to feed their models with internet content. Whether it's ChatGPT's conversational skills or Claude's reasoning, it all demands massive data. But this "take-what-you-can" approach is crushing smaller services, morphing into a silent DDoS crisis.
Top 10 AI Crawlers by "DDoS Attack Power Index"
To visualize the damage, GoUpSec compiled a ranking based on recent reports and public data. The index factors in request volume, website coverage, ban frequency, and service impact. Here's the list:
1. Bytespider (ByteDance)
Index: 95
Cloudflare data shows it leads all AI crawlers in request volume and site coverage, hoarding data for ByteDance's AI. Frequent bans highlight its aggressive tactics.
2. GPTBot (OpenAI)
Index: 90
569 million monthly requests is a staggering figure. Despite OpenAI's promise to respect robots.txt, complaints about impersonation and overload keep it in second place.
3. ClaudeBot (Anthropic)
Index: 85
370 million monthly requests, plus iFixit's million-request day, make it a "gentle killer": a lower ban rate, but potent impact.
4. Amazonbot (Amazon)
Index: 80
Built to index content for Alexa, but developers decry the overload it causes. Suspected spoofing adds to the concerns.
5. Google-Extended (Google)
Index: 75
A robots.txt token covering Google's AI training; 13.6% of top sites ban it. Google's dual role (search plus AI) leaves sites in a dilemma.
6. Applebot (Apple)
Index: 70
Transparent about its crawling, but DoubleVerify links it, alongside GPTBot and ClaudeBot, to 16% of 2024's invalid traffic.
7. Meta AI Bot (Meta)
Index: 65
Crawls ambitiously for Meta's AI, but its multi-purpose use complicates bans.
8. CCBot (Common Crawl)
Index: 60
An open-source mainstay; its 22.1% ban rate shows widespread impact, though it is less aggressive than its commercial peers.
9. OAI-SearchBot (OpenAI)
Index: 55
A newcomer already banned by 14 media outlets. Its impact has yet to peak.
10. Perplexity AI Bot (Perplexity)
Index: 50
The AI search upstart disguises its crawler as an ordinary browser to sneak past blocks, angering admins.
Fighting "Freeloading": Surrender or Resist?
AI crawlers are nominally bound by the web's unwritten rules. In August 2023, OpenAI pledged that GPTBot would obey robots.txt, and other vendors followed suit. Yet reality shows that promises do not equal action. Schubert found his logs flooded with fake GPTBots spoofing IPs from AWS and residential networks: trolls exploiting the chaos. DoubleVerify reports that AI-driven "General Invalid Traffic" surged 86% in late 2024, with 16% of it attributed to GPTBot, ClaudeBot, and Applebot.
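One practical defense against impostors: OpenAI publishes the IP ranges GPTBot actually crawls from, so a site can check whether a request claiming to be GPTBot really originates there. A minimal Python sketch follows; the URL and JSON shape are assumptions taken from OpenAI's bot documentation at the time of writing:

import ipaddress
import json
import urllib.request

# Published egress ranges for GPTBot; URL and schema assumed current as
# of writing -- verify against OpenAI's bot documentation.
GPTBOT_RANGES_URL = "https://openai.com/gptbot.json"

def load_gptbot_networks():
    with urllib.request.urlopen(GPTBOT_RANGES_URL) as resp:
        data = json.load(resp)
    return [
        ipaddress.ip_network(e.get("ipv4Prefix") or e.get("ipv6Prefix"))
        for e in data["prefixes"]
    ]

def is_genuine_gptbot(client_ip, networks):
    """True only if the claimed GPTBot request comes from a published range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)

# A request whose User-Agent says "GPTBot" but which fails this check is
# an impostor and can be dropped outright.
networks = load_gptbot_networks()
print(is_genuine_gptbot("203.0.113.7", networks))  # RFC 5737 doc IP -> False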
This reflects a power struggle: AI firms need the internet's "food," while site owners face bandwidth theft, privacy risks, and copyright issues. SourceHut's cloud IP bans and iFixit's robots.txt updates are passive defenses. Trickier still are dual-role crawlers like Googlebot, which force sites to choose between blocking AI and losing search visibility (Editor's note: the Google-Extended token Google introduced in 2023 allows targeted blocking of AI training without hurting SEO).
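Concretely, the robots.txt route looks like this: each vendor's training crawler honors, at least in principle, its own user-agent token, so a site can opt out of AI training while staying indexed for search. A sketch covering the bots named above:

# Opt out of AI training crawlers while leaving ordinary search alone.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Blocks use of the site for Google's AI training without touching Googlebot.
User-agent: Google-Extended
Disallow: /

Googlebot itself is left unmentioned, so normal search indexing, and with it SEO, is unaffected; the catch, as the log evidence above shows, is that this only works against crawlers that actually honor the file.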
The "DDoS effect" of AI crawlers exposes generative AI's hidden costs. For small services, it's existential. Tools like SourceHut's Nepenthes trap and Cloudflare's AI crawler blocker show tech communities fighting back. But long-term, bans aren't solutions.
The future likely lies in balance: AI companies need transparent data policies, perhaps licensing or payment models, while site owners must weigh protection against participation in the AI ecosystem. Otherwise, this cat-and-mouse game will only further degrade the health of the internet.