BOT-WAR COMMAND CENTER “Crazy amount of guests”

47 pages · posts #1–#928 · Started 2025-10-13 (Levina) · last post 2026-07-04 · LOCKED · capture 2026-07-05
xenforo.com/community · Forum management · verbatim raw/ + notes/ archive

About this summary

Citations are forum post numbers (e.g. #322). Where a claim is a specific user's, the username is given.

Executive summary

Levina, running a small photography forum on XenForo Cloud, opens the thread when guests jump from a few hundred to 4,800+ — mostly from Brazil, Vietnam, and Singapore, many showing "Viewing unknown page" (#1). The community quickly rules out a classic DDoS and diagnoses poorly-behaved AI/LLM crawlers scraping content to train models (#2, #11, #14). What starts as one admin's problem becomes the forum's definitive reference thread on bot mitigation.

The discussion splits into recurring, well-defined camps that persist for 47 pages:

  • "Shed it at the edge" (Cloudflare-first): managed-challenge rules, Under Attack Mode, guest edge-caching, and (for those who can afford it) Enterprise Bot Management. Champions: Anthony Parsons, eva2000, wwillson, digitalpoint, and eventually Chris D (XenForo staff).
  • "Fight it at the server" (self-host / app-side): fail2ban, iptables/CSF, .htaccess ASN/CIDR deny-lists, proof-of-work (Anubis), and purpose-built systems. Champions: ES Dev Team, BrettC, smallwheels, dutchbb.
  • "Absorb / just tune it": a well-optimized server makes bot load a non-issue; chasing IPs is endless. Champion: Anthony Parsons (with eva2000, JustinHawk).
  • Add-on authors who ship concrete tools: zeeb0t (XF Surge Guard → Bot Guard), Osman ([XTR] IP Threat Monitor), digitalpoint (App for Cloudflare edge caching), Sim (KnownBots).

The core technical realization across the thread is that the enemy shifted from datacenter ASNs (blockable) to residential proxies — compromised Android-TV boxes, SDK-embedded apps, and routers that rotate IPs and make one request per IP, defeating rate-limiting, UA filtering, CAPTCHAs, and country/ASN blocks. Detection tools (proxycheck.io, Cloudflare's free tiers) catch only a fraction (~10–50%) of RESIPs.

The thread's emotional peak comes late: xenforo.com's own forum is overrun (~190k–200k guests) and only recovers when XenForo enables Cloudflare Under Attack Mode (#726, #729, #769). This fuels a heated responsibility debate: smallwheels argues XenForo should publish best practices and build app-level behavioural sensors; Chris D (XF developer) counters that mitigation belongs at the edge, not in a database-driven app ("a fool's errand," #911, #917) and that privacy/GDPR blocks shipping fingerprinting in core; Anthony Parsons says it's the add-on market's job, not XenForo's (#908, #922). zeeb0t reframes bot traffic as "one of the most important problems facing the public web" where no single layer solves everything (#925); the thread de-escalates and locks at #928 on a conciliatory final post by smallwheels.

The four camps

Shed it at the edge

Anthony Parsons · eva2000 · wwillson · digitalpoint · Chris D

Cloudflare-first: managed challenges, Under Attack Mode, guest edge caching, Enterprise Bot Management for those who can afford it.

Fight it at the server

ES Dev Team · BrettC · smallwheels · dutchbb

fail2ban, iptables/CSF, ASN/CIDR deny-lists, Anubis proof-of-work, purpose-built systems. “If nobody fights it, we lose the indie internet.”

Absorb / tune it

Anthony Parsons · eva2000 · JustinHawk

A well-optimized server makes bot load a non-issue; chasing IPs is endless. ~100 users/sec/core; 1M uniques on a $12 Linode.

Ship add-ons

zeeb0t · Osman · digitalpoint · Sim

Concrete tools: Bot Guard, [XTR] IP Threat Monitor, App for Cloudflare edge caching, KnownBots.

Topic map: Crazy amount of guests (928 posts · 9 topics)

  • 1. Traffic & RESIPs: RESIPs make one request per IP; US is the #1 source; Amazon 9% + 6.1%; /search & proxy.php scraped; IPv6 surge → app layer
  • 2. Cloudflare: xf_user cookie + /search rules; Under Attack Mode (automatable); Enterprise JA3/JA4 ~$2k/mo; Pay-Per-Crawl & monopoly critique
  • 3. Server-side: fail2ban → iptables; CSF · ASN & country deny-lists; CentminMod + Redis tuning; stub-ASN forensics
  • 4. Proof-of-work: Anubis difficulty 4–16; 0% scraper success @ level 5; Markov tarpits · poisoning; CF AI Labyrinth
  • 5. Add-ons: App for Cloudflare (edge cache); [XTR] IP Threat Monitor; Bot Guard ($0, behavioural); KnownBots (232k UAs)
  • 6. Detection frontier: TLS/JA4 + FingerprintJS; web-bot-auth draft spec; php2ban → shared reputation net; behavioural clustering
  • 7. Ethics & SEO: AI-search vs training bots; Bright Data court rulings; registration walls (regs ×3–4); ad & analytics pollution
  • 8. Responsibility: Chris D: edge, not app; smallwheels: app sensors; Anthony: add-on market’s job; locked at #928
  • 9. Meta: AI summaries: ChatGPT-5.5 experiment; “AI slop” critiques; prompt technique (#618)

Chronological arc

Phase | Pages / posts | Dates | What happens
Onset & diagnosis | p1–3 / #1–60 | Oct–Dec 2025 | Levina's surge; ruled a crawler wave not DDoS; first tools named (KnownBots, Cloudflare, fail2ban); early Cloudflare recipes (RippC's Brazil challenge #30, wwillson's UA rule #37).
Edge-cache breakthrough | p4–5 / #61–100 | Dec 2025 | Andy.N's 37k-guest / 488%-CPU crisis solved by digitalpoint's edge caching (#74–75). ES Dev Team reveals "php2ban" design (#93). Live Cloudflare dashboard outage (#95–100).
RESIP escalation & tooling | p6–16 / #101–320 | Dec 2025 – Mar 2026 | Residential proxies become the central theme; Anubis proof-of-work deep-dives (BrettC); ASN blocklists; stub-ASN forensics; court-ruling debate; RESIP-vendor survey.
Add-on era | p17–32 / #321–640 | Mar–May 2026 | Osman's IP Threat Monitor and zeeb0t's Surge Guard → Bot Guard released and iterated; Anthony's /search cookie rule widely adopted; the "fight vs absorb" and Cloudflare-monopoly debates intensify; members test ChatGPT summarizing the thread (#592–619).
Technical core & self-attack | p33–45 / #641–900 | May–Jun 2026 | Live attack on xenforo.com itself (~190–200k guests); Anthony's CentminMod/Redis/Elasticsearch benchmarks; IPv6 surge; TLS fingerprinting; ES Dev Team's shared IP-reputation network; Cloudflare Pay-Per-Crawl / monopoly critique.
Responsibility flare-up & close | p46–47 / #901–928 | Jul 2026 | Chris D (XF) enters; edge-vs-app clash with smallwheels; Anthony vs "aggressive person"; zeeb0t de-escalates; thread locked at #928.
Phase 1 · Oct–Dec 2025 · p1–3 · #1–60

Onset & diagnosis

Levina’s photography forum jumps from a few hundred guests to 4,800+. The community rules out DDoS and diagnoses poorly-behaved AI/LLM crawlers. First tools named: KnownBots, Cloudflare, fail2ban.

  • #1 Levina opens: 4,800+ guests from Brazil/Vietnam/Singapore
  • #14 Digital Doctor: the real cost is free training data
  • #30 RippC’s Brazil challenge “stopped it dead”
  • #37 wwillson’s UA rule: 20,000 → 6,000 guests
Phase 2 · Dec 2025 · p4–5 · #61–100

Edge-cache breakthrough

Andy.N’s 37k-guest / 488%-CPU crisis is solved almost instantly by digitalpoint’s Cloudflare guest edge caching — the thread’s most dramatic single fix.

  • #63 Andy.N: 37k guests, mariadb at 488% CPU
  • #74–75 Edge caching drops load to single digits
  • #93 ES Dev Team reveals “php2ban” design
  • #95–100 live Cloudflare dashboard outage
Phase 3 · Dec 2025 – Mar 2026 · p6–16 · #101–320

RESIP escalation & tooling

Residential proxies become the central theme: one request per IP, ~10% detectable. Anubis proof-of-work deep-dives, ASN blocklists, stub-ASN forensics, and the scraping court-ruling debate.

  • #147 smallwheels: permanent stream of residential proxies
  • #157 30–50%+ of IPs are RESIPs; ~4 pools resold
  • #179 Anubis level 5: 0% scraper success in a 7-day flood
  • #259 Bright Data rulings: scraping public data is generally legal
Phase 4 · Mar–May 2026 · p17–32 · #321–640

Add-on era

Osman’s IP Threat Monitor and zeeb0t’s Surge Guard → Bot Guard ship and iterate. Anthony’s /search + xf_user cookie rules are widely adopted. Fight-vs-absorb and Cloudflare-monopoly debates intensify.

  • #322 cookie rule: 1.5M events / ~2k solves in 24h
  • #358 /search rule: 1M+ → ~150k daily uniques
  • #510/#572 Surge Guard → Bot Guard released ($0)
  • #592–619 ChatGPT thread-summary experiment → “AI slop” critiques
Phase 5 · May–Jun 2026 · p33–45 · #641–900

Technical core & self-attack

xenforo.com itself is overrun (~190–200k guests) and recovers only with Under Attack Mode. CentminMod/Redis/Elasticsearch benchmarks, IPv6 surge, TLS fingerprinting, shared IP-reputation network.

  • #726/#769 xenforo.com ~200k guests → 61 after UAM
  • #705–719 Anthony’s benchmarks: 1M uniques on 2-core/4GB
  • #851 shared cross-site IP-reputation network proposal
  • #845 IPv6 ~60/40 split; per-IP blocking hopeless
Phase 6 · Jul 2026 · p46–47 · #901–928

Responsibility flare-up & close

Chris D (XenForo) enters: mitigation belongs at the edge, app-layer solving is “a fool’s errand”, GDPR blocks core fingerprinting. smallwheels argues for app-level behavioural sensors. zeeb0t de-escalates; the thread locks.

  • #909–917 Chris D: edge, not app; open to guest-caching docs
  • #910/#916 smallwheels: sensors ≠ blocking; publish best practices
  • #908/#922 Anthony: it’s the add-on market’s job
  • #925–928 zeeb0t reframes; locked on smallwheels’ conciliatory note

1. Characterizing the traffic: AI scrapers, not DDoS

  • Diagnosis (#1–2, #10–11, #14): Levina's "Viewing unknown page" guests from Brazil/Vietnam are crawlers, not an attack (past attacks crashed the site; this doesn't). Digital Doctor (#14): the real cost is handing AI companies your data for free.
  • Geographies: Brazil, Vietnam, Singapore, China dominate early; later the US becomes #1 (~4× any other country) for residential proxies (smallwheels #209, #241). Germany/UK/Canada rise as ASNs spread IPs across countries. Alibaba/ByteDance route LLM-training jobs to Singapore/Malaysia after the US tightened Nvidia H20 export controls (ES Dev Team #60, citing FT).
  • Residential proxies (RESIPs) — the crux: shift from single-IP datacenter megawaves to a permanent stream of residential proxies (smallwheels #147). Signature: each IP does ONE request, raw content only (no JS/CSS/images) — identifiable only in hindsight. 30–50%+ of visiting IPs are RESIPs (#157, #213); only ~10% detectable (#157). Sourced via hidden TOS in free apps, SDK proxy functions, IoT infections, and cheap Chinese Android-TV sticks (#157; Wired 2024 / Krebs 2025 cited #242–243). Effectively ~4 IP pools resold under many brands (#157).
  • Endpoints scraped: /search/ with a username (username de-anonymization) is the primary vector (Anthony #322, #354, #358); also /whats-new/, /find-new/, XF image proxy proxy.php (lazy llama #158), /misc/style-variation as a precursor to mass GETs (BrettC #746), bare thread-number enumeration GET /forums/threads/385472/ (lazy llama #757), /posts/N/bookmark & /report (Jake B. #891), profiles/attachments on R2 storage (puterfixer #503). Top paths one day: /search/ 23,603, / 13,144, /whats-new/posts/ 11,524 (BrettC #747).
  • Top named sources: Amazon/AWS = #1 bot source — one Amazon ASN = 9% of all worldwide bot traffic, a second = 6.1% (smallwheels #591, per CF Radar). OpenAI/GPTBot "extremely aggressive, chains requests" (BrettC #746). ClaudeBot, Amazonbot, Applebot rising (eva2000 #508). ByteDance/Bytespider/BytePlus (stopped self-identifying, #144). facebookexternalhit mystery UA from residential IPs (#521–523). Fake Googlebots (#671).
  • IPv6 surge (p41–45): mobile clients IPv6-native (~60/40 split by 2026, BrettC #845); WebNX IPv6 dictionary/rainbow attacks on DNS (#841); chillibear (#843): blocking must move to the application layer because the underlying IPv4 proxy connection "looks good."
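
The RESIP signature described above (one request per IP, raw content only, no JS/CSS/images) can be measured after the fact from an access log. The following is a minimal sketch, assuming the log has already been parsed into (ip, path) pairs; the function name, the asset-suffix list, and the sample rows are illustrative assumptions, not something posted in the thread.

```python
from collections import defaultdict

# File suffixes treated as page assets (an assumption, not from the thread).
ASSET_SUFFIXES = (".js", ".css", ".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg", ".woff2")


def one_shot_share(requests):
    """Return (one_shot_ips, share) for an iterable of (ip, path) pairs.

    An IP counts as "one-shot" when it made exactly one request and that
    request was not for an asset, mirroring the signature described above.
    """
    paths_by_ip = defaultdict(list)
    for ip, path in requests:
        paths_by_ip[ip].append(path)

    one_shot = {
        ip
        for ip, paths in paths_by_ip.items()
        if len(paths) == 1 and not paths[0].lower().endswith(ASSET_SUFFIXES)
    }
    share = len(one_shot) / len(paths_by_ip) if paths_by_ip else 0.0
    return one_shot, share


# Hypothetical sample rows using documentation IP ranges.
sample = [
    ("203.0.113.5", "/threads/example.1/"),    # single raw-content hit
    ("203.0.113.6", "/search/123/"),           # single raw-content hit
    ("198.51.100.7", "/threads/example.1/"),   # browser-like: page plus assets
    ("198.51.100.7", "/js/xf/core.min.js"),
    ("198.51.100.7", "/styles/default/logo.png"),
]
ips, share = one_shot_share(sample)
print(sorted(ips), round(share, 2))
```

As the thread notes, this only identifies such IPs in hindsight: by the time an IP is known to be one-shot, it has already left. The share is still useful for sizing the problem against the 30–50%+ figure quoted above.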

2. Cloudflare mitigations

  • RippC's Brazil interactive challenge (#30): "stopped it dead."
  • wwillson's UA rule (#37): block non-listed IPs whose UA contains GPTBot/PerplexityBot/AppleBot/bingbot/etc.; ~75% of those UAs are forged; never block Googlebot (use CF "Verified bots: Allow"). Guests 20,000 → 6,000.
  • Anthony Parsons' rules (widely adopted):
    • Managed-challenge any .php not in XF's allowlist (#312).
    • Cookie rule: (not http.cookie contains "xf_user=" and not cf.client.bot) → Managed Challenge; 1.5M events / ~2k solves in 24h (#318, #322).
    • /search rule: (http.request.uri.path contains "/search/" and not http.cookie contains "xf_user=") → Managed Challenge; cut daily uniques 1M+ → ~150k, steady at ~110–140k once the origin was locked to CF-only IPs (#358).
    • Lock origin to CF IPs only via .htaccess; runs everything on CF free tier (#316, #330).
  • zeeb0t's 7-part rule set (#511): skip admin/robots; block sensitive XF paths; managed-challenge fake search/social bots and non-beneficial crawlers; challenge unauthenticated guest GETs to /threads/ /whats-new/ etc.; rate-limit guests >20/min. "Managed Challenge first, hard-block only when confident."
  • Wildcat Media's 5-rule "nuclear" setup (#374): whitelist staff/xf_user; block continents; block AI bot categories; allow good bots; challenge everyone else. Plus Zero Trust / Access on admin+install dirs (#409).
  • Under Attack Mode (UAM): blanket JS challenge; credited for crushing xenforo.com's guest spikes (#729, #769) and webbouk's 30k→430 (#697); but frustrating (Turnstile silently expires, losing long post drafts — smallwheels #776). Several members automate UAM toggling via the CF API on guest spikes (z3r010 #376, wolfgangm #387, ES Dev Team #849).
  • Rule-order gotcha (z3r010 #414): UAM fires before custom Skip rules, so logged-in users still get challenged; fix with invisible Turnstile pre-clearance. Apple Private Relay (Safari) implicated in repeat challenges (#417, #422).
  • Tiers & Bot Management: Free = Bot Fight Mode; Pro/Biz = Super Bot Fight Mode + WAF custom rules; Enterprise-only = full Bot Management with JA3/JA4 TLS fingerprinting and the ML residential-proxy model (eva2000 #518, #642). Enterprise starts ~$2,000/month (ES Dev Team #828). Cloudforce One offered against RESIPs (#642), read by lazy llama as a "protection scheme" (#659).
  • AI Labyrinth (#237, #700): CF routes bad bots into an endless maze; Wildcat enables it "to refuse to feed the oligarchs."
  • Pay-Per-Crawl / HTTP 402 / monopoly critique: CF's closed-beta pay-per-crawl uses HTTP 402 (Wildcat #428); if it reaches forums, expect "~$0.10 per 10,000,000 requests" (lazy llama #366). smallwheels repeatedly frames CF as a de-facto monopoly / middleman that wants to "commercialize content" while ignoring the residential-proxy swarm and hosting the very proxy vendors it fights (#300, #367, #568, #876, #882). ES Dev Team (#879): "the fate of the internet is in their hands."
  • Guest edge caching (digitalpoint): see Add-ons §5 — the single most dramatic server-load fix in the thread.
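
Anthony's cookie and /search rules above are simple boolean predicates, so they can be dry-run against logged requests before being enabled at the edge. The sketch below re-expresses the two quoted expressions in Python; the request dictionaries and the `verified_bot` flag (a stand-in for Cloudflare's `cf.client.bot` field) are hypothetical, and the real matching happens inside Cloudflare, not in code like this.

```python
def is_verified_bot(request):
    # Stand-in for Cloudflare's cf.client.bot; here it is only a flag on the
    # sample record (hypothetical, for illustration).
    return request.get("verified_bot", False)


def cookie_rule(request):
    """True when a request has no xf_user cookie and is not a verified bot."""
    return "xf_user=" not in request.get("cookie", "") and not is_verified_bot(request)


def search_rule(request):
    """True when a guest (no xf_user cookie) requests a path containing /search/."""
    return "/search/" in request["path"] and "xf_user=" not in request.get("cookie", "")


# Hypothetical sample requests.
guest_search = {"path": "/search/123/", "cookie": ""}
member_search = {"path": "/search/123/", "cookie": "xf_user=1%2Cabc; xf_session=xyz"}
verified_crawler = {"path": "/threads/example.1/", "cookie": "", "verified_bot": True}

for name, req in [("guest", guest_search), ("member", member_search), ("crawler", verified_crawler)]:
    print(name, "cookie_rule:", cookie_rule(req), "search_rule:", search_rule(req))

# Solve rate reported for the cookie rule: ~2k solves out of 1.5M events.
print(f"{2_000 / 1_500_000:.2%}")
```

A True result means the request would receive a Managed Challenge under that rule. Logged-in members and verified bots pass untouched, which is why the reported solve rate can stay so low without locking out real users.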

3. Server-side mitigations

  • fail2ban stack (ES Dev Team): Apache .htaccess → fail2ban → iptables across ~32–35 servers; "fail2ban black belt" offer (#104); rate-limits 404/403/401 + POST speed. Caveat: fail2ban is single-threaded Python and falls behind under distributed waves (#205, #222); left untuned, it entered a faulty state during one wave (#676).
  • iptables / CSF / .htaccess ASN-CIDR deny-lists: BrettC's monthly RADB cron (whois.radb.net → aggregate with iprange → nftables/iptables, #119, #339). dutchbb's CSF stack (cc_deny countries/ASNs, SetEnvIfNoCase UA blocks for headless frameworks, firehol in lfd, Connlimit/portflood): 3,000–6,000 → 200–400 guests (#441, #443, #457). South-American country blocks had the biggest effect. nginx return 444 to bad actors (BrettC #445).
  • Blocklist sources: RADB, firehol/blocklist-ipsets (400+ feeds, L1–L4, #350, #457), AbuseIPDB (borestad list, #464), team-cymru, bgp.tools, asn.ipinfo.app, StopForumSpam/RBL.
  • Stub-ASN detection (#219): per c't magazine — a stub ASN (one uplink, no peering) signals bulletproof hosting created via identity theft; smallwheels confirmed AS206092 via Datacamp AS60068 in his logs. Anthony: "/24 ranges across all countries = 100% dodgy" (#220).
  • Anthony Parsons' CentminMod + Redis + Elasticsearch benchmarks (#705–719, the tuning centerpiece):
    • Stock XF (10k threads/100k posts) on a 1-core/2GB Linode ($12/mo); K6/Grafana load tests.
    • Biggest win = disk/page caching (Redis page cache took page loads "from seconds to under a second," #716, #719); ~100 users/sec per core before graceful degradation.
    • Tested 1M/2M/3M simulated daily uniques: 2-core/4GB handles 1M easily, 2M snappy after opcache, 3M still usable. Biggest bottleneck = MySQL (fix: thread_handling=pool-of-threads; Elasticsearch not vulnerable to table locks like MySQL search).
    • Thesis: a ~2M-post forum runs on ~$10–15/mo and "you don't need expensive, you just need access and optimisation." (eva2000: CentminMod auto-tunes the LEMP stack; JustinHawk: 1M requests/min on a $10 server, #629.)
  • "Fight it" vs "absorb/tune it": Absorb camp (Anthony, eva2000, JustinHawk): a tuned server makes bot load irrelevant; chasing IPs is endless. Fight camp (ES Dev Team #667: "if nobody fights it, we lose the indie internet"; scale = 35 servers, 5-figure/yr bandwidth). smallwheels (#310): "today protecting against scrapers costs way more than running the forum." Give-up/accept camp: Ricsca, philmckrackon ("losing battle… scrape away!"), cdub.
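
The aggregation step in the ASN/CIDR deny-list workflow above (collect an ASN's prefixes, then merge them before loading a firewall) can be illustrated as follows. BrettC's pipeline uses whois.radb.net and iprange; this sketch swaps in Python's standard `ipaddress.collapse_addresses` for the aggregation only, and the prefixes are documentation ranges chosen for illustration, not real scraper networks.

```python
import ipaddress


def aggregate(cidrs):
    """Merge adjacent and overlapping prefixes into the smallest covering set.

    IPv4 and IPv6 are collapsed separately because collapse_addresses
    rejects mixed address families.
    """
    nets = [ipaddress.ip_network(c.strip(), strict=False) for c in cidrs if c.strip()]
    v4 = [n for n in nets if n.version == 4]
    v6 = [n for n in nets if n.version == 6]
    merged = list(ipaddress.collapse_addresses(v4)) + list(ipaddress.collapse_addresses(v6))
    return [str(n) for n in merged]


# Hypothetical prefix list, as it might come out of a route-registry query.
raw_prefixes = [
    "192.0.2.0/25",
    "192.0.2.128/25",       # adjacent to the line above: merges into one /24
    "198.51.100.0/24",
    "198.51.100.64/26",     # already covered by the /24 above
    "2001:db8::/33",
    "2001:db8:8000::/33",   # adjacent: merges into one /32
]
for prefix in aggregate(raw_prefixes):
    print(prefix)
```

Fewer, larger prefixes keep the firewall rule count down, which matters once a deny-list covers hundreds of ASNs.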

4. Proof-of-work: Anubis (and tar-pits / content-poisoning)

  • Anubis (Techaro) is the thread's signature proof-of-work WAF, championed by BrettC (deployed on a non-XF "combat-log parser" site with 2–3M subpages, #136):
    • PoW, not CAPTCHA — solved automatically by the client; LLM scrapers can't bypass. Difficulty scale to 16: 4 = seconds (sweet spot, no complaints), 5 = most bots fail, 6 = up to a minute (bad for old hardware), 16 = effectively a shun-list (#138, #182). YAML botPolicies, valkey/redis backend.
    • Results: 0% scraper success on a 7-day flood at level 5 (#179); ~90% of abusive botnets culled (#370); logged-in users bypass via cookie.
    • Caveats: forced cookie = possible GDPR concern (#138); can bottleneck CPU under extreme load (JMeter/ApacheBench testing, #410); Techaro charges ~$500/mo to customize / remove the default anime mascot → compile your own (ES Dev Team #59). Deployed later on Anthony's test forum in a CF → Anubis → origin chain (#772), peaking at 4,500 redis keys / 30 min (#777). BrettC (#822): once botnets adapt, they "eat the extra compute" → need harsher levels for non-ISP providers.
  • Tar-pits / content-poisoning / "bogopedia": chillibear and lazy llama propose Markov-chain "babbler" tarpits to generate junk pages that waste and poison scrapers (#162–163). smallwheels proposes "poison the content" (feed false/half-true info, huge images) and a community "bogopedia" (#160); and poison-via-HTTP-code — services that bill per successful request pay nothing for a 403 but pay for a 404/bogus page (#255). CF's AI Labyrinth is the commercial version.
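
The proof-of-work idea behind the difficulty scale above can be sketched in a generic hashcash style: the client searches for a nonce whose SHA-256 digest starts with a given number of zero hex digits, and the server verifies with a single hash. This is an illustration of the technique, not Anubis's actual protocol; treating one difficulty level as one leading zero hex digit is an assumption, and the challenge string is made up.

```python
import hashlib
from itertools import count


def digest_for(challenge: str, nonce: int) -> str:
    """SHA-256 hex digest of the challenge concatenated with the nonce."""
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force the first nonce that meets the difficulty."""
    prefix = "0" * difficulty
    for nonce in count():
        if digest_for(challenge, nonce).startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash is enough to check the client's work."""
    return digest_for(challenge, nonce).startswith("0" * difficulty)


challenge = "example-session-token"   # hypothetical per-visitor challenge
nonce = solve(challenge, 3)           # roughly 16**3 = 4096 hashes on average
print(verify(challenge, nonce, 3))
```

Under this scheme each extra level multiplies the expected client work by 16 while server-side verification stays at one hash. That asymmetry is consistent with the pattern reported above, where a low level is barely noticeable to visitors and higher levels become slow on old hardware.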

5. Add-ons

  • [DigitalPoint] App for Cloudflare — guest edge caching (digitalpoint): the thread's most dramatic single fix. Andy.N's server (37k guests, mariadb at 488% CPU, load ~100) dropped to single digits after enabling Admin → Cloudflare → Edge caching → "cache pages for guests" (#74–75). Cached pages never reach origin (#71, #77); also serves attachments from R2. Widely re-recommended (Anthony #581, Rhody #671, Chromaniac #686).
  • [XTR] IP Threat Monitor (Osman, released 2025-11-21): the leading no-Cloudflare option for shared hosting / XF Cloud. Integrates proxycheck.io + firehol + MaxMind (country via local MaxMind to save API calls). Blocking granularity Country > ASN > Network > IP; whitelist trumps blacklist (Google/Bing never blocked); logged-in users always pass. smallwheels ran it blocking ~545 ASNs (#532); Anthony: "hosting has nothing to do with anything… it's just an addon" (#500). Config gotchas: 40k-query proxycheck plan burns fast; whitelist "Mullvad" for Mozilla VPN users (#557).
  • zeeb0t — XF Surge Guard → XF Bot Guard (free, $0): Surge Guard (2026-05-27, #510) reduces guest/bot load; Bot Guard (stable #572) is a native XF behaviour-based challenge. Signals: a FingerprintJS browser fingerprint used to "glue sessions together" across IPs, request velocity, JS completion, one-fingerprint-many-IPs / one-IP-many-fingerprints, error-page hits, CAPTCHA history → risk score → XF CAPTCHA (#516, #526, #801). Runs inside XenForo (sees routes/sessions/logged-in state); a standalone origin-side version (aiwebscraper.com heritage) runs in front of the app. Whitelists AI-search/user-triggered bots by default; targets training scrapers only. Roadmap: web-bot-auth beta (#802), GDPR-friendlier mode scoped/backlogged (#921). Result on a buried thread: 271 hits/day, not one passed the CAPTCHA (all bots, #803).
  • Sim — KnownBots (2018): definition-based bot detection; limited because a definition must exist first and bots are created faster than they can be classified. Sim's own stats: 232,000 unique UAs collected, 46,000+ identified as bots, 1,871 distinct bots (#18); still useful for custom rules (Levina #681).
  • Others: Ozzy's Spaminator + Xon's Registration/Multi-Account Blocker (spam regs); AndyB's Access log; dvduval's ChatGPT-built daily traffic-audit plugin (#900); the Supergatto "S Advanced Traffic Statistics" add-on that smallwheels suspects is massively AI-coded (#287).
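
The blocking behaviour described for [XTR] IP Threat Monitor above (logged-in users always pass, whitelist trumps blacklist, then checks from Country down to single IP) amounts to a short decision procedure. The sketch below is one way to read that description, not the add-on's code; the `Policy` fields, the `decide` function, and all sample values are hypothetical.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class Policy:
    whitelist_asns: set = field(default_factory=set)
    blocked_countries: set = field(default_factory=set)
    blocked_asns: set = field(default_factory=set)
    blocked_networks: list = field(default_factory=list)   # CIDR strings
    blocked_ips: set = field(default_factory=set)


def decide(policy, ip, country, asn, logged_in):
    """Return "allow" or "block" for one request."""
    if logged_in:
        return "allow"                      # logged-in users always pass
    if asn in policy.whitelist_asns:
        return "allow"                      # whitelist trumps every blacklist
    if country in policy.blocked_countries:
        return "block"
    if asn in policy.blocked_asns:
        return "block"
    addr = ipaddress.ip_address(ip)
    if any(addr in ipaddress.ip_network(net) for net in policy.blocked_networks):
        return "block"
    if ip in policy.blocked_ips:
        return "block"
    return "allow"


# Hypothetical policy; ASN numbers are placeholders from the private-use range.
policy = Policy(
    whitelist_asns={64500},
    blocked_countries={"XX"},
    blocked_asns={64501},
    blocked_networks=["198.51.100.0/24"],
    blocked_ips={"203.0.113.9"},
)
print(decide(policy, "198.51.100.20", "DE", 64502, logged_in=False))
print(decide(policy, "198.51.100.20", "DE", 64502, logged_in=True))
print(decide(policy, "192.0.2.1", "XX", 64500, logged_in=False))
```

Checking the broadest category first means most unwanted traffic is rejected by a cheap set lookup before any per-network matching is needed.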

6. Detection frontier

  • Browser/TLS fingerprinting: JA3/JA4 (CF Enterprise + nginx-level, eva2000 #518); FingerprintJS (zeeb0t's Bot Guard, #516 — "thumbmark" was only floated as a separate library by ES Dev Team, #813). puterfixer (#782): TLS-capability fingerprinting can contradict a claimed UA — but zeeb0t (#783): some proxies charge extra to pick a TLS fingerprint.
  • proxycheck.io gaps: great on datacenters/VPNs, catches only ~50% of residential proxies (ES Dev Team #676); can't keep up with IP-rotation speed; proxy vendors share IP pools (Asocks ↔ IPRoyal/Oxylabs/evomi, #536, #558).
  • web-bot-auth draft spec: CF's cryptographic bot-key registry to verify well-behaved bots (smallwheels #228); zeeb0t adds beta support to Bot Guard (#802). Critiqued as market capture (BrettC #815) and a "credentialed-trust class problem" (zeeb0t #817).
  • Honeypots + latency fingerprinting: chillibear's honeypot URLs + request-mirroring (#206); ES Dev Team floats latency fingerprinting across a fleet as a plugin, blocked by needing memcached (won't run on shared hosting) (#877).
  • ES Dev Team's "php2ban" → shared IP-reputation network (the thread's most ambitious build): revealed at #93: ClickHouse (async writes) + fail2ban + a PHP file ahead of the app bootloader that traps malicious hits with millisecond overhead and refers bans to iptables. Evolves into a shared, cross-site IP-reputation network (#851): every XenForo site submits scraper-wave IPs (minus CAPTCHA-passers); a central server tallies a weekly score; with 1000s of sites, "one IP visiting 5%+ of sites is very unlikely → essentially solve scraping." Engine internals: 225 bytes/IP (10M IPs = 2.25GB RAM), memcached ~57k ops/sec/core, memcached+extstore or DragonflyDB (~175 bytes/IP) to hold the ~400M active scraper set; explicitly mirrors the memcached+SQL architecture Facebook used to scale PHP (#884, #886, #912). (CrowdSec dismissed because "you can't form your own crowd," #660.)
  • Behavioural / cluster analysis (Anthony #902–905): heuristic clustering on months of logs surfaces residential-proxy clusters that look human individually; "100M IPs hitting one page then leaving IS a pattern" (#868).
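
The scoring idea behind the shared IP-reputation network above (flag an IP once it shows up in reports from 5% or more of participating sites) and its memory arithmetic can be sketched as follows. The function names, the report format, and the sample data are assumptions for illustration; the thread describes the design only at a high level.

```python
from collections import defaultdict


def flag_ips(reports, total_sites, threshold=0.05):
    """Return {ip: share_of_sites} for IPs reported by >= threshold of sites.

    `reports` is an iterable of (site_id, ip) pairs for one scoring window.
    """
    sites_by_ip = defaultdict(set)
    for site, ip in reports:
        sites_by_ip[ip].add(site)
    shares = {ip: len(sites) / total_sites for ip, sites in sites_by_ip.items()}
    return {ip: share for ip, share in shares.items() if share >= threshold}


def ram_gb(ip_count, bytes_per_ip):
    """Memory needed to hold ip_count entries, in decimal gigabytes."""
    return ip_count * bytes_per_ip / 1e9


# Hypothetical window: 40 participating sites, two reported IPs.
reports = [
    ("site-a", "203.0.113.5"),
    ("site-b", "203.0.113.5"),
    ("site-c", "203.0.113.5"),   # seen on 3 of 40 sites = 7.5%
    ("site-a", "198.51.100.7"),  # seen on 1 of 40 sites = 2.5%
]
print(flag_ips(reports, total_sites=40))

print(ram_gb(10_000_000, 225))    # the 10M-IP figure quoted above
print(ram_gb(400_000_000, 175))   # the ~400M active set at ~175 bytes/IP
```

The second memory figure is derived here from the two numbers quoted above rather than stated in the thread. It suggests why the design discussion turns to extstore or DragonflyDB: at that scale the full active set no longer fits in a small server's RAM.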

7. Ethics & SEO trade-offs

  • Allow AI crawlers at all? Splits the thread. Anthony Parsons (#797): "SEO died ~20 years ago"; his data — blocking AI-search bots dropped daily posts ~125 → 50, so don't block them (you lose members/referrals). BrettC/Wildcat/smallwheels: block training scrapers (content theft, no credit). zeeb0t: whitelist AI-search/user-triggered bots, block training scrapers — "there just needs to be a reasonable trade" (#799).
  • The "library" analogy (smallwheels #435): search bots feed the library index (send you visitors); AI bots read every book then answer/charge directly without buying them. Anthony (#438) counters with the Publishers-vs-Google-News history and the fear that AI clones content.
  • Court rulings: Bright Data won against Meta (Jan 2024) and had X/Twitter's suit dismissed (May 2024) — scraping public data is generally legal (smallwheels #259); Meta and X both failed to sue scrape.do in 2024 (#661). Anthony: rulings concern public, not private data (#260). Consensus takeaway: add explicit TOS/robots.txt anti-scraping statements — costs nothing, may help future rulings (#872).
  • Member-retention & registration walls: smallwheels put most of the forum behind a registration wall (guests see only the first post) → registrations tripled-to-quadrupled with no Google-rank drop (#425, #872, #876). Counter-risk: hiding content can hurt SEO and annoy users, and AI may already hold accounts (#254).
  • Ad-revenue / analytics pollution (Growlithe #894): bot "engagement" of 0–1 seconds pollutes Analytics and hurts ad performance; normal 1.5–2.2M pageviews/mo pushed to 3M+.
  • Privacy / deanonymization (smallwheels #254): AI aggregates fragmented cross-platform posts to de-anonymize pseudonymous users (cites an arXiv LLM-deanonymization paper).

8. The XenForo-responsibility conflict

  • xenforo.com itself overrun: members repeatedly note XF's own forum hitting 17k → 190k–200k guests (#122, #726, #769; #353 TMMAC "out of control") and rendering pages in ~10s during a CF Bot-Management outage (#463). It recovers only when XF enables Under Attack Mode (~200k → 61 guests, #769). BrettC even gets "you have been blocked… xenforo.com" as a member (#718).
  • The flare-up (p46–47): Chris D (XenForo developer) enters (#909–917): mitigation belongs at the edge (Cloudflare/Anubis) — "once traffic reaches a DB-driven app it's too late"; app-layer solving is a "fool's errand"; "we provide the software; you operate the server"; long-cookie/fingerprinting add-ons are not GDPR-compatible so XF can't ship them in core; he's open to guest page-caching docs. Sharp note: smallwheels' config blocked Chris on vanilla Chrome from a normal ISP, and "the bot problem is easy if you just 403 everything."
  • smallwheels' rebuttal (#910, #916, #928): IP-based defense hits a ceiling with RESIPs → you need behavioural analytics, and only the app knows its own patterns, so put behavioural sensors in the app feeding a gateway that blocks ("sensors ≠ the app doing the blocking"); XF should at least publish best-practices docs. Stats: 570 ASNs + ~75 countries blocked, 99.5% DACH+Benelux audience, motive is content theft not DDoS; now migrating to a VPS with a dedicated firewall. Defends his competence (career IT-security since the late 1990s).
  • Anthony Parsons (#908, #922, #927): "I hope they don't" — solving bot behaviour is the add-on market's job, like WordPress; "this is why XF staff don't participate"; then softens to "a pickle of a problem."
  • ES Dev Team (#912, #918, #924, #926): middle ground — won't run current plugins over performance, but XF should build a "pit of success" and publish best practices; "worried about everyone else," not himself.
  • Close: zeeb0t de-escalates (#921–925), reframing it as a multi-layer problem; thread locked at #928 on smallwheels' conciliatory note.

9. Meta: using AI to summarize the thread

Around #592–619, members feed the (then ~30-page) thread to ChatGPT-5.5 to produce a novice guide, with eva2000 suggesting multi-pass self-improvement and PDF output. The results draw sharp AI-slop critiques: smallwheels (#599) — "polished but typical AI: misses content, fills gaps with slop, v3 wrongly narrows the concern to photos, doesn't distinguish shared-host vs XF Cloud"; BrettC (#598) — factual errors (Linux isn't required; "spoofed bots" → "falsified user agents"); Mr Lucky (#600) — "can't verify what you didn't read." puterfixer (#618) offers prompt-engineering technique (persona + deliverables). Separately, Anthony rates Claude Opus / GPT-5.5 the best Cloudflare-rule auditors (#702).

Who’s who — key participants

Member | Role / stance | Notable contributions
Levina | OP; small photography forum on XF Cloud | Opens thread (#1); AI-ethics dilemma; journeys from "refuse CF" → IP Threat Monitor → Cloudflare
Anthony Parsons | "Absorb/tune it" + Cloudflare pragmatist; ex-SEO | /search + xf_user cookie managed-challenge rules; CentminMod/Redis/ES benchmarks; "don't block AI-search bots"; catalyst of the #922 flare-up
smallwheels | Content-theft/privacy hawk; self-host/ASN | 545–570 ASN + 75-country blocklists; RESIP-vendor & Qurium/stub-ASN forensics; library analogy; reg-wall advocate; main foil to Chris D & Anthony; posts the last message (#928)
ES Dev Team | "Fight it" / app-layer; anti-CF-monopoly | fail2ban "black belt" (~32–35 servers); php2ban (ClickHouse+fail2ban) → memcached/DragonflyDB shared IP-reputation network; "if nobody fights it we lose the indie internet"
BrettC | Proof-of-work / DNS / self-host | Anubis deep-dives & deployment; RADB cron; nginx 444; log forensics; IPv6 advocate; "AI/LLM scrapers are modern-day botnets"
zeeb0t | Add-on author | XF Surge Guard → Bot Guard ($0, FingerprintJS session-gluing, behavioural, web-bot-auth beta, GDPR mode); measured de-escalator; owns aiwebscraper.com
Osman | Add-on author | [XTR] IP Threat Monitor (proxycheck.io + MaxMind, ASN/country/VPN blocking); the no-CF option
digitalpoint (Shawn) | Add-on author; edge-first | App for Cloudflare guest edge caching; the biggest single load fix (Andy.N #74)
Chris D | XenForo developer (staff) | Official line: shed at the edge; app-layer is a "fool's errand"; GDPR blocks core fingerprinting; open to guest-caching docs
eva2000 | Cloudflare Enterprise/MVP; CentminMod | Enterprise Bot Management/JA3-JA4; CF-aggregation dashboard auto-generating WAF rules; "never use UAM, know the WAF"
Sim (Simon Hampel) | Add-on author | KnownBots; bot-management scale stats (232k UAs)
dutchbb | Self-host / CSF | cPanel/CSF stack; 3,000–6,000 → 200–400 guests; South-America blocks
wwillson | Cloudflare recipes | UA-block rule (#37); 20k → 6k guests
puterfixer | Behavioural/upstream; GDPR-cautious | Double-visit pattern; TLS-fingerprinting; "SBO era"; AI-prompt technique
lazy llama | Cloudflare-context / reporter | crawl-to-refer & pay-per-crawl analysis; thread-number enumeration; one-shot RESIPs
webbouk | Very large forum (>3.5M posts) | 5.5M uniques/night; "problem is server hits, not guest count"; UAM 30k→430
Wildcat Media | Anti-AI moral stance | 5-rule CF "nuclear" setup; Zero Trust/Access; AI Labyrinth
Others | Reporters / specialists | Andy.N (488%-CPU case study), chillibear (behavioural/Markov/FreeBSD), Suzanne O (UAM/country challenge), Jja (pro-CF), Azaly (CF-blocked-in-country → second frontend), z3r010 (CF rule order), rdn, JustinHawk, Kirby (AI-revenue-loss), Digital Doctor, duderuud, Jake B., Growlithe (ad pollution), Ricsca/philmckrackon/cdub (fatalist camp)

Tools & techniques reference

Tool / technique | What it does | Advocated by
Cloudflare Managed Challenge (cookie, /search, and UA rules) | Challenge unauthenticated/bad-UA guests | Anthony, wwillson, zeeb0t, Wildcat
CF Under Attack Mode (UAM) | Blanket JS challenge during waves (automatable via API) | webbouk, z3r010, ES Dev Team
CF Enterprise Bot Management / JA3-JA4 / Cloudforce One | ML + TLS fingerprinting vs RESIPs (~$2k/mo) | eva2000, Anthony
CF AI Labyrinth | Tar-pit maze for bad bots | Wildcat
[DigitalPoint] App for Cloudflare | Guest edge caching + R2 + ASN blocking | digitalpoint, Andy.N, Anthony
[XTR] IP Threat Monitor | proxycheck.io + MaxMind ASN/country/VPN blocking (no CF needed) | Osman, smallwheels, Anthony
XF Bot Guard (was Surge Guard) | In-app behavioural fingerprint risk scoring → CAPTCHA | zeeb0t
KnownBots | Definition-based bot flagging | Sim
Anubis | Proof-of-work WAF (difficulty 0–16) | BrettC, ES Dev Team
fail2ban → iptables | Log-driven IP/ASN bans | ES Dev Team, dutchbb
CSF / cc_deny / firehol / AbuseIPDB / RADB | Country/ASN/CIDR deny-lists & feeds | dutchbb, BrettC, smallwheels
CentminMod + Redis page-cache + Elasticsearch | Server tuning so bot load "doesn't matter" | Anthony, eva2000
php2ban / shared IP-reputation network | ClickHouse/memcached cross-site scraper scoring | ES Dev Team
proxycheck.io | IP reputation (weak on RESIPs) | smallwheels, Levina, Anthony
web-bot-auth spec | Cryptographic bot-identity registry | zeeb0t, CF
Markov tarpits / content-poisoning / "bogopedia" | Waste & poison scrapers | chillibear, lazy llama, smallwheels
Registration wall (guests see first post only) | Cut scraper value; boost registrations | smallwheels
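
The table notes that Under Attack Mode can be automated via the API. The sketch below shows one way that could look in Python. It assumes Cloudflare's zone-settings endpoint for `security_level` and a bearer API token; check both against the current API documentation before relying on them. The `choose_level` helper, its thresholds, and the "medium" baseline are hypothetical illustrations, not values anyone in the thread published.

```python
import json
import urllib.request

API = "https://api.cloudflare.com/client/v4"

def choose_level(guests, current, on_at=20000, off_at=2000, normal="medium"):
    """Hysteresis: switch UAM on above on_at, off again only below off_at.

    All thresholds are hypothetical; tune them to the forum's own baseline.
    """
    if guests >= on_at:
        return "under_attack"
    if current == "under_attack" and guests > off_at:
        return "under_attack"  # stay on until the wave has clearly passed
    return normal

def build_request(zone_id, token, level):
    """Build (but do not send) the PATCH request that sets the security level."""
    body = json.dumps({"value": level}).encode()
    return urllib.request.Request(
        f"{API}/zones/{zone_id}/settings/security_level",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

# Sending is left to the caller, e.g.:
# urllib.request.urlopen(build_request(zone_id, token, level))
```

The two thresholds are kept apart so the mode does not flap on and off while the guest count hovers near a single cut-off.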

  • 928 posts · 47 pages: Oct 2025 → Jul 2026, thread locked
  • 4,800+ guests at onset: Levina (#1), up from a few hundred
  • 488% MariaDB CPU in the Andy.N crisis: 37k guests → single digits after edge caching (#75)
  • 1.5M CF challenge events in 24h: only ~2k solved (~0.13%), Anthony (#322)
  • 200k → 61 xenforo.com guests after UAM: the forum’s own self-attack (#769)
  • ~10–50% RESIP detection rate: proxycheck.io & free tiers miss most (#157, #676)
  • 9% + 6.1% of world bot traffic from two Amazon ASNs: Amazon/AWS is the #1 bot source (#591)
  • 570 ASNs blocked by smallwheels: plus ~75 countries (#916)
  • 0% scraper success vs Anubis level 5: 7-day flood, proof-of-work WAF (#179)
  • $0 cost of zeeb0t’s Bot Guard: 271 hits/day on a buried thread; none passed (#803)
  • ~$2,000/mo CF Enterprise Bot Management: the only tier with the RESIP ML model (#828)
  • ~$10–15/mo to run a ~2M-post forum, tuned: “you just need access and optimisation” (#713)

Guest-count collapses — before → after each fix (log scale)

  • xenforo.com, Under Attack Mode (#769): 200,000 → 61
  • Andy.N, guest edge caching (#75): 37,000 → single digits
  • webbouk, Under Attack Mode (#697): 30,000 → 430
  • wwillson, UA block rule (#37): 20,000 → 6,000
  • dutchbb, CSF country/ASN stack (#443): 6,000 → 400
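
To compare these collapses on a common footing, the before and after figures can be turned into a percentage reduction and a drop in orders of magnitude, which is what a log scale displays. A small Python sketch using only the numbers listed above; the `reduction` helper is illustrative and not part of any tool named in the thread.

```python
import math

# (label, guests before, guests after) as listed above. Andy.N is omitted
# because "single digits" is not an exact figure.
cases = [
    ("xenforo.com UAM (#769)", 200_000, 61),
    ("webbouk UAM (#697)", 30_000, 430),
    ("wwillson UA block (#37)", 20_000, 6_000),
    ("dutchbb CSF stack (#443)", 6_000, 400),
]

def reduction(before, after):
    """Return (percent of guests removed, orders of magnitude dropped)."""
    pct = 100 * (before - after) / before
    return pct, math.log10(before / after)

for label, before, after in cases:
    pct, decades = reduction(before, after)
    print(f"{label}: {pct:.2f}% fewer guests, {decades:.2f} decades")
```

On these figures the UA block rule removes about 70% of guests, while the xenforo.com UAM switch removes more than three orders of magnitude.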

Key metrics & events

  • Onset: Levina's guests 3,000 → 4,800+ (#1); dropped to 710 when the wave passed (#17).
  • Andy.N crisis: 37k guests, MariaDB at 488% CPU, load ~100 → single digits after digitalpoint edge caching (#63, #75).
  • Anthony's CF cookie rule: 1.5M events / ~2k solves in 24h (~0.13% solve rate) (#322); /search fix 1M+ → ~150k daily (#358); the vast majority of traffic (85%+) was garbage (#381, page 20).
  • wwillson: guests 20,000 → 6,000 (#37). dutchbb: 3,000–6,000 → 200–400 (#443). webbouk UAM: 30k+ → 430 (#697).
  • smallwheels blocklists: 388 → 545 → 570 ASNs, ~75 countries (#213, #532, #916).
  • Amazon = #1 bot source: 9% + 6.1% of all worldwide bot traffic (two Amazon ASNs, #591). CF Radar: bots ~32–34% of traffic.
  • xenforo.com self-attack: peaks of 190k (#726), 196k (#729), 200k (#769); ~200k → 61 after UAM (#769).
  • Anthony's benchmarks: ~100 users/sec/core; 2-core/4GB handles 1M daily uniques; ~2M-post forum on ~$10–15/mo (#708–713). JustinHawk: 1M req/min on a $10 server (#629).
  • RESIP scale: vendors claim 110–440M IPs, "99.98% success"; "400M active scraper set" cited for the reputation engine (#886).
  • Anubis: level 5 → 0% scraper success on a 7-day flood (#179); ~90% of botnets culled (#370).
  • Ad pollution: normal 1.5–2.2M pageviews/mo → 3M+ from bots (#894).
  • End: thread locked at #928, 2026-07-04.
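
The Anubis results above rest on proof-of-work: each client must burn CPU on a puzzle before it is served a page, which costs a single human visitor little but adds up for a scraper making millions of requests. Below is a generic hashcash-style sketch in Python. It assumes, purely for illustration, that the difficulty counts leading zero hex digits of a SHA-256 digest; it is not Anubis's actual code or wire format, and `solve`, `verify`, and `expected_hashes` are hypothetical names.

```python
import hashlib
from itertools import count

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Cheap server-side check: one hash, compare the digest prefix."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

def solve(challenge: str, difficulty: int) -> int:
    """Expensive client-side search: try nonces until one verifies."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce

def expected_hashes(difficulty: int) -> int:
    """Average work: each extra hex digit multiplies the cost by 16."""
    return 16 ** difficulty

# Keep the demo difficulty low; level 5 averages about a million hashes.
nonce = solve("example-challenge", 3)
print(nonce, verify("example-challenge", nonce, 3), expected_hashes(5))
```

The asymmetry is the point of the design: verification is a single hash, while solving grows exponentially with the difficulty setting.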

Practical takeaways

Consensus recommendations:

  1. You cannot fully stop distributed residential-proxy scraping — aim to make it cheap to absorb and expensive for the operator.
  2. Layer defenses: edge (Cloudflare rules / UAM / edge caching) first, then server (fail2ban / ASN blocks), then app-level (Bot Guard / IP Threat Monitor) as a backstop.
  3. Tune the server (page/edge caching, Redis, Elasticsearch, opcache) so bot load stops mattering — often cheaper and more durable than blocklists.
  4. Guest edge caching (digitalpoint) is the highest-leverage quick win for CF users; /search + xf_user cookie managed-challenge is the highest-leverage CF rule.
  5. On XF Cloud / shared hosting (no .htaccess): use Cloudflare and/or [XTR] IP Threat Monitor — those are the only real levers.
  6. Whitelist verified search bots (Googlebot via CF "Verified bots"); never hard-block on the User-Agent string alone, since it is easily forged; beware Singapore blocks catching Google.
  7. Registration walls (guests see only the first post) cut scraper value and can raise registrations; add explicit anti-scraping TOS/robots.txt.
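
Recommendations 4 and 6 can be read as a simple ordering of checks at the edge. The Python sketch below expresses one such reading as plain decision logic. It is not Cloudflare's rule language, and it assumes that an `xf_user` cookie marks a returning member; the `edge_decision` function and its return labels are hypothetical.

```python
def edge_decision(path, cookies, verified_bot=False):
    """Return 'allow', 'challenge', or 'cache' for one incoming request.

    Order matters: verified crawlers and members are let through first,
    so the challenge only ever lands on anonymous traffic.
    """
    if verified_bot:
        return "allow"       # e.g. Googlebot confirmed by the edge provider
    if "xf_user" in cookies:
        return "allow"       # assumed to mark a logged-in member
    if path.startswith("/search"):
        return "challenge"   # expensive endpoint, anonymous caller
    return "cache"           # everything else is served from the guest cache

print(edge_decision("/search/?q=lens", {}))
print(edge_decision("/threads/example.123/", {}))
print(edge_decision("/search/?q=lens", {"xf_user": "token"}))
```

Putting the allow checks first is what keeps a rule like this from challenging search engines or members by accident.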

Open problems left unresolved:

  • Residential proxies remain ~50–90% undetectable by available tools; IPv6 makes per-IP blocking hopeless.
  • Cloudflare monopoly / pay-per-crawl economics worry many; free/Pro tiers can't stop RESIPs, and Enterprise (~$2k/mo) is out of reach for hobby forums.
  • Whose job is it? XenForo's official position (edge, not app; GDPR blocks core fingerprinting) leaves self-hosters to assemble their own stack. This is the unresolved tension the thread closes on. Whether XF's Cloud-side fix reaches self-hosted installs or the 2.4 release is left unanswered (#928).
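
The first open problem is easier to see with a toy model. The Python sketch below runs the same 1,000 requests through a per-IP sliding-window limiter twice: once from a single address, once spread across 1,000 addresses making one request each. The limiter, its limit of 30 requests per 60 seconds, and the traffic shapes are all illustrative assumptions, not measurements from the thread.

```python
from collections import defaultdict, deque

class PerIPLimiter:
    """Sliding-window limiter: at most `limit` requests per IP per window."""

    def __init__(self, limit, window_ms):
        self.limit = limit
        self.window_ms = window_ms
        self.hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip, now_ms):
        q = self.hits[ip]
        while q and now_ms - q[0] >= self.window_ms:
            q.popleft()                 # forget requests outside the window
        if len(q) >= self.limit:
            return False
        q.append(now_ms)
        return True

def passed(requests, limit=30, window_ms=60_000):
    """Count how many (ip, timestamp) requests a fresh limiter lets through."""
    limiter = PerIPLimiter(limit, window_ms)
    return sum(limiter.allow(ip, t) for ip, t in requests)

# One request every 100 ms for 100 seconds, in both scenarios.
single_ip = [("203.0.113.7", i * 100) for i in range(1000)]
rotating = [(f"proxy-{i}", i * 100) for i in range(1000)]

print("single IP passed:", passed(single_ip))
print("rotating IPs passed:", passed(rotating))
```

The single address is held to 60 of its 1,000 requests, while the rotating stream passes all 1,000 because no address ever makes a second request. That is the failure mode the thread attributes to residential proxies.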