Are you sure it is Googlebot and not fake bots? I see ~400k requests per day,
mostly junk out of Microsoft Azure address space. Try to filter out the crap
first, so you have more resources left for real traffic.

I have a honeypot page that is disallowed in robots.txt, plus a cron job every
5 minutes: every IP with >100 requests goes into an ipset blacklist.
Everything blacklisted is redirected to a lightweight HTML-only page.
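The cron job can be sketched roughly like this. The `/trap/` honeypot path, the log location, and the `blacklist` ipset name are assumptions for illustration; the log is expected to be in Apache combined format, where field 7 is the request path. Here the offending IPs are only printed, with the real `ipset` call left as a comment:

```shell
#!/bin/sh
# Extract client IPs that hit the honeypot path at least THRESHOLD times.
# In the real cron job you would pipe each IP into:
#   ipset add blacklist "$ip" -exist
# (after a one-time: ipset create blacklist hash:ip)

offenders() {
    # $1 = access log (combined format), $2 = honeypot path, $3 = threshold
    awk -v trap="$2" -v limit="$3" \
        '$7 == trap { hits[$1]++ }
         END { for (ip in hits) if (hits[ip] >= limit) print ip }' "$1"
}

# Demonstration on a tiny inline log; in production point this at
# something like /var/log/httpd/access_log instead.
cat > /tmp/sample_access.log <<'EOF'
203.0.113.7 - - [01/Jan/2024:00:00:01 +0000] "GET /trap/ HTTP/1.1" 200 12 "-" "bot"
203.0.113.7 - - [01/Jan/2024:00:00:02 +0000] "GET /trap/ HTTP/1.1" 200 12 "-" "bot"
198.51.100.9 - - [01/Jan/2024:00:00:03 +0000] "GET /index.html HTTP/1.1" 200 12 "-" "Mozilla"
EOF

offenders /tmp/sample_access.log /trap/ 2   # prints 203.0.113.7
```

With a 5-minute cron, the threshold only needs to catch sustained abuse; legitimate crawlers that honor robots.txt never touch the trap path at all.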

I think this only works well on IPv4, since IPv4 addresses are scarce; with
IPv6 a bot can rotate through far more addresses than you can blacklist.

PS. maybe set a crawl delay in robots.txt?
PPS. upgrading (CentOS 7 is end of life) also helps with performance
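On the crawl-delay point, a minimal robots.txt would look like the fragment below. Note that Googlebot itself ignores Crawl-delay (its rate is controlled on Google's side), but other well-behaved crawlers such as Bingbot honor it; the `/trap/` honeypot path is an assumed example:

```
User-agent: *
Crawl-delay: 10
# Honeypot: compliant crawlers skip this, abusive bots do not
Disallow: /trap/
```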


> 
> 
>       Hello,
> 
>       I’m looking for advice on handling crawler-driven overload in an
>       Apache prefork environment.
> 
>       Environment:
>       - Apache httpd with prefork MPM
>       - CentOS 7.4
>       - ~2 CPU / 4 GB RAM
>       - prefork must remain in use
> 
>       Architecture summary:
>       - Multiple main domains
>       - Tens of thousands of very small sites, each with its own hostname
>       - All hostnames are routed through a central VirtualHost using
>       vhost-level rewrite rules (no .htaccess)
>       - Each hostname maps dynamically to a directory such as:
>       /app/sites/{unique-sub-domain-slug}/
> 
>       Under normal conditions the system behaves well.
> 
>       Issue:
>       When Googlebot crawls these small sites, Apache load spikes severely
>       (load averages > 200). httpd processes grow rapidly and many sites
>       become unreachable until crawler activity subsides. Main domains
>       remain responsive during these events.
> 
>       Steps already taken:
>       - All rewrite logic moved from .htaccess to VirtualHost
>       - AllowOverride disabled
>       - Conservative timeouts and connection limits applied
>       - Resources increased compared to previous smaller deployment
> 
>       This same design handled ~150 sites reasonably well in the past.
>       With a much larger number of sites, overload now happens daily.
> 
>       My questions:
>       - Is this a known failure mode of prefork under heavy crawler
>       activity?
>       - Are there Apache-level techniques to limit crawler impact without
>       blocking Googlebot?
>       - In similar setups, what usually becomes the bottleneck first:
>       rewrite processing, filesystem checks, or process spawning?
> 
>       Any insight or real-world experience would be greatly appreciated.
