On Wed, Feb 25, 2026 at 5:47 AM Phong Thai <[email protected]> wrote:

> Yes, I’m aware of fake Googlebot traffic and that is a valid concern.
>
> I’m verifying Googlebot using reverse DNS (crawl-*.googlebot.com),
> and I do see both real Googlebot and a significant amount of
> cloud-provider traffic (Azure/AWS) spoofing the Googlebot user agent.
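Google documents verification by forward-confirmed reverse DNS: the PTR record must end in googlebot.com or google.com, and that hostname must resolve back to the original IP. A minimal sketch (function names are illustrative, not any official tooling):

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_is_google(hostname: str) -> bool:
    # Accept only hostnames under Google's published crawler domains.
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS: the PTR name must be under a
    Google crawler domain AND must resolve back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse (PTR) lookup
        if not hostname_is_google(hostname):
            return False
        # Forward-confirm: the claimed hostname must resolve back to ip.
        addrs = {info[4][0] for info in socket.getaddrinfo(hostname, None)}
        return ip in addrs
    except OSError:
        return False
```

The suffix check alone is not enough (anyone can create a PTR record claiming to be Google), which is why the forward lookup is required.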
>
> I already filter a large portion of obvious bot traffic at the
> network / firewall level, but the overload still occurs specifically
> when legitimate crawlers hit many hostnames in parallel.
>
> The difficulty is that with prefork, processes are spawned very early
> during vhost and rewrite evaluation, so even valid crawlers can
> exhaust memory before any application-level throttling applies.
>
> I’m trying to understand if there are Apache-level techniques
> to reduce rewrite / vhost routing cost per request,
> without blocking or misleading real Googlebot.
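One Apache-level way to cut per-request rewrite cost is to replace long RewriteCond chains with a single hashed RewriteMap lookup, which stays constant-time no matter how many hostnames exist. A sketch, assuming a dbm map built offline with httxt2dbm; the map name and paths are placeholders, not your actual config:

```apache
# Hypothetical sketch: one O(1) map lookup per request instead of
# evaluating a chain of per-hostname rewrite conditions.
# Build the map offline, e.g.:
#   httxt2dbm -i hostmap.txt -o /etc/httpd/hostmap.dbm
RewriteEngine On
RewriteMap hostmap "dbm:/etc/httpd/hostmap.dbm"

# Serve only hostnames present in the map; otherwise fall through.
RewriteCond ${hostmap:%{HTTP_HOST}|NOT_FOUND} !=NOT_FOUND
RewriteRule ^/(.*)$ /app/sites/${hostmap:%{HTTP_HOST}}/$1 [L]
```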
>
> On Wed, Feb 25, 2026, 4:15 PM Marc <[email protected]> wrote:
>
>> Are you sure it is Googlebot and not fake bots? I get about 400k requests
>> per day, mostly from Microsoft Azure. Try to filter out the junk, so you
>> have more resources left for real traffic.
>>
>> I have a honeypot page disallowed in robots.txt; a cron job runs every
>> 5 minutes, and every IP with more than 100 requests goes to an ipset
>> blacklist. Everything blacklisted is redirected to a lightweight
>> HTML-only page.
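That honeypot-plus-cron idea can be sketched roughly as below. The log path, honeypot URL, threshold, and ipset set name are all assumptions, not Marc's actual setup:

```shell
#!/bin/sh
# Hypothetical 5-minute cron sketch: count hits on a honeypot path in
# the access log and add any IP over the threshold to an ipset set.
LOG=/var/log/httpd/access_log
TRAP=/honeypot/
THRESHOLD=100

# Print each client IP whose honeypot hit count exceeds the threshold.
# In common log format, $1 is the client IP and $7 the request path.
abusers() {
    awk -v trap="$TRAP" -v max="$THRESHOLD" \
        '$7 == trap { n[$1]++ }
         END { for (ip in n) if (n[ip] > max) print ip }'
}

if [ -r "$LOG" ] && command -v ipset >/dev/null 2>&1; then
    ipset create blacklist hash:ip -exist      # idempotent create
    abusers < "$LOG" | while read -r ip; do
        ipset add blacklist "$ip" -exist       # idempotent add
    done
fi
```

An iptables/nftables rule matching the set would then redirect (or drop) the blacklisted IPs before they ever reach httpd.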
>>
>> I think this only works on IPv4, since those addresses are not abundant.
>>
>> PS: maybe a Crawl-delay in robots.txt?
>> PPS: upgrading also helps with performance.
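One caveat on the robots.txt suggestion: Crawl-delay is honored by crawlers such as Bing and Yandex, but Googlebot ignores it (Google's crawl rate is managed separately, historically via Search Console). So a fragment like this only slows the bots that respect the directive:

```
# Illustrative robots.txt fragment; values are arbitrary.
User-agent: *
Crawl-delay: 10
```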
>>
>>
>> >
>> >
>> >       Hello,
>> >
>> >       I’m looking for advice on handling crawler-driven overload in an
>> > Apache
>> >       prefork environment.
>> >
>> >       Environment:
>> >       - Apache httpd with prefork MPM
>> >       - CentOS 7.4
>> >       - ~2 CPU / 4 GB RAM
>> >       - prefork must remain in use
>> >
>> >       Architecture summary:
>> >       - Multiple main domains
>> >       - Tens of thousands of very small sites, each with its own
>> >       hostname
>> >       - All hostnames are routed through a central VirtualHost using
>> >       vhost-level rewrite rules (no .htaccess)
>> >       - Each hostname maps dynamically to a directory such as:
>> >       /app/sites/{unique-sub-domain-slug}/
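If the directory slug can be derived directly from the hostname, mod_vhost_alias can do this mapping with no rewrite rules at all. A sketch, assuming (hypothetically) that the slug is simply the first label of the Host header:

```apache
# Hypothetical sketch using mod_vhost_alias: map each hostname to its
# directory without per-request rewrite evaluation.
# Assumes the slug equals the first DNS label, e.g.
# foo.example.com -> /app/sites/foo/
<VirtualHost *:80>
    ServerAlias *
    UseCanonicalName Off
    # %1 = first part of the requested hostname
    VirtualDocumentRoot /app/sites/%1/
</VirtualHost>
```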
>> >
>> >       Under normal conditions the system behaves well.
>> >
>> >       Issue:
>> >       When Googlebot crawls these small sites, Apache load spikes
>> > severely
>> >       (load averages > 200). httpd processes grow rapidly and many sites
>> >       become unreachable until crawler activity subsides. Main domains
>> > remain
>> >       responsive during these events.
>> >
>> >       Steps already taken:
>> >       - All rewrite logic moved from .htaccess to VirtualHost
>> >       - AllowOverride disabled
>> >       - Conservative timeouts and connection limits applied
>> >       - Resources increased compared to previous smaller deployment
>> >
>> >       This same design handled ~150 sites reasonably well in the past.
>> > With a
>> >       much larger number of sites, overload now happens daily.
>> >
>> >       My questions:
>> >       - Is this a known failure mode of prefork under heavy crawler
>> > activity?
>> >       - Are there Apache-level techniques to limit crawler impact
>> >       without blocking Googlebot?
>> >       - In similar setups, what usually becomes the bottleneck first:
>> > rewrite
>> >       processing, filesystem checks, or process spawning?
>> >
>> >       Any insight or real-world experience would be greatly appreciated.
>>
>
The solution is really to use the event MPM here; why are you bound to the
prefork approach?

With prefork, the only way to scale is to pre-spawn enough workers to fill
up to roughly 80% of the memory available to httpd, and to make sure those
processes are not constantly killed and re-spawned.
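For comparison, an event-MPM sizing sketch for a 2-CPU / 4 GB box; the numbers below are illustrative starting points, not tuned values:

```apache
# Hypothetical mpm_event sizing sketch: a few processes, many threads,
# so thousands of mostly-idle crawler connections stay cheap.
<IfModule mpm_event_module>
    ServerLimit              4
    ThreadsPerChild         64
    MaxRequestWorkers      256   # must be <= ServerLimit * ThreadsPerChild
    MaxConnectionsPerChild   0   # never recycle workers
</IfModule>
```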
