Yes, I’m aware of fake Googlebot traffic and that is a valid concern.

I’m verifying Googlebot using reverse DNS (crawl-*.googlebot.com),
and I do see both real Googlebot and a significant amount of
cloud-provider traffic (Azure/AWS) spoofing the Googlebot user agent.
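For reference, the check follows Google’s documented reverse-then-forward
DNS verification. A minimal shell sketch (the function names are mine,
and the full check does live DNS lookups):

```shell
# Pure string check: does a PTR name fall under Google's crawl domains?
looks_like_googlebot_host() {
    case "${1%.}" in
        *.googlebot.com|*.google.com) return 0 ;;
        *) return 1 ;;
    esac
}

# Full check (does live DNS lookups):
# 1) reverse-resolve the client IP to a hostname,
# 2) reject anything outside googlebot.com / google.com,
# 3) forward-resolve that hostname and require it to match the IP.
verify_googlebot() {
    ip=$1
    ptr=$(dig +short -x "$ip" | head -n1)
    looks_like_googlebot_host "$ptr" || return 1
    dig +short "$ptr" | grep -qxF "$ip"
}
```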

I already filter a large portion of obvious bot traffic at the
network / firewall level, but the overload still occurs specifically
when legitimate crawlers hit many hostnames in parallel.
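A pipeline like the honeypot/ipset approach Marc describes below can be
sketched as a 5-minute cron step that scans the access log and emits
offenders (the helper name, log format, honeypot URL, and the
100-request threshold are all assumptions):

```shell
honeypot_offenders() {
    # $1 = access log in common log format, $2 = hit threshold
    awk -v url="/honeypot.html" -v max="$2" '
        $7 == url { hits[$1]++ }      # $1 = client IP, $7 = request path
        END { for (ip in hits) if (hits[ip] > max) print ip }
    ' "$1"
}

# Each printed IP would then go to the blacklist, e.g.:
#   honeypot_offenders /var/log/httpd/access_log 100 \
#     | while read -r ip; do ipset add blacklist "$ip" -exist; done
```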

The difficulty is that with prefork, each request ties up a full httpd
process from the moment vhost and rewrite evaluation begins, so even
valid crawlers can exhaust memory before any application-level
throttling applies.
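One mitigation on the memory side is capping prefork so a crawl burst
queues instead of swapping. A sizing sketch only; the numbers below are
assumptions (roughly 4 GB RAM divided by a guessed ~40 MB per child)
and must be derived from your own measured per-child RSS:

```apache
<IfModule mpm_prefork_module>
    StartServers             5
    MinSpareServers          5
    MaxSpareServers         10
    MaxRequestWorkers       80    # hard cap so bursts queue rather than exhaust RAM
    MaxConnectionsPerChild 1000   # recycle children to bound per-process growth
</IfModule>
```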

I’m trying to understand if there are Apache-level techniques
to reduce rewrite / vhost routing cost per request,
without blocking or misleading real Googlebot.
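One option I am looking at (sketch only, assuming the directory slug is
the first label of the hostname, i.e. foo.example.com maps to
/app/sites/foo/): replacing the central RewriteRule chain with
mod_vhost_alias, which interpolates the Host header directly into the
document root and skips per-request rewrite evaluation entirely. A
dbm-backed RewriteMap would be the alternative if the mapping is not a
simple label-to-path scheme.

```apache
# Sketch: %1 is the first dot-separated part of the Host header.
<VirtualHost *:80>
    ServerName  sites.example.com
    ServerAlias *
    UseCanonicalName Off
    VirtualDocumentRoot /app/sites/%1/
</VirtualHost>
```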

On Wed, Feb 25, 2026, 4:15 PM Marc <[email protected]> wrote:

> Are you sure it is Googlebot and not fake bots? I get ~400k requests per
> day, mostly from shit Microsoft Azure. Try to filter out the crap, so you
> have more resources left for real traffic.
>
> I have a honeypot page disallowed in robots.txt; a 5-minute cron job
> sends every IP with >100 requests to an ipset blacklist.
> Everything blacklisted is redirected to a lightweight HTML-only page.
>
> I think this only works for IPv4, as those addresses are not abundant.
>
> PS. maybe Crawl-delay in robots.txt?
> PPS. upgrading also helps with performance
>
>
> >
> >
> >       Hello,
> >
> >       I’m looking for advice on handling crawler-driven overload in an
> > Apache
> >       prefork environment.
> >
> >       Environment:
> >       - Apache httpd with prefork MPM
> >       - CentOS 7.4
> >       - ~2 CPU / 4 GB RAM
> >       - prefork must remain in use
> >
> >       Architecture summary:
> >       - Multiple main domains
> >       - Tens of thousands of very small sites, each with its own hostname
> >       - All hostnames are routed through a central VirtualHost using
> >       vhost-level rewrite rules (no .htaccess)
> >       - Each hostname maps dynamically to a directory such as:
> >       /app/sites/{unique-sub-domain-slug}/
> >
> >       Under normal conditions the system behaves well.
> >
> >       Issue:
> >       When Googlebot crawls these small sites, Apache load spikes
> > severely
> >       (load averages > 200). httpd processes grow rapidly and many sites
> >       become unreachable until crawler activity subsides. Main domains
> > remain
> >       responsive during these events.
> >
> >       Steps already taken:
> >       - All rewrite logic moved from .htaccess to VirtualHost
> >       - AllowOverride disabled
> >       - Conservative timeouts and connection limits applied
> >       - Resources increased compared to previous smaller deployment
> >
> >       This same design handled ~150 sites reasonably well in the past.
> > With a
> >       much larger number of sites, overload now happens daily.
> >
> >       My questions:
> >       - Is this a known failure mode of prefork under heavy crawler
> > activity?
> >       - Are there Apache-level techniques to limit crawler impact without
> >       blocking Googlebot?
> >       - In similar setups, what usually becomes the bottleneck first:
> > rewrite
> >       processing, filesystem checks, or process spawning?
> >
> >       Any insight or real-world experience would be greatly appreciated.
>
