Levy, Alan wrote:
I have a server that gets about 1M hits per day. Over the past week,
this has exploded and the server is using about 80% of the CPU. We
figure that someone is using a web crawler, since when we analyze the
Tomcat logs, there are thousands of hits from one IP address (every day
it's a different
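For anyone wanting to repeat the log analysis described above, here is a rough sketch of counting hits per client. It assumes each access-log line starts with the client address, as in the common log format; the sample lines and addresses are made up, and a real Tomcat log may be configured with a different pattern.

```python
from collections import Counter

def top_clients(lines, n=5):
    """Return the n busiest clients as (address, hit_count) pairs.

    Assumption: the client address is the first whitespace-separated
    field of each line (common log format).
    """
    hits = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            hits[fields[0]] += 1
    return hits.most_common(n)

# Made-up sample lines, using documentation-only addresses.
SAMPLE = [
    '192.0.2.10 - - [16/Jan/2008:08:00:01 -0500] "GET /a HTTP/1.1" 200 512',
    '192.0.2.10 - - [16/Jan/2008:08:00:02 -0500] "GET /b HTTP/1.1" 200 512',
    '198.51.100.7 - - [16/Jan/2008:08:00:03 -0500] "GET /a HTTP/1.1" 200 512',
    '192.0.2.10 - - [16/Jan/2008:08:00:04 -0500] "GET /c HTTP/1.1" 200 512',
]

if __name__ == "__main__":
    for address, count in top_clients(SAMPLE):
        print(address, count)
```

To run it against a real file, pass the open file object instead of SAMPLE, since the function accepts any iterable of lines.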
-Original Message-
From: Linux on 390 Port [mailto:[EMAIL PROTECTED] On Behalf Of
Levy, Alan
Sent: Wednesday, January 16, 2008 8:24 AM
To: LINUX-390@VM.MARIST.EDU
Subject: web crawling problem
I have a server that gets about 1M hits per day. Over the past week,
this has
First, make sure you have a robots.txt file that tells well-behaved indexers to
stop messing with you (that'll fix the yahoos and googles of the world, which
hammer machines unless told to stop).
Second, see if your network people can enable application-layer rate limiting
(that's the Cisco
Isn't a robots.txt only for Apache (which I do not use)?
-Original Message-
From: Linux on 390 Port [mailto:[EMAIL PROTECTED] On Behalf Of
Robert Flynn
Sent: Wednesday, January 16, 2008 8:56 AM
To: LINUX-390@VM.MARIST.EDU
Subject: Re: web crawling problem
You need to put a robots.txt
On Wed, 16 Jan 2008 08:24:29 -0500
Levy, Alan [EMAIL PROTECTED] wrote:
I have a server that gets about 1M hits per day. Over the past week,
this has exploded and the server is using about 80% of the CPU. We
figure that someone is using a web crawler, since when we analyze the
Tomcat logs,
On Wed, 16 Jan 2008 09:04:53 -0500
Levy, Alan [EMAIL PROTECTED] wrote:
Isn't a robots.txt only for Apache (which I do not use)?
The robots.txt file is just a text/plain file. The robots W3C stuff says
a properly behaving robot should pull /robots.txt and honour the contents
if it is found. It
Robert Flynn wrote:
I am using IBM HTTP Server and it works.
Of course, you also need to decide whether to keep all bots out. Your
site will be rather hard to find without Google, Yahoo and others
crawling around.
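As an illustration of keeping some bots out without blocking everyone, the sketch below uses Python's standard urllib.robotparser to show how a well-behaved robot would read a selective robots.txt. The bot name BadBot, the /search/ path, and example.com are made up for the example, and none of this stops a crawler that ignores robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut one crawler out entirely and keep
# everyone else out of a single expensive path. Names are illustrative.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /search/
"""

def allowed(agent: str, url: str) -> bool:
    """Return True if a compliant robot named `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    checks = [
        ("BadBot", "http://example.com/index.html"),
        ("Googlebot", "http://example.com/index.html"),
        ("Googlebot", "http://example.com/search/q"),
    ]
    for agent, url in checks:
        print(agent, url, allowed(agent, url))
```

The file itself is plain text served at /robots.txt, so it works the same whichever web server hands it out.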
Someplace suggests putting a bogus[1] path into robots.txt; it might
point to