Re: web crawling problem

2008-01-17 Thread Mark Perry
Levy, Alan wrote: I have a server that gets about 1M hits per day. Over the past week, this has exploded and the server is using about 80% of the CPU. We figure that someone is using a web crawler, since when we analyze the Tomcat logs, there are thousands of hits from one IP address (every day

web crawling problem

2008-01-16 Thread Levy, Alan
I have a server that gets about 1M hits per day. Over the past week, this has exploded and the server is using about 80% of the CPU. We figure that someone is using a web crawler, since when we analyze the Tomcat logs, there are thousands of hits from one IP address (every day it's a different
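The log analysis described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual procedure: it assumes the access log puts the client address in the first whitespace-separated field (as the common log format does), and the function name `top_clients` and the sample lines are hypothetical.

```python
from collections import Counter


def top_clients(log_lines, n=5):
    """Return the n busiest client addresses as (address, hits) pairs.

    Assumption: the client address is the first whitespace-separated
    field on each line, as in the common log format.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(n)


# Hypothetical sample lines, not taken from the poster's logs.
sample = [
    '192.0.2.10 - - [16/Jan/2008:08:00:01 -0500] "GET /a HTTP/1.1" 200 512',
    '192.0.2.10 - - [16/Jan/2008:08:00:02 -0500] "GET /b HTTP/1.1" 200 734',
    '198.51.100.7 - - [16/Jan/2008:08:00:03 -0500] "GET /a HTTP/1.1" 200 512',
    '192.0.2.10 - - [16/Jan/2008:08:00:04 -0500] "GET /c HTTP/1.1" 200 120',
]

for address, hits in top_clients(sample):
    print(address, hits)
```

Running this over each day's log separately would show whether the heaviest client really changes from day to day.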

Re: web crawling problem

2008-01-16 Thread Robert Flynn
: / -Original Message- From: Linux on 390 Port [mailto:[EMAIL PROTECTED] On Behalf Of Levy, Alan Sent: Wednesday, January 16, 2008 8:24 AM To: LINUX-390@VM.MARIST.EDU Subject: web crawling problem I have a server that gets about 1M hits per day. Over the past week, this has

Re: web crawling problem

2008-01-16 Thread David Boyes
First, make sure you have a robots.txt file that tells well-behaved indexers to stop messing with you (that'll fix the Yahoos and Googles of the world, which hammer machines unless told to stop). Second, see if your network people can enable application-layer rate limiting (that's the Cisco

Re: web crawling problem

2008-01-16 Thread Levy, Alan
Isn't robots.txt only for Apache (which I do not use)? -Original Message- From: Linux on 390 Port [mailto:[EMAIL PROTECTED] On Behalf Of Robert Flynn Sent: Wednesday, January 16, 2008 8:56 AM To: LINUX-390@VM.MARIST.EDU Subject: Re: web crawling problem You need to put a robots.txt

Re: web crawling problem

2008-01-16 Thread Alan Cox
On Wed, 16 Jan 2008 08:24:29 -0500 Levy, Alan [EMAIL PROTECTED] wrote: I have a server that gets about 1M hits per day. Over the past week, this has exploded and the server is using about 80% of the CPU. We figure that someone is using a web crawler, since when we analyze the Tomcat logs,

Re: web crawling problem

2008-01-16 Thread Alan Cox
On Wed, 16 Jan 2008 09:04:53 -0500 Levy, Alan [EMAIL PROTECTED] wrote: Isn't robots.txt only for Apache (which I do not use)? The robots.txt file is just a text/plain file. The robots W3C stuff says a properly behaving robot should pull /robots.txt and honour the contents if it is found. It
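To illustrate the point that robots.txt is plain text interpreted by the robot rather than by the web server, here is a small sketch of what a well-behaved crawler does with the file, using Python's standard `urllib.robotparser` module. The rules, the `ExampleBot` user agent, and the `/private/` path are made up for the example and are not from this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep every robot out of the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical rules: keep robots out of one directory only.
BLOCK_PRIVATE = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the lines of the file; no web server is involved here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(BLOCK_ALL, "ExampleBot", "/index.html"))
print(allowed(BLOCK_PRIVATE, "ExampleBot", "/index.html"))
print(allowed(BLOCK_PRIVATE, "ExampleBot", "/private/report.html"))
```

Because the check happens entirely on the robot's side, any server that can return the file at /robots.txt will do. The file only deters robots that choose to honour it.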

Re: web crawling problem

2008-01-16 Thread John Summerfield
Robert Flynn wrote: I am using IBM HTTP Server and it works. Of course, you also need to decide whether to keep all bots out. Your site will be rather hard to find without Google, Yahoo and others crawling around. Someplace suggests putting a bogus[1] path into robots.txt; it might point to