RE: Your Nutch Crawler is Out of Control - Apache Notified

2005-09-30 Thread Fuad Efendi
All The Best!!!
I am not a Nutch developer, but I think I found a bug in Nutch's "Robot"
implementation. Also, today I found a bug in their HTTP
implementation. But I don't have any time to re-test or, probably, to
add some code!
Good luck,
Fuad


-----Original Message-----
From: WebExpertsAmerica [mailto:[EMAIL PROTECTED] 
Sent: Friday, September 30, 2005 12:35 AM
To: Fuad Efendi
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: Your Nutch Crawler is Out of Control - Apache Notified
Importance: Low



Thank you for your email.

>>> http://lucene.apache.org/nutch/bot.html;
>>> nutch-agent@lucene.apache.org)" 128.95.1.189

The Canadian was, for some unknown reason, the first to respond to this
complaint, although it appears from our research that there was a
Canadian contingent that assisted in the development of an earlier
version of the Nutch Crawler at UW. Regardless, this has nothing to do
with the Canadian's IP.

The very first post indicated the offending IP is in fact owned by the
University of Washington, as is referenced above. UW, by the way, has
completely denied the error of its ways and is in fact suggesting that
the only reason Their crawler spent so much time at Our server was
OUR server configuration. Completely LAME, UW, not the way to
handle this issue. The Ducks are trying to DUCK the issues here.

We suppose, because they feel they are a University, with virtually
unlimited bandwidth, they are allowed to swamp people's servers with
anything they want and anytime they desire.

We are disgusted that an academic institution, a place of higher
learning, the source of education and ethics for young impressionable
minds, would attempt to pass the buck like this. Especially UW.

The facts are as follows...

1) 202 sessions on our main server since the 18th of this month.

2) 328 minutes and 08 seconds of crawl time from 128.95.1.189
turingc.cs.washington.edu since the 18th of this month.

3) Indeed this crawler is ignoring our robots.txt file, go have a look
at our file for yourself. It was installed the same day we noticed your
abusive crawler hitting our site with 30X the bandwidth and regularity
of all the Google bots combined. Despite the robots.txt file being in
place, your crawler continued to scan our server with obvious and
continuing disregard for the implemented robots.txt file. 
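For a sense of scale, the first two figures above can be combined into an average session length. This is a minimal arithmetic sketch in Python that uses only the numbers quoted in the list; the variable names are illustrative.

```python
# Figures quoted in points 1) and 2) above.
sessions = 202
crawl_seconds = 328 * 60 + 8  # 328 minutes and 08 seconds

# Average time the crawler spent per session.
avg_seconds = crawl_seconds / sessions

print(f"total crawl time: {crawl_seconds} s")        # 19688 s
print(f"average per session: {avg_seconds:.1f} s")   # 97.5 s
```

On these numbers, a session averaged a little over a minute and a half.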

This is the Internet, and while people are free to do as they please
without harming others, this is NOT UW's friggin' playground. And
someone at UW should step up and say, "Sorry," instead of trying to
pass the buck like some two-year-old.

We expect nothing less than an apology from UW.

Best Regards,

Web Experts America

>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<
WebExpertsAmerica.com
Whole Lot More for a Whole Lot Less©
$6/hr Professional Web Services http://www.WebExpertsAmerica.com

Testimonials:
http://www.WebExpertsAmerica.com/testimonials.htm

Website Solutions: http://www.WebExpertsAmerica.com/services.htm

Chat:
WebExpertsNOW
AOL, MSN (Hotmail), and Yahoo
*Contact us anytime via chat. However, we DENY, BLOCK, and BAN anyone
that adds us to their Friend/Buddy list. Nothing personal, a security
policy to protect our chat connectivity from competitor abuse.

Terms of Service:
http://www.WebExpertsAmerica.com/tos.htm

Confidential:
The information contained in this message is privileged and confidential
and protected from disclosure. If the reader of this message is not the
intended recipient, you are hereby notified that any dissemination,
distribution or copying of this communication is strictly prohibited. If
you have received this communication in error, please notify us
immediately by replying to this message and then delete it from your
computer.


> -------- Original Message --------
> Subject: RE: Your Nutch Crawler is Out of Control - Apache Notified
> From: "Fuad Efendi" <[EMAIL PROTECTED]>
> Date: Thu, September 29, 2005 10:33 pm
> To: <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> 
> N e t i q u e t t e
> Do not forget to attach logs.
> 
> 
> -----Original Message-----
> From: Richard Z. Ward [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 28, 2005 6:15 AM
> To: [EMAIL PROTECTED]
> Cc: nutch-agent@lucene.apache.org
> Subject: RE: Your Nutch Crawler is Out of Control - Apache Notified
> 
> 
> Sorry to barge in like this but I don't think you guys are making any 
> headway.
> 
> WebExpertsAmerica,
> 
> If you check the IP address on http://www.checkdomain.com, you will 
> see that the IP address 70.30.209.252 is managed by Rogers Cable of 
> Toronto. To make it easy, click on the following link:
> 
> http://www.checkd

RE: Your Nutch Crawler is Out of Control - Apache Notified

2005-09-29 Thread WebExpertsAmerica

Thank you for your email.

>>> http://lucene.apache.org/nutch/bot.html;
>>> nutch-agent@lucene.apache.org)" 128.95.1.189

The Canadian was, for some unknown reason, the first to respond to this
complaint, although it appears from our research that there was a
Canadian contingent that assisted in the development of an earlier
version of the Nutch Crawler at UW. Regardless, this has nothing to do
with the Canadian's IP.

The very first post indicated the offending IP is in fact owned by the
University of Washington, as is referenced above. UW, by the way, has
completely denied the error of its ways and is in fact suggesting that
the only reason Their crawler spent so much time at Our server was
OUR server configuration. Completely LAME, UW, not the way to
handle this issue. The Ducks are trying to DUCK the issues here.

We suppose, because they feel they are a University, with virtually
unlimited bandwidth, they are allowed to swamp people's servers with
anything they want and anytime they desire.

We are disgusted that an academic institution, a place of higher
learning, the source of education and ethics for young impressionable
minds, would attempt to pass the buck like this. Especially UW.

The facts are as follows...

1) 202 sessions on our main server since the 18th of this month.

2) 328 minutes and 08 seconds of crawl time from 128.95.1.189
turingc.cs.washington.edu since the 18th of this month.

3) Indeed this crawler is ignoring our robots.txt file, go have a look
at our file for yourself. It was installed the same day we noticed your
abusive crawler hitting our site with 30X the bandwidth and regularity
of all the Google bots combined. Despite the robots.txt file being in
place, your crawler continued to scan our server with obvious and
continuing disregard for the implemented robots.txt file. 

This is the Internet, and while people are free to do as they please
without harming others, this is NOT UW's friggin' playground. And
someone at UW should step up and say, "Sorry," instead of trying to
pass the buck like some two-year-old.

We expect nothing less than an apology from UW.

Best Regards,

Web Experts America



> -------- Original Message --------
> Subject: RE: Your Nutch Crawler is Out of Control - Apache Notified
> From: "Fuad Efendi" <[EMAIL PROTECTED]>
> Date: Thu, September 29, 2005 10:33 pm
> To: <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> 
> N e t i q u e t t e
> Do not forget to attach logs.
> 
> 
> -----Original Message-----
> From: Richard Z. Ward [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, September 28, 2005 6:15 AM
> To: [EMAIL PROTECTED]
> Cc: nutch-agent@lucene.apache.org
> Subject: RE: Your Nutch Crawler is Out of Control - Apache Notified
> 
> 
> Sorry to barge in like this but I don't think you guys are making any
> headway.
> 
> WebExpertsAmerica,
> 
> If you check the IP address on http://www.checkdomain.com, you will see
> that the IP address 70.30.209.252 is managed by Rogers Cable of Toronto.
> To make it easy, click on the following link:
> 
> http://www.checkdomain.com/cgi-bin/checkdomain.pl?domain=70.30.209.252
> 
> Unfortunately, I must have deleted your first e-mail and I don't
> remember how you figured out the machine lives at the University of
> Washington. From what I can see, the machine that is attacking you is
> likely in Toronto.
> 
> In any case, my web server has been attacked before (not by a machine
> pretending to be a nutch-agent) and the way to eventually stop it is to
> send an e-mail with your web logs to the abuse e-mail address of the
> company that manages the IP address. In your case, I

[sin #177] [6293] Your Nutch Crawler is Out of Control - Apache Notified (fwd)

2005-09-28 Thread Erik Lundberg
Dear Web Experts America,

Please see the message below, regarding your complaint about a Nutch
Crawler running on host '[EMAIL PROTECTED]'.

If you can provide us with more detailed information about the
incident, we can investigate further.  

 Erik Lundberg
 Director, CS Laboratory
 Department of Computer Science & Engineering
 University of Washington

 -- Original Message --
 Date: Fri, 23 Sep 2005 12:25:49 -0700
 From: WebExpertsAmerica <[EMAIL PROTECTED]>
 To: [EMAIL PROTECTED], [EMAIL PROTECTED]
 Cc: nutch-agent@lucene.apache.org
 Subject: Your Nutch Crawler is Out of Control - Apache Notified

 Your crawler is ignoring our robots.txt file.

 http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)"
 128.95.1.189

 You are eating bandwidth at our domain in incredible amounts. This is
 rude.

 Please stop or we will be forced to block your IP and the crawler you
 are using.

 Best Regards,

 Web Experts America
 -


 Forwarded Message

This may refer to a crawling task I ran intermittently over the last
three weeks.  We're definitely observing robots.txt, with code that's
been widely tested.  (Nutch is an Apache project that's been around
for 3 years.)

It's possible there's a bug in the robots code, but I'd find that
somewhat surprising.  The only other thing I can think of is that
WebExpertsAmerica is a Search Engine Optimization company, and they
might be doing something slightly tricky or unusual that confuses
Nutch's politeness guarantees.
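As an illustration of what "observing robots.txt" means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not Nutch's own (Java) code. The robots.txt content, the `ExampleBot` agent name, and the example.com URLs are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch
# http://<host>/robots.txt before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a made-up agent name used only for illustration.
print(allowed("ExampleBot", "http://example.com/index.html"))
print(allowed("ExampleBot", "http://example.com/private/page.html"))
```

A polite crawler runs a check like this before every request and skips any URL for which it returns False.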

It's hard for me to say much else (eg, how many of their pages we
actually crawled, whether this is a widely-seen problem) without a
little more info (eg, what domains they're complaining about, what
kinds of other complaints we might have received).  I'm happy to talk
to you or anyone at CAC about needed further action.

Note that the task has been complete for some time, and I have no more
crawling plans anytime soon.

   --Mike



RE: Your Nutch Crawler is Out of Control - Apache Notified

2005-09-28 Thread Richard Z. Ward
Sorry to barge in like this but I don't think you guys are making any
headway.

WebExpertsAmerica,

If you check the IP address on http://www.checkdomain.com, you will see that
the IP address 70.30.209.252 is managed by Rogers Cable of Toronto. To make
it easy, click on the following link:

http://www.checkdomain.com/cgi-bin/checkdomain.pl?domain=70.30.209.252

Unfortunately, I must have deleted your first e-mail and I don't remember
how you figured out the machine lives at the University of Washington. From
what I can see, the machine that is attacking you is likely in Toronto.

In any case, my web server has been attacked before (not by a machine
pretending to be a nutch-agent) and the way to eventually stop it is to send
an e-mail with your web logs to the abuse e-mail address of the company that
manages the IP address. In your case, I advise that you send an e-mail to
[EMAIL PROTECTED] You can see this e-mail address by clicking on the second
link above and then by clicking on the link "NET-70-30-209-1".

In your e-mail, specify the IP address that is the source of the attack,
include your web logs, making sure the logs clearly show the IP address
70.30.209.252 is the source of the attacks. You will likely get an automated
response and eventually (it could take a week or more) the attacks will
cease.
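A minimal Python sketch of pulling the relevant lines out of a web log for such an abuse report. It assumes the log uses the Common Log Format, where the client address is the first field on each line. The sample lines and the documentation-range addresses (192.0.2.10, 198.51.100.7) are made up for illustration and are not taken from any real log.

```python
def lines_from_ip(log_lines, ip):
    """Return the log lines whose first field (client address) equals ip."""
    matches = []
    for line in log_lines:
        fields = line.split(maxsplit=1)
        if fields and fields[0] == ip:
            matches.append(line.rstrip("\n"))
    return matches


# Hypothetical sample lines in Common Log Format.
SAMPLE_LOG = [
    '192.0.2.10 - - [27/Sep/2005:00:40:01 -0700] "GET /robots.txt HTTP/1.0" 200 68',
    '198.51.100.7 - - [27/Sep/2005:00:40:02 -0700] "GET /index.htm HTTP/1.1" 200 5120',
    '192.0.2.10 - - [27/Sep/2005:00:40:03 -0700] "GET /services.htm HTTP/1.0" 200 7312',
]

for entry in lines_from_ip(SAMPLE_LOG, "192.0.2.10"):
    print(entry)
```

Comparing the whole first field, rather than searching for the address anywhere in the line, avoids matching a longer address that merely begins with the same digits.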

Good luck -- hopefully, this will end the attacks on your web server.

Richard

-----Original Message-----
From: WebExpertsAmerica [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 27, 2005 11:44 PM
To: Wild Dancer
Cc: nutch-agent@lucene.apache.org; [EMAIL PROTECTED];
[EMAIL PROTECTED]
Subject: RE: Your Nutch Crawler is Out of Control - Apache Notified


With all due respect, who the hell are you? 

Why is a Canadian emailing us about a server located at UW?

Why is a UW webserver configured with Nutch (or aliased as Nutch)
ignoring our robots.txt file?

Something smells...

Best Regards,

Web Experts America



> -------- Original Message --------
> Subject: RE: Your Nutch Crawler is Out of Control - Apache Notified
> From: "Wild Dancer" <[EMAIL PROTECTED]>
> Date: Tue, September 27, 2005 11:17 pm
> To: "'WebExpertsAmerica'" <[EMAIL PROTECTED]>
> Cc: , <[EMAIL PROTECTED]>,
> <[EMAIL PROTECTED]>
> 
> N e t i q u e t t e
> 
> 
> 1. Someone uses "Nutch..." as an Agent Identity
> 2. Someone does not obey Netiquette
> 
> Nothing related to Nutch... This guy can use "Teleport Pro" as an
> identity, or even 
> User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET
> CLR 1.1.4322)
> 
> 
> Simply, block their IP.
> 
> 
> 
> 
> -----Original Message-----
> From: WebExpertsAmerica [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, September 27, 2005 12:40 AM
> To: Wild Dancer
> Cc: nutch-agent@lucene.apache.org; [EMAIL PROTECTED];
> [EMAIL PROTECTED]
> Subject: RE: Your Nutch Crawler is Out of Control - Apache Notified
> Importance: High
> 
> 
> 
> And you ignore our robots text file - what sort of game is this?
> Crawling our site for 3 hours every day. 
> 
> And... why is this email coming from a private account in Canada and not
> a university account where the server is located?
> 
> Here is your IP...
> 
>   70.30.209.252
> 
> Stop your crawler from hitting our servers!
> 
> The rule is, you follow the rules, and obey our robots.txt file!
> 
> What sort of arrogant techie attitude is this - we would expect much
> more from UW!
> 
> Web Experts America
> 

RE: Your Nutch Crawler is Out of Control - Apache Notified

2005-09-27 Thread WebExpertsAmerica

With all due respect, who the hell are you? 

Why is a Canadian emailing us about a server located at UW?

Why is a UW webserver configured with Nutch (or aliased as Nutch)
ignoring our robots.txt file?

Something smells...

Best Regards,

Web Experts America



> -------- Original Message --------
> Subject: RE: Your Nutch Crawler is Out of Control - Apache Notified
> From: "Wild Dancer" <[EMAIL PROTECTED]>
> Date: Tue, September 27, 2005 11:17 pm
> To: "'WebExpertsAmerica'" <[EMAIL PROTECTED]>
> Cc: , <[EMAIL PROTECTED]>,
> <[EMAIL PROTECTED]>
> 
> N e t i q u e t t e
> 
> 
> 1. Someone uses "Nutch..." as an Agent Identity
> 2. Someone does not obey Netiquette
> 
> Nothing related to Nutch... This guy can use "Teleport Pro" as an
> identity, or even 
> User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET
> CLR 1.1.4322)
> 
> 
> Simply, block their IP.
> 
> 
> 
> 
> -----Original Message-----
> From: WebExpertsAmerica [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, September 27, 2005 12:40 AM
> To: Wild Dancer
> Cc: nutch-agent@lucene.apache.org; [EMAIL PROTECTED];
> [EMAIL PROTECTED]
> Subject: RE: Your Nutch Crawler is Out of Control - Apache Notified
> Importance: High
> 
> 
> 
> And you ignore our robots text file - what sort of game is this?
> Crawling our site for 3 hours every day. 
> 
> And... why is this email coming from a private account in Canada and not
> a university account where the server is located?
> 
> Here is your IP...
> 
>   70.30.209.252
> 
> Stop your crawler from hitting our servers!
> 
> The rule is, you follow the rules, and obey our robots.txt file!
> 
> What sort of arrogant techie attitude is this - we would expect much
> more from UW!
> 
> Web Experts America
> 
> 
> 
> > -------- Original Message --------
> > Subject: RE: Your Nutch Crawler is Out of Control - Apache Notified
> > From: "Wild Dancer" <[EMAIL PROTECTED]>
> > Date: Mon, September 26, 2005 11:18 pm
> > To: 
> > Cc: <[EMAIL PROTECTED]>
> > 
> > Obviously, Web Experts have very bad UPload bandwidth.
> > 
> > Frankly, classic installation of Apache with 150 "connections" will
> > fail against 15 threads of Nutch, nothing related to a bandwidth, even
> > if it is

RE: Your Nutch Crawler is Out of Control - Apache Notified

2005-09-27 Thread Wild Dancer
N e t i q u e t t e


1. Someone uses "Nutch..." as an Agent Identity
2. Someone does not obey Netiquette

Nothing related to Nutch... This guy can use "Teleport Pro" as an
identity, or even 
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET
CLR 1.1.4322)


Simply, block their IP.
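Since the User-Agent string can be set to anything, the client address is the more reliable thing to match on. A minimal sketch of that check in Python, using the standard `ipaddress` module; the blocked ranges below are documentation-only addresses chosen for illustration, and a real site would normally do this in the web server or firewall configuration instead.

```python
import ipaddress

# Hypothetical block list: one whole network and one single host.
BLOCKED = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def is_blocked(client_ip, blocked=BLOCKED):
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in blocked)


print(is_blocked("203.0.113.42"))   # inside the blocked /24
print(is_blocked("198.51.100.8"))   # neighbour of the blocked host, not blocked
```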




-----Original Message-----
From: WebExpertsAmerica [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 27, 2005 12:40 AM
To: Wild Dancer
Cc: nutch-agent@lucene.apache.org; [EMAIL PROTECTED];
[EMAIL PROTECTED]
Subject: RE: Your Nutch Crawler is Out of Control - Apache Notified
Importance: High



And you ignore our robots text file - what sort of game is this?
Crawling our site for 3 hours every day. 

And... why is this email coming from a private account in Canada and not
a university account where the server is located?

Here is your IP...

  70.30.209.252

Stop your crawler from hitting our servers!

The rule is, you follow the rules, and obey our robots.txt file!

What sort of arrogant techie attitude is this - we would expect much
more from UW!

Web Experts America



> -------- Original Message --------
> Subject: RE: Your Nutch Crawler is Out of Control - Apache Notified
> From: "Wild Dancer" <[EMAIL PROTECTED]>
> Date: Mon, September 26, 2005 11:18 pm
> To: 
> Cc: <[EMAIL PROTECTED]>
> 
> Obviously, Web Experts have very bad UPload bandwidth.
> 
> Frankly, a classic installation of Apache with 150 "connections" will
> fail against 15 threads of Nutch; that has nothing to do with bandwidth,
> even if it is 8Mbps/800kbps for home-based sites.
> 
> Maybe Web Experts need to tune Apache Web Server and use the "worker"
> model instead of "prefork"? It can handle 6000 concurrent users
> (1024 RAM)... It saves memory by using threads instead of processes...
> 
> 
> -----Original Message-----
> From: WebExpertsAmerica [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 23, 2005 3:26 PM
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Cc: nutch-agent@lucene.apache.org
> Subject: Your Nutch Crawler is Out of Control - Apache Notified
> Importance: High
> 
> 
> 
> Your crawler is ignoring our robots.txt file.
> 
> http://lucene.apache.org/nutch/bot.html; 
> nutch-agent@lucene.apache.org)" 128.95.1.189
> 
> You are eating bandwidth at our domain in incredible amounts. This is 
> rude.
> 
> Please stop or we will be forced to block your IP and the crawler you 
> are using.
> 
> Best Regards,
> 
> Web Experts America
> 



RE: Your Nutch Crawler is Out of Control - Apache Notified

2005-09-27 Thread Wild Dancer

Obviously, Web Experts have very bad UPload bandwidth.

Frankly, a classic installation of Apache with 150 "connections" will fail
against 15 threads of Nutch; that has nothing to do with bandwidth, even if
it is 8Mbps/800kbps for home-based sites.

Maybe Web Experts need to tune Apache Web Server and use the "worker"
model instead of "prefork"? It can handle 6000 concurrent users
(1024 RAM)... It saves memory by using threads instead of processes...


-----Original Message-----
From: WebExpertsAmerica [mailto:[EMAIL PROTECTED] 
Sent: Friday, September 23, 2005 3:26 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Cc: nutch-agent@lucene.apache.org
Subject: Your Nutch Crawler is Out of Control - Apache Notified
Importance: High



Your crawler is ignoring our robots.txt file.

http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)"
128.95.1.189

You are eating bandwidth at our domain in incredible amounts. This is
rude. 

Please stop or we will be forced to block your IP and the crawler you
are using.

Best Regards,

Web Experts America







RE: Your Nutch Crawler is Out of Control - Apache Notified

2005-09-26 Thread WebExpertsAmerica

And you ignore our robots text file - what sort of game is this?
Crawling our site for 3 hours every day. 

And... why is this email coming from a private account in Canada and not
a university account where the server is located?

Here is your IP...

  70.30.209.252

Stop your crawler from hitting our servers!

The rule is, you follow the rules, and obey our robots.txt file!

What sort of arrogant techie attitude is this - we would expect much
more from UW!

Web Experts America



> -------- Original Message --------
> Subject: RE: Your Nutch Crawler is Out of Control - Apache Notified
> From: "Wild Dancer" <[EMAIL PROTECTED]>
> Date: Mon, September 26, 2005 11:18 pm
> To: 
> Cc: <[EMAIL PROTECTED]>
> 
> Obviously, Web Experts have very bad UPload bandwidth.
> 
> Frankly, a classic installation of Apache with 150 "connections" will
> fail against 15 threads of Nutch; that has nothing to do with bandwidth,
> even if it is 8Mbps/800kbps for home-based sites.
> 
> Maybe Web Experts need to tune Apache Web Server and use the "worker"
> model instead of "prefork"? It can handle 6000 concurrent users
> (1024 RAM)... It saves memory by using threads instead of processes...
> 
> 
> -----Original Message-----
> From: WebExpertsAmerica [mailto:[EMAIL PROTECTED] 
> Sent: Friday, September 23, 2005 3:26 PM
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Cc: nutch-agent@lucene.apache.org
> Subject: Your Nutch Crawler is Out of Control - Apache Notified
> Importance: High
> 
> 
> 
> Your crawler is ignoring our robots.txt file.
> 
> http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)"
> 128.95.1.189
> 
> You are eating bandwidth at our domain in incredible amounts. This is
> rude. 
> 
> Please stop or we will be forced to block your IP and the crawler you
> are using.
> 
> Best Regards,
> 
> Web Experts America
> 



Your Nutch Crawler is Out of Control - Apache Notified

2005-09-26 Thread WebExpertsAmerica

Your crawler is ignoring our robots.txt file.

http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)"
128.95.1.189

You are eating bandwidth at our domain in incredible amounts. This is
rude. 

Please stop or we will be forced to block your IP and the crawler you
are using.

Best Regards,

Web Experts America
