Great! Thanks a lot!
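
For the archive, here is a minimal sketch of the politeness rule described in the reply quoted below: when robots.txt specifies a crawl-delay, wait at least that long between requests to the host, even if the crawler's own configured delay is shorter. This is only an illustration of the idea, not Nutch's actual implementation; the function name and parameters are hypothetical.

```python
from typing import Optional


def effective_delay(configured_delay_s: float,
                    robots_crawl_delay_s: Optional[float]) -> float:
    """Return the delay (in seconds) to wait between requests to one host.

    Hypothetical helper: it never goes faster than the crawl-delay
    a webmaster has published in robots.txt.
    """
    if robots_crawl_delay_s is None:
        # robots.txt has no crawl-delay, so the configured delay applies.
        return configured_delay_s
    # Otherwise use whichever delay is longer (i.e. the slower rate).
    return max(configured_delay_s, robots_crawl_delay_s)


if __name__ == "__main__":
    # Configured for 1 request per second, but the site asks for 10 seconds.
    delay = effective_delay(1.0, 10.0)
    print(f"wait {delay} s between requests ({1.0 / delay} req/s)")
```

With a configured delay of 1 second and a 10 second crawl-delay, the sketch waits 10 seconds, so the request rate drops from 1 req/sec to 0.1 req/sec.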

On Fri, Feb 20, 2015 at 1:41 PM, Mohammad Al-Mohsin <[email protected]> wrote:

> Hi Charith,
>
> I believe you should obey crawl-delay if it exists in robots.txt in order
> to be polite. You don't want to go faster than what's specified by a
> webmaster.
>
> So if a website (e.g. https://www.aoncadis.org/robots.txt) sets a
> crawl-delay of 10 seconds, your configured rate of 1 request per second
> will be considered impolite and you might get banned.
>
>
> Best regards,
> Mohammad Al-Mohsin
>
> On Fri, Feb 20, 2015 at 1:03 PM, Charith Wickramarachchi <
> [email protected]> wrote:
>
> > Hi,
> >
> > I am new to Nutch and trying to figure out the best configuration for
> > crawling. In my Nutch configuration, I have set fetcher.threads.per.queue
> > to 2 and fetcher.server.min.delay to 2 seconds.
> >
> > As I understand the documentation, Nutch will then make no more than 2
> > requests every 2 seconds, which averages out to 1 request per second.
> >
> > Is this correct? Even though this ignores robots.txt, I assume 1 req/sec
> > is a polite request rate for many servers.
> >
> > It would be great if you could clarify this.
> >
> >
> > Thanks,
> > Charith
> >
> >
> > --
> > Charith Dhanushka Wickramaarachchi
> >
> > Tel  +1 213 447 4253
> > Blog  http://charith.wickramaarachchi.org/
> > <http://charithwiki.blogspot.com/>
> > Twitter  @charithwiki <https://twitter.com/charithwiki>
> >
> > This communication may contain privileged or other confidential
> > information and is intended exclusively for the addressee/s. If you are
> > not the intended recipient/s, or believe that you may have received this
> > communication in error, please reply to the sender indicating that fact
> > and delete the copy you received and in addition, you should not print,
> > copy, retransmit, disseminate, or otherwise use the information
> > contained in this communication. Internet communications cannot be
> > guaranteed to be timely, secure, error or virus-free. The sender does not
> > accept liability for any errors or omissions
> >
>



