Side note: setting fetcher.max.crawl.delay might help; it makes the fetcher
skip pages from sites with an unusually high crawl delay.

https://wiki.apache.org/nutch/OptimizingCrawls
6) Crawl-delay can be set in robots.txt and is obeyed by Nutch. Most sites
don't use this setting, but a few do (some maliciously). I have seen
crawl-delays as high as 2 days, expressed in seconds. The
fetcher.max.crawl.delay property makes the fetcher skip pages whose crawl
delay is greater than its value. I usually set this to 10 seconds; the
default is 30. Even at 10 seconds, if you have a lot of pages from a site
that you can only crawl at 1 page every 10 seconds, it is going to be slow.
On the flip side, setting this to a low value means pages from sites with a
longer crawl delay are skipped and not fetched.
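
As a rough illustration of the rule described above, here is a small Python
sketch. It is not Nutch's actual code: the function names and the
fallback_delay_s parameter are made up for this example, and it only models
"skip when Crawl-delay exceeds the maximum, otherwise wait that long".

```python
from typing import Optional


def effective_delay(crawl_delay_s: Optional[float],
                    max_crawl_delay_s: float = 30.0,
                    fallback_delay_s: float = 5.0) -> Optional[float]:
    """Return the seconds to wait between requests to one host,
    or None if pages from that host would be skipped.

    crawl_delay_s     -- Crawl-delay from robots.txt, None if absent
    max_crawl_delay_s -- stands in for fetcher.max.crawl.delay
    fallback_delay_s  -- hypothetical delay used when robots.txt sets none
    """
    if crawl_delay_s is None:
        return fallback_delay_s
    if crawl_delay_s > max_crawl_delay_s:
        return None  # delay too high: skip instead of waiting
    return crawl_delay_s


def min_fetch_seconds(pages: int, delay_s: float) -> float:
    """Lower bound on the time to fetch `pages` pages from a single host
    when requests to that host are spaced `delay_s` seconds apart."""
    return pages * delay_s


if __name__ == "__main__":
    two_days = 2 * 24 * 60 * 60  # 172800 seconds
    print(effective_delay(10, max_crawl_delay_s=10))        # obeyed
    print(effective_delay(two_days, max_crawl_delay_s=10))  # skipped
    print(min_fetch_seconds(1000, 10))                      # seconds for 1000 pages
```

With a 10-second delay, 1000 pages from a single host take at least about
10,000 seconds (roughly 2.8 hours), which is why such sites slow a crawl down.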

On Sat, Feb 21, 2015 at 12:22 AM, Charith Wickramarachchi <
[email protected]> wrote:

> Great! Thanks a lot!
>
> On Fri, Feb 20, 2015 at 1:41 PM, Mohammad Al-Mohsin <[email protected]> wrote:
>
> > Hi Charith,
> >
> > I believe you should obey crawl-delay if it exists in robots.txt in order
> > to be polite. You don't want to go faster than what's specified by a
> > webmaster.
> >
> > So if a website (e.g. https://www.aoncadis.org/robots.txt) has a 10-second
> > crawl-delay, your configuration of 1 request per second will be
> > considered impolite and you might get banned.
> >
> >
> > Best regards,
> > Mohammad Al-Mohsin
> >
> > On Fri, Feb 20, 2015 at 1:03 PM, Charith Wickramarachchi <
> > [email protected]> wrote:
> >
> > > Hi,
> > >
> > > I am new to Nutch and trying to figure out the best configuration for
> > > crawling. In my Nutch configuration, I have set
> > > fetcher.threads.per.queue to 2 and fetcher.server.min.delay to 2
> > > seconds.
> > >
> > > So, as I understand the documentation, in this case Nutch won't make
> > > more than 2 requests every 2 seconds, so the request rate will average
> > > 1 request per second.
> > >
> > > Is this correct? Even though it ignores robots.txt in this case, I
> > > assume 1 req/sec is a polite request rate for many servers.
> > >
> > > It would be great if you could clarify this.
> > >
> > >
> > > Thanks,
> > > Charith
> > >
> > >
> > > --
> > > Charith Dhanushka Wickramaarachchi
> > >
> > > Tel  +1 213 447 4253
> > > Blog  http://charith.wickramaarachchi.org/
> > > <http://charithwiki.blogspot.com/>
> > > Twitter  @charithwiki <https://twitter.com/charithwiki>
> > >
> > > This communication may contain privileged or other confidential
> > > information and is intended exclusively for the addressee/s. If you are
> > > not the intended recipient/s, or believe that you may have received this
> > > communication in error, please reply to the sender indicating that fact
> > > and delete the copy you received and in addition, you should not print,
> > > copy, retransmit, disseminate, or otherwise use the information
> > > contained in this communication. Internet communications cannot be
> > > guaranteed to be timely, secure, error or virus-free. The sender does
> > > not accept liability for any errors or omissions
> > >
> >
>
