Ah, I see. This is not possible right now, and making it work may not
be easy, because Nutch doesn't store any state per domain or host.

What you can do is periodically compute statistics per host or domain
and add the hosts or domains that exceed your threshold to the
DomainBlackListFilter. You must then use that filter together with the
generator. It's some work, but it will fix your issue.

Keep in mind that the current domain statistics tool only aggregates
counts for fetched and not-modified pages per host or domain; you
might want to include redirects as well.
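
A rough sketch of the periodic step, in Python. It assumes you have
already dumped the URLs you want to count to a text file, one per line
(how you export them from the crawldb is up to you), and that the
filter's rule file accepts one host per line. The function names and
the threshold are hypothetical and not part of Nutch.

```python
from collections import Counter
from urllib.parse import urlsplit


def hosts_over_threshold(urls, threshold):
    """Return the hosts that have more than `threshold` URLs, sorted by name."""
    counts = Counter()
    for url in urls:
        try:
            host = urlsplit(url.strip()).hostname
        except ValueError:
            continue  # skip URLs that cannot be parsed at all
        if host:  # skip blank lines and lines without a host part
            counts[host] += 1
    return sorted(h for h, n in counts.items() if n > threshold)


def write_blacklist(hosts, path):
    """Write one host per line (assumed format for the filter's rule file)."""
    with open(path, "w", encoding="utf-8") as out:
        for host in hosts:
            out.write(host + "\n")
```

You would run this between crawl cycles, e.g. pass an open file of
URLs to hosts_over_threshold() with your limit and hand the result to
write_blacklist(). To count per domain instead of per host, you would
need to reduce each host name to its registered domain first, which
this sketch does not do.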
On Wed, 11 Apr 2012 17:21:47 +0200, Anders Rask <[email protected]>
wrote:
As I understand it, those properties only limit the number of URLs
that are crawled per site each time you run generate.
But since Nutch requires an endless loop of generate/fetch cycles in
order to recrawl sites, the total number of URLs crawled for one site
will not be limited by the generate.max.count parameter. Am I right?
Best regards,
--Anders Rask
www.findwise.com
On 11 April 2012 17:14, Markus Jelsma
<[email protected]> wrote:
Check these properties:
<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>The maximum number of urls in a single
  fetchlist. -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>

<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Determines how the URLs are counted for generator.max.count.
  Default value is 'host' but can be 'domain'. Note that we do not count
  per IP in the new version of the Generator.
  </description>
</property>
On Wednesday 11 April 2012 17:05:04 Anders Rask wrote:
> Hi!
>
> I would like to be able to limit how many pages Nutch crawls from a
> specific site, either by specifying the total number of pages to crawl
> from one site or by specifying a depth of how many links should be
> followed from the initial seed.
>
> I've been working with Nutch for some time now but haven't been able to
> figure out how this can be achieved. So my question is: Is there any way
> to configure Nutch for this, and if not, are there any plans to implement
> this functionality?
>
>
> Best regards,
> --Anders Rask
> www.findwise.com
--
Markus Jelsma - CTO - Openindex