Ah, I see. Well, this is not possible right now, and making it work may not be easy, as Nutch doesn't store the state of a domain or host.

What you can do is periodically compute statistics per host or domain and add hosts or domains to the DomainBlackListFilter if they exceed your threshold. You must then use that filter together with the generator. It's some work, but it will fix your issue.

Keep in mind that the current domain statistics tool only aggregates statistics for fetched and not-modified pages per host or domain; you might want to include redirects as well.
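
As a rough illustration of the periodic check described above, here is a minimal Python sketch. It assumes you have already exported (url, status) pairs from your crawl data by some means; that input format, the function name hosts_over_threshold and the threshold value are all hypothetical. The status labels are modelled on CrawlDb status names and include the redirect statuses mentioned above. How the resulting hosts get into the filter's configuration depends on your setup and is not shown.

```python
from collections import Counter
from urllib.parse import urlparse

# Statuses that count toward a host's total. The labels are modelled on
# CrawlDb status names; using them as the input format is an assumption.
COUNTED_STATUSES = frozenset(
    {"db_fetched", "db_notmodified", "db_redir_temp", "db_redir_perm"}
)


def hosts_over_threshold(records, threshold, counted=COUNTED_STATUSES):
    """Return the sorted hosts whose counted pages exceed the threshold.

    records: iterable of (url, status) pairs (hypothetical input format).
    threshold: maximum number of counted pages allowed per host.
    """
    per_host = Counter()
    for url, status in records:
        if status not in counted:
            continue  # e.g. unfetched pages do not count toward the limit
        host = urlparse(url).hostname
        if host:
            per_host[host] += 1
    return sorted(host for host, n in per_host.items() if n > threshold)


if __name__ == "__main__":
    sample = [
        ("http://a.example/1", "db_fetched"),
        ("http://a.example/2", "db_notmodified"),
        ("http://a.example/3", "db_redir_temp"),
        ("http://b.example/1", "db_fetched"),
        ("http://b.example/2", "db_unfetched"),
    ]
    # Print one host per line: the candidates for the blacklist.
    for host in hosts_over_threshold(sample, threshold=2):
        print(host)
```

Run something like this between crawl cycles, so that hosts over the limit are excluded before the next generate step. To count per domain instead of per host, replace the hostname lookup with your own domain extraction.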

On Wed, 11 Apr 2012 17:21:47 +0200, Anders Rask <[email protected]> wrote:
As I understand it, those properties only limit the number of URLs that are crawled per site each time you run generate.

But since Nutch requires an endless loop of generate/fetch cycles in order to recrawl sites, the total number of URLs crawled for one site will not be limited by the generate.max.count parameter. Am I right?


Best regards,
--Anders Rask
www.findwise.com

Den 11 april 2012 17:14 skrev Markus Jelsma <[email protected]>:

Check these properties:

<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>The maximum number of urls in a single
  fetchlist. -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>

<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Determines how the URLs are counted for generator.max.count.
  Default value is 'host' but can be 'domain'. Note that we do not count
  per IP in the new version of the Generator.
  </description>
</property>



On Wednesday 11 April 2012 17:05:04 Anders Rask wrote:
> Hi!
>
> I would like to be able to limit how many pages Nutch crawls from a
> specific site, either by specifying the total number of pages to crawl from
> one site or by specifying a depth of how many links that should be followed
> from the initial seed.
>
> I've been working with Nutch for some time now but haven't been able to
> figure out how this can be achieved. So my question is: Is there any way to
> configure Nutch for this, and if not are there any plans to implement this
> functionality?
>
>
> Best regards,
> --Anders Rask
> www.findwise.com

--
Markus Jelsma - CTO - Openindex

