Thanks a lot, Markus. Unfortunately I forgot to mention that I use Nutch 1.11,
but hostdb works only with 2.x, I guess.

Tomasz

2016-02-24 11:53 GMT+01:00 Markus Jelsma <[email protected]>:

> Hello - this is possible using the HostDB. If you run updatehostdb
> frequently, you get statistics on the number of fetched, redirected, 404 and
> unfetched records for any given host. Using readhostdb and a JEXL expression,
> you can then emit a blacklist of hosts that you can use during the generate step.
>
> # Update the hostdb
> bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb/
>
> # Get the list of hosts that have 100 or more records fetched or not modified
> bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100)'
>
> # Or get the list of hosts that have 100 or more records in total
> bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(numRecords >= 100)'
>
> The list of fields that can be used in expressions (lines 93-104):
>
> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/hostdb/ReadHostDb.java?view=markup
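>
> Expressions can also combine fields. A quick sketch using only the two fields
> shown above; JEXL boolean operators like || should work here, but verify
> against the source linked above:
>
> # Hosts with 100 or more fetched/not-modified records, or 200 or more records in total
> bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100 || numRecords >= 200)'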
>
> You now have a list of hostnames that you can use with the
> domainblacklist-urlfilter at the generate step.
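>
> A rough sketch of the wiring, assuming the urlfilter-domainblacklist plugin
> and its default domainblacklist-urlfilter.txt file name (check the plugin
> and file names for your Nutch version):
>
> # Copy the dumped hostnames into the file read by the blacklist filter
> cat output/part-* > conf/domainblacklist-urlfilter.txt
>
> # With urlfilter-domainblacklist added to plugin.includes in nutch-site.xml,
> # the blacklisted hosts are skipped when generating the next fetch list
> bin/nutch generate crawl/crawldb crawl/segments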
>
> Markus
>
>
> -----Original message-----
> > From: Tomasz <[email protected]>
> > Sent: Wednesday 24th February 2016 11:30
> > To: [email protected]
> > Subject: Limit number of pages per host/domain
> >
> > Hello,
> >
> > One can set generate.max.count to limit the number of URLs per domain or
> > host in the next fetch step. But is there a way to limit the number of
> > fetched URLs per domain/host over the whole crawl process? Suppose I run
> > the generate/fetch/update cycle 6 times and want to limit the number of
> > URLs per host to 100 URLs (pages), and no more, in the whole crawldb. How
> > can I achieve that?
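> >
> > For reference, the per-fetch-list limit I mention is configured roughly
> > like this in nutch-site.xml (generate.count.mode can be "host" or "domain"):
> >
> > <property>
> >   <name>generate.count.mode</name>
> >   <value>host</value>
> > </property>
> > <property>
> >   <name>generate.max.count</name>
> >   <value>100</value>
> > </property>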
> >
> > Regards,
> > Tomasz
> >
>
