Hi - oh crap. I forgot I just committed it to 1.12-SNAPSHOT; it is not in the
1.11 release. You can fetch trunk or the 1.12-SNAPSHOT for that feature!
Markus

 
 
-----Original message-----
> From:Tomasz <polish.software.develo...@gmail.com>
> Sent: Wednesday 24th February 2016 15:26
> To: user@nutch.apache.org
> Subject: Re: Limit number of pages per host/domain
> 
> Thanks a lot, Markus. Unfortunately, I forgot to mention that I use Nutch 1.11,
> but the HostDB works only with 2.x, I guess.
> 
> Tomasz
> 
> 2016-02-24 11:53 GMT+01:00 Markus Jelsma <markus.jel...@openindex.io>:
> 
> > Hello - this is possible using the HostDB. If you run updatehostdb frequently,
> > you get statistics on the number of fetched, redirected, 404 and unfetched
> > records for any given host. Using readhostdb and a Jexl expression, you can
> > then emit a blacklist of hosts that you can use during generate.
> >
> > # Update the hostdb
> > bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb/
> >
> > # Get the list of hosts that have 100 or more records fetched or not modified
> > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100)'
> >
> > # Or get the list of hosts that have 100 or more records in total
> > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(numRecords >= 100)'
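> >
> > The expression is plain Jexl, so conditions can presumably be combined as
> > well; an untested sketch using the same field names as above:
> >
> > # Hosts with 100 or more fetched-OK records, or 200 or more records in total
> > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100 or numRecords >= 200)'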
> >
> > The list of fields that can be used in expressions is in ReadHostDb.java (lines 93-104):
> >
> > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/hostdb/ReadHostDb.java?view=markup
> >
> > You now have a list of hostnames that you can use with the
> > domainblacklist-urlfilter at the generate step.
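> >
> > A minimal sketch of that last step (assuming the default blacklist file name
> > domainblacklist-urlfilter.txt and that the plugin is enabled in
> > plugin.includes; check the urlfilter.domainblacklist.file property in your
> > config for the exact file name):
> >
> > # Copy the dumped hostnames into the blacklist file read by the URL filter
> > cat output/part-* > conf/domainblacklist-urlfilter.txt
> > # Generate as usual; URL filtering is on by default (disable with -noFilter)
> > bin/nutch generate crawl/crawldb/ crawl/segments -topN 10000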
> >
> > Markus
> >
> >
> > -----Original message-----
> > > From:Tomasz <polish.software.develo...@gmail.com>
> > > Sent: Wednesday 24th February 2016 11:30
> > > To: user@nutch.apache.org
> > > Subject: Limit number of pages per host/domain
> > >
> > > Hello,
> > >
> > > One can set generate.max.count to limit the number of URLs per domain or
> > > host in the next fetch step. But is there a way to limit the number of
> > > fetched URLs per domain/host across the whole crawl process? Suppose I run
> > > the generate/fetch/update cycle 6 times and want to limit the number of
> > > URLs per host to 100 pages, and no more, in the whole crawldb. How can I
> > > achieve that?
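> > >
> > > For reference, the per-cycle cap I mean is set roughly like this (a sketch,
> > > assuming the generate job accepts Hadoop -D options; generate.count.mode
> > > can be host or domain):
> > >
> > > # Limits each generated segment to 100 URLs per host - per cycle, not for the whole crawl
> > > bin/nutch generate -D generate.max.count=100 -D generate.count.mode=host crawl/crawldb/ crawl/segments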
> > >
> > > Regards,
> > > Tomasz
> > >
> >
> 
