Oh, great. Will try with 1.12, thanks.

2016-02-24 15:39 GMT+01:00 Markus Jelsma <[email protected]>:
> Hi - oh crap. I forgot I just committed it to 1.12-SNAPSHOT, it is not in
> the 1.11 release. You can fetch trunk or NUTCH-1.12-SNAPSHOT for that
> feature!
>
> Markus
>
> -----Original message-----
> > From: Tomasz <[email protected]>
> > Sent: Wednesday 24th February 2016 15:26
> > To: [email protected]
> > Subject: Re: Limit number of pages per host/domain
> >
> > Thanks a lot Markus. Unfortunately I forgot to mention I use Nutch 1.11,
> > but hostdb works only with 2.x I guess.
> >
> > Tomasz
> >
> > 2016-02-24 11:53 GMT+01:00 Markus Jelsma <[email protected]>:
> >
> > > Hello - this is possible using the HostDB. If you update the hostdb
> > > frequently, you get statistics on the number of fetched, redirected,
> > > 404 and unfetched records for any given host. Using readhostdb and a
> > > Jexl expression, you can then emit a blacklist of hosts that you can
> > > use during generate.
> > >
> > > # Update the hostdb
> > > bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb/
> > >
> > > # Get list of hosts that have over 100 records fetched or not modified
> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100)'
> > >
> > > # Or get list of hosts that have over 100 records in total
> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(numRecords >= 100)'
> > >
> > > List of fields that are expressible (line 93-104):
> > > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/hostdb/ReadHostDb.java?view=markup
> > >
> > > You now have a list of hostnames that you can use with the
> > > domainblacklist-urlfilter at the generate step.
> > >
> > > Markus
> > >
> > > -----Original message-----
> > > > From: Tomasz <[email protected]>
> > > > Sent: Wednesday 24th February 2016 11:30
> > > > To: [email protected]
> > > > Subject: Limit number of pages per host/domain
> > > >
> > > > Hello,
> > > >
> > > > One can set generate.max.count to limit the number of URLs per domain
> > > > or host in the next fetch step. But is there a way to limit the number
> > > > of fetched URLs per domain/host over the whole crawl process? Suppose
> > > > I run the generate/fetch/update cycle 6 times and want to limit the
> > > > number of URLs per host to 100 URLs (pages) and no more in the whole
> > > > crawldb. How can I achieve that?
> > > >
> > > > Regards,
> > > > Tomasz
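A minimal sketch of the blacklist step Markus describes, assuming a local
(non-distributed) Nutch 1.12-SNAPSHOT runtime. The crawl/blacklist output
path, the part-* layout of the readhostdb output, and the
conf/domainblacklist-urlfilter.txt target file for the
urlfilter-domainblacklist plugin are assumptions, not confirmed by the
thread; check them against your version and configuration.

#!/usr/bin/env bash
# Sketch only: rebuild the host blacklist from the hostdb.
# ASSUMPTIONS: readhostdb writes plain-text part-* files, and the
# urlfilter-domainblacklist plugin (enabled via plugin.includes in
# nutch-site.xml) reads hostnames from conf/domainblacklist-urlfilter.txt.

CRAWL=crawl

# Refresh per-host statistics from the current crawldb.
bin/nutch updatehostdb -hostdb "$CRAWL/hostdb" -crawldb "$CRAWL/crawldb"

# Emit hostnames matching the Jexl expression; clear the output dir
# first, since a MapReduce job will not overwrite an existing one.
rm -rf "$CRAWL/blacklist"
bin/nutch readhostdb "$CRAWL/hostdb" "$CRAWL/blacklist" -dumpHostnames -expr '(ok >= 100)'

# Collect the emitted hostnames into the blacklist file the filter reads.
cat "$CRAWL/blacklist"/part-* > conf/domainblacklist-urlfilter.txt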

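And a hedged sketch of how this might be wired into Tomasz's six-cycle
scenario, so that each generate skips hosts that already hold 100 or more
records. The inject/generate/fetch/parse/updatedb arguments (seed
directory, -topN value, segment selection) are illustrative only.

#!/usr/bin/env bash
# Sketch: cap hosts at roughly 100 records across 6 generate/fetch/update
# cycles by refreshing the blacklist before each generate.

CRAWL=crawl
bin/nutch inject "$CRAWL/crawldb" seeds/

for i in 1 2 3 4 5 6; do
  # Rebuild per-host stats and the blacklist (see the sketch above).
  bin/nutch updatehostdb -hostdb "$CRAWL/hostdb" -crawldb "$CRAWL/crawldb"
  rm -rf "$CRAWL/blacklist"
  bin/nutch readhostdb "$CRAWL/hostdb" "$CRAWL/blacklist" -dumpHostnames -expr '(numRecords >= 100)'
  cat "$CRAWL/blacklist"/part-* > conf/domainblacklist-urlfilter.txt

  # Generate applies URL filters by default, so blacklisted hosts drop out.
  bin/nutch generate "$CRAWL/crawldb" "$CRAWL/segments" -topN 1000
  SEGMENT=$(ls -d "$CRAWL/segments"/* | tail -1)
  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb "$CRAWL/crawldb" "$SEGMENT"
done

Note the cap is approximate: a host can overshoot in the cycle where it
crosses the threshold, so keeping generate.max.count at 100 as well bounds
how far a single cycle can push it past the limit.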
