Oh, great. Will try with 1.12, thanks.

2016-02-24 15:39 GMT+01:00 Markus Jelsma <[email protected]>:

> Hi - oh crap. I forgot I just committed it to 1.12-SNAPSHOT; it is not in
> the 1.11 release. You can fetch trunk or NUTCH-1.12-SNAPSHOT for that
> feature!
> Markus
>
>
>
> -----Original message-----
> > From:Tomasz <[email protected]>
> > Sent: Wednesday 24th February 2016 15:26
> > To: [email protected]
> > Subject: Re: Limit number of pages per host/domain
> >
> > Thanks a lot, Markus. Unfortunately I forgot to mention that I use Nutch
> > 1.11, but hostdb works only with 2.x, I guess.
> >
> > Tomasz
> >
> > 2016-02-24 11:53 GMT+01:00 Markus Jelsma <[email protected]>:
> >
> > > Hello - this is possible using the HostDB. If you run updatehostdb
> > > frequently, you get statistics on the number of fetched, redirected,
> > > 404, and unfetched records for any given host. Using readhostdb and a
> > > Jexl expression, you can then emit a blacklist of hosts that you can use
> > > during generate.
> > >
> > > # Update the hostdb
> > > bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb/
> > >
> > > # Get the list of hosts with 100 or more fetched or not-modified records
> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100)'
> > >
> > > # Or get the list of hosts with 100 or more records in total
> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(numRecords >= 100)'
> > >
> > > The list of fields usable in expressions (lines 93-104):
> > >
> > > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/hostdb/ReadHostDb.java?view=markup
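> > >
> > > Conditions can also be combined in one Jexl expression. A sketch using
> > > only the 'ok' and 'numRecords' fields shown above (check the source for
> > > the other field names):
> > >
> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100 || numRecords >= 200)'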
> > >
> > > You now have a list of hostnames that you can use with the
> > > domainblacklist-urlfilter at the generate step.
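> > >
> > > A minimal sketch of that wiring (the blacklist file name is an
> > > assumption based on the usual urlfilter naming, and the filter plugin
> > > must be enabled in plugin.includes; adjust to your setup):
> > >
> > > # Copy the emitted hostnames into the filter's blacklist file
> > > cat output/part-* > conf/domainblacklist-urlfilter.txt
> > > # Generate; URL filtering runs unless -noFilter is given, so
> > > # blacklisted hosts are skipped
> > > bin/nutch generate crawl/crawldb crawl/segments -topN 50000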
> > >
> > > Markus
> > >
> > >
> > > -----Original message-----
> > > > From:Tomasz <[email protected]>
> > > > Sent: Wednesday 24th February 2016 11:30
> > > > To: [email protected]
> > > > Subject: Limit number of pages per host/domain
> > > >
> > > > Hello,
> > > >
> > > > One can set generate.max.count to limit the number of URLs per domain
> > > > or host in the next fetch step. But is there a way to limit the number
> > > > of fetched URLs per domain/host over the whole crawl process? Suppose
> > > > I run the generate/fetch/update cycle 6 times and want to limit the
> > > > number of URLs per host to 100 pages, and not more, in the whole
> > > > crawldb. How can I achieve that?
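> > > >
> > > > For reference, this is the per-cycle limit I mean (a sketch; passing
> > > > -D properties assumes the standard Hadoop generic options, otherwise
> > > > set them in nutch-site.xml):
> > > >
> > > > bin/nutch generate -Dgenerate.max.count=100 -Dgenerate.count.mode=host crawl/crawldb crawl/segments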
> > > >
> > > > Regards,
> > > > Tomasz
> > > >
> > >
> >
>
