Hello, see inline.

Regards,
Markus 
 
-----Original message-----
> From: Tomasz <polish.software.develo...@gmail.com>
> Sent: Tuesday 1st March 2016 18:07
> To: user@nutch.apache.org
> Subject: Re: Limit number of pages per host/domain
> 
> I've been running Nutch 1.12 for two days (btw, I noticed a significant load
> drop during fetching compared to 1.11; it dropped from 20 to 1.5 with 64
> fetchers running). Anyway, I tried to use the domainblacklist plugin but it
> didn't work. This is what I did:

That is odd; to my knowledge, nothing we changed in Nutch should cause the load
to drop so much. Are you sure your fetchers still have the same amount of work
to do as before?
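
A quick way to verify is to compare the generated and fetched counts per
segment (a sketch; point -dir at your segments directory):

bin/nutch readseg -list -dir c1/segments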
> 
> - I prepared the domain list with update/readhostdb,
> - copied domainblacklist-urlfilter.txt to the conf/ directory,
> - enabled the plugin in nutch-site.xml
> (<name>plugin.includes</name><value>urlfilter-domainblacklist|protocol-httpclient[....])
> - ran the generate command:
> bin/nutch generate c1/crawldb c1/segments -topN 50000 -noFilter
> - started the fetch step...

Did you confirm that readhostdb generated the output for the filter? Also,
-noFilter at the generate step disables all filtering, including the blacklist.
Make sure you don't run urlfilter-regex and other heavy filters during the
generate and updatedb steps; run them on updatedb only if you have changed the
regex config file.
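
For illustration, a minimal sequence along those lines (a sketch: the paths
follow your earlier commands, and the part-file names in the readhostdb output
directory depend on the Hadoop version):

# dump the blacklisted hosts and verify the output actually contains hostnames
bin/nutch readhostdb c1/hostdb output -dumpHostnames -expr '(numRecords >= 100)'
head output/part*

# install the list as the blacklist file, then generate WITHOUT -noFilter
cat output/part* > conf/domainblacklist-urlfilter.txt
bin/nutch generate c1/crawldb c1/segments -topN 50000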

> 
> ...and Nutch is still fetching urls from the blacklist. Did I miss
> something? Can the -noFilter option interfere with the domainblacklist
> plugin? I guess -noFilter refers to regex-urlfilter, am I right? I can only
> see in the log that the plugin was properly activated:

-noFilter disables all filters, so the blacklist filter never ran. Without the
flag, urlfilter-regex, if you have it configured, will run as well, and that
one is too heavy for the generate step.
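
If you want the blacklist applied at generate time without paying for the
regex filter, one option (a sketch: the plugin list below is an assumption,
trim it to what you actually use, e.g. via a separate NUTCH_CONF_DIR for the
generate/updatedb steps) is to leave urlfilter-regex out of plugin.includes:

<property>
  <name>plugin.includes</name>
  <value>urlfilter-domainblacklist|protocol-httpclient|parse-(html|tika)|index-basic</value>
</property>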

> 
> INFO  domainblacklist.DomainBlacklistURLFilter - Attribute "file" is
> defined for plugin urlfilter-domainblacklist as
> domainblacklist-urlfilter.txt

This should be fine.
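
In case the file ever needs to live somewhere else: assuming the plugin
follows the same convention as urlfilter-domain (the property name here is an
assumption, check the plugin source), the path can be overridden in
nutch-site.xml:

<property>
  <name>urlfilter.domainblacklist.file</name>
  <value>domainblacklist-urlfilter.txt</value>
</property>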

> 
> Tomasz
> 
> 
> 2016-02-24 15:48 GMT+01:00 Tomasz <polish.software.develo...@gmail.com>:
> 
> > Oh, great. Will try with 1.12, thanks.
> >
> > 2016-02-24 15:39 GMT+01:00 Markus Jelsma <markus.jel...@openindex.io>:
> >
> >> Hi - oh crap. I forgot I just committed it to 1.12-SNAPSHOT; it is not in
> >> the 1.11 release. You can fetch trunk or NUTCH-1.12-SNAPSHOT for that
> >> feature!
> >> Markus
> >>
> >>
> >>
> >> -----Original message-----
> >> > From: Tomasz <polish.software.develo...@gmail.com>
> >> > Sent: Wednesday 24th February 2016 15:26
> >> > To: user@nutch.apache.org
> >> > Subject: Re: Limit number of pages per host/domain
> >> >
> >> > Thanks a lot, Markus. Unfortunately I forgot to mention that I use Nutch
> >> > 1.11, but hostdb works only with 2.x, I guess.
> >> >
> >> > Tomasz
> >> >
> >> > 2016-02-24 11:53 GMT+01:00 Markus Jelsma <markus.jel...@openindex.io>:
> >> >
> >> > > Hello - this is possible using the HostDB. If you run updatehostdb
> >> > > frequently, you get statistics on the number of fetched, redirected,
> >> > > 404 and unfetched records for any given host. Using readhostdb and a
> >> > > Jexl expression, you can then emit a blacklist of hosts that you can
> >> > > use during generate.
> >> > >
> >> > > # Update the hostdb
> >> > > bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb/
> >> > >
> >> > > # Get the list of hosts that have 100 or more records fetched or not modified
> >> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100)'
> >> > >
> >> > > # Or get the list of hosts that have 100 or more records in total
> >> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(numRecords >= 100)'
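> >> > >
> >> > > Expressions can also combine fields; for instance, with the two fields
> >> > > shown above (a sketch, the thresholds are arbitrary):
> >> > >
> >> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100 || numRecords >= 200)'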
> >> > >
> >> > > The list of fields that can be used in expressions (lines 93-104 of the
> >> > > source):
> >> > >
> >> > > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/hostdb/ReadHostDb.java?view=markup
> >> > >
> >> > > You now have a list of hostnames that you can use with the
> >> > > domainblacklist-urlfilter at the generate step.
> >> > >
> >> > > Markus
> >> > >
> >> > >
> >> > > -----Original message-----
> >> > > > From: Tomasz <polish.software.develo...@gmail.com>
> >> > > > Sent: Wednesday 24th February 2016 11:30
> >> > > > To: user@nutch.apache.org
> >> > > > Subject: Limit number of pages per host/domain
> >> > > >
> >> > > > Hello,
> >> > > >
> >> > > > One can set generate.max.count to limit the number of urls per domain
> >> > > > or host in the next fetch step. But is there a way to limit the number
> >> > > > of fetched urls per domain/host over the whole crawl process? Suppose
> >> > > > I run the generate/fetch/update cycle 6 times and want to limit the
> >> > > > number of urls per host to 100 urls (pages), and not more, in the
> >> > > > whole crawldb. How can I achieve that?
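> >> > > >
> >> > > > For reference, the per-segment limit mentioned above is configured
> >> > > > like this (a sketch; generate.count.mode can be "host" or "domain"):
> >> > > >
> >> > > > <property>
> >> > > >   <name>generate.max.count</name>
> >> > > >   <value>100</value>
> >> > > > </property>
> >> > > > <property>
> >> > > >   <name>generate.count.mode</name>
> >> > > >   <value>host</value>
> >> > > > </property>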
> >> > > >
> >> > > > Regards,
> >> > > > Tomasz
> >> > > >
> >> > >
> >> >
> >>
> >
> >
> 
