Yes, I'm sure the fetchers have the same amount of work. I've just doubled the fetchers to 128 and the load rose to 3.6, but that is still a huge change compared to 20.0 (with only 64 threads). There is one difference: I set up Nutch 1.11 from the compiled bin package downloaded from Apache, but for 1.12 I downloaded the source code and built it myself on the machine with ant, since there is no bin package available yet. Maybe that's what this is all about. I'm not sure, I don't know the JVM well.
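
(For reference, the fetcher thread count can be set with the fetcher.threads.fetch
property in nutch-site.xml, or passed per run with the -threads option of the
fetch step; the snippet below is only a sketch and the segment path is made up:)

<property>
  <name>fetcher.threads.fetch</name>
  <value>128</value>
</property>

# or per fetch run, e.g.:
bin/nutch fetch c1/segments/20160301120000 -threads 128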
readhostdb generated a few files. I merged them and saved the result as
conf/domainblacklist-urlfilter.txt, and the list seems to be reliable.
Sorry, but I'm not sure what you mean by asking to "confirm readhostdb
generated the output for the filter".

You're saying that -noFilter is to blame for domainblacklist-urlfilter.txt
not being used. That makes sense. URL filtering is active only during the
parsing step, which is done by the fetcher (I don't store the content);
parse.filter.urls = true, so the filtering is disabled on the generate and
update steps. What I'm going to do is run the generate step without
-noFilter and, for the time being, replace the regex-urlfilter file with
some light regex, since my original regex-urlfilter is too heavy (see the
sketch at the very bottom of this mail). Thanks Markus for all your hints :)

2016-03-01 20:32 GMT+01:00 Markus Jelsma <markus.jel...@openindex.io>:

> Hello, see inline.
>
> Regards,
> Markus
>
> -----Original message-----
> > From:Tomasz <polish.software.develo...@gmail.com>
> > Sent: Tuesday 1st March 2016 18:07
> > To: user@nutch.apache.org
> > Subject: Re: Limit number of pages per host/domain
> >
> > I've been running Nutch 1.12 for two days (btw. I noticed a significant
> > load drop during fetching compared to 1.11, it dropped from 20 to 1.5
> > with 64 fetchers running). Anyway, I tried to use the domainblacklist
> > plugin but it didn't work. This is what I did:
>
> That is odd, to my knowledge, nothing we did in Nutch should cause the
> load to drop so much. Are you sure your fetchers still have the same
> amount of work to do as before?
>
> > - I prepared the domain list with updatehostdb/readhostdb,
> > - copied domainblacklist-urlfilter.txt to the conf/ directory,
> > - enabled the plugin in nutch-site.xml
> >   (<name>plugin.includes</name><value>urlfilter-domainblacklist|protocol-httpclient[....])
> > - ran the generate command
> >   bin/nutch generate c1/crawldb c1/segments -topN 50000 -noFilter
> > - started a fetch step...
>
> Did you confirm readhostdb generated the output for the filter? Also, the
> -noFilter at the generate step disables filtering. Make sure you don't do
> urlfilter-regex and other heavy filters during generate and updatedb
> steps. Do it only on updatedb if you have changed the regex config file.
>
> > ...and Nutch is still fetching urls from the blacklist. Did I miss
> > something? Can the -noFilter option interfere with the domainblacklist
> > plugin? I guess -noFilter refers to regex-urlfilter, am I right? I can
> > only see in the log that the plugin was properly activated:
>
> -noFilter disables all filters. So without it, urlfilter-regex, if you
> have it configured, will run as well; it is too heavy.
>
> > INFO domainblacklist.DomainBlacklistURLFilter - Attribute "file" is
> > defined for plugin urlfilter-domainblacklist as
> > domainblacklist-urlfilter.txt
>
> This should be fine.
>
> > Tomasz
> >
> >
> > 2016-02-24 15:48 GMT+01:00 Tomasz <polish.software.develo...@gmail.com>:
> >
> > > Oh, great. Will try with 1.12, thanks.
> > >
> > > 2016-02-24 15:39 GMT+01:00 Markus Jelsma <markus.jel...@openindex.io>:
> > >
> > >> Hi - oh crap. I forgot I just committed it to 1.12-SNAPSHOT, it is
> > >> not in the 1.11 release. You can fetch trunk or NUTCH-1.12-SNAPSHOT
> > >> for that feature!
> > >> Markus
> > >>
> > >> -----Original message-----
> > >> > From:Tomasz <polish.software.develo...@gmail.com>
> > >> > Sent: Wednesday 24th February 2016 15:26
> > >> > To: user@nutch.apache.org
> > >> > Subject: Re: Limit number of pages per host/domain
> > >> >
> > >> > Thanks a lot Markus.
> > >> > Unfortunately I forgot to mention that I use Nutch 1.11, but
> > >> > hostdb works only with 2.x, I guess.
> > >> >
> > >> > Tomasz
> > >> >
> > >> > 2016-02-24 11:53 GMT+01:00 Markus Jelsma <markus.jel...@openindex.io>:
> > >> >
> > >> > > Hello - this is possible using the HostDB. If you updatehostdb
> > >> > > frequently you get statistics on the number of fetched, redirs,
> > >> > > 404's and unfetched for any given host. Using readhostdb and a
> > >> > > Jexl expression, you can then emit a blacklist of hosts that you
> > >> > > can use during generate.
> > >> > >
> > >> > > # Update the hostdb
> > >> > > bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb/
> > >> > >
> > >> > > # Get list of hosts that have over 100 records fetched or not modified
> > >> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100)'
> > >> > >
> > >> > > # Or get list of hosts that have over 100 records in total
> > >> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(numRecords >= 100)'
> > >> > >
> > >> > > List of fields that are expressible (line 93-104):
> > >> > >
> > >> > > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/hostdb/ReadHostDb.java?view=markup
> > >> > >
> > >> > > You now have a list of hostnames that you can use with the
> > >> > > domainblacklist-urlfilter at the generate step.
> > >> > >
> > >> > > Markus
> > >> > >
> > >> > > -----Original message-----
> > >> > > > From:Tomasz <polish.software.develo...@gmail.com>
> > >> > > > Sent: Wednesday 24th February 2016 11:30
> > >> > > > To: user@nutch.apache.org
> > >> > > > Subject: Limit number of pages per host/domain
> > >> > > >
> > >> > > > Hello,
> > >> > > >
> > >> > > > One can set generate.max.count to limit the number of urls per
> > >> > > > domain or host in the next fetch step. But is there a way to
> > >> > > > limit the number of fetched urls per domain/host over the whole
> > >> > > > crawl process? Supposing I run the generate/fetch/update cycle
> > >> > > > 6 times and want to limit the number of urls per host to 100
> > >> > > > urls (pages) and not more in the whole crawldb. How can I
> > >> > > > achieve that?
> > >> > > >
> > >> > > > Regards,
> > >> > > > Tomasz
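
PS: the "light" regex-urlfilter.txt mentioned at the top would be nothing
more than a minimal sketch along these lines (in this file the first
matching rule wins, '-' excludes and '+' includes; the extension list is
only an example):

# temporary lightweight filter for the generate step
# skip common static/binary resources
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|js|JS|zip|ZIP|gz|GZ|pdf|PDF)$
# accept everything else
+.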