I've been running Nutch 1.12 for two days (by the way, I noticed a significant drop in load during fetching compared to 1.11: it went from 20 to 1.5 with 64 fetchers running). Anyway, I tried to use the domainblacklist plugin, but it didn't work. This is what I did:

- prepared the domain list with updatehostdb/readhostdb,
- copied domainblacklist-urlfilter.txt to the conf/ directory,
- enabled the plugin in nutch-site.xml (<name>plugin.includes</name><value>urlfilter-domainblacklist|protocol-httpclient[....]),
- ran the generate command: bin/nutch generate c1/crawldb c1/segments -topN 50000 -noFilter
- started a fetch step...

...and Nutch is still fetching URLs from the blacklist. Did I miss something? Can the -noFilter option interfere with the domainblacklist plugin? I guess -noFilter refers only to regex-urlfilter, am I right?

The only thing I can see in the log is that the plugin was activated properly:

INFO domainblacklist.DomainBlacklistURLFilter - Attribute "file" is defined for plugin urlfilter-domainblacklist as domainblacklist-urlfilter.txt
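In case it helps, here is roughly what the relevant bits of my setup look like (the plugin.includes value is abbreviated, example.com/example.org stand in for my real blacklist entries, and c1/hostdb is just where I happen to keep the hostdb):

conf/nutch-site.xml (excerpt):

  <property>
    <name>plugin.includes</name>
    <value>urlfilter-domainblacklist|protocol-httpclient|...</value>
  </property>

conf/domainblacklist-urlfilter.txt (one domain per line, copied from the readhostdb dump):

  example.com
  example.org

and the dump itself came from:

  bin/nutch updatehostdb -hostdb c1/hostdb -crawldb c1/crawldb
  bin/nutch readhostdb c1/hostdb output -dumpHostnames -expr '(numRecords >= 100)'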
Tomasz

2016-02-24 15:48 GMT+01:00 Tomasz <[email protected]>:

> Oh, great. Will try with 1.12, thanks.
>
> 2016-02-24 15:39 GMT+01:00 Markus Jelsma <[email protected]>:
>
>> Hi - oh crap. I forgot I just committed it to 1.12-SNAPSHOT, it is not in
>> the 1.11 release. You can fetch trunk or Nutch 1.12-SNAPSHOT for that
>> feature!
>> Markus
>>
>> -----Original message-----
>> > From: Tomasz <[email protected]>
>> > Sent: Wednesday 24th February 2016 15:26
>> > To: [email protected]
>> > Subject: Re: Limit number of pages per host/domain
>> >
>> > Thanks a lot Markus. Unfortunately I forgot to mention that I use
>> > Nutch 1.11, but hostdb works only with 2.x, I guess.
>> >
>> > Tomasz
>> >
>> > 2016-02-24 11:53 GMT+01:00 Markus Jelsma <[email protected]>:
>> >
>> > > Hello - this is possible using the HostDB. If you run updatehostdb
>> > > frequently, you get statistics on the number of fetched, redirected,
>> > > 404 and unfetched records for any given host. Using readhostdb and a
>> > > Jexl expression, you can then emit a blacklist of hosts that you can
>> > > use during generate.
>> > >
>> > > # Update the hostdb
>> > > bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb/
>> > >
>> > > # Get the list of hosts that have over 100 records fetched or not modified
>> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100)'
>> > >
>> > > # Or get the list of hosts that have over 100 records in total
>> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(numRecords >= 100)'
>> > >
>> > > The list of fields that can be used in expressions is at lines 93-104 of:
>> > >
>> > > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/hostdb/ReadHostDb.java?view=markup
>> > >
>> > > You now have a list of hostnames that you can use with the
>> > > domainblacklist-urlfilter at the generate step.
>> > >
>> > > Markus
>> > >
>> > > -----Original message-----
>> > > > From: Tomasz <[email protected]>
>> > > > Sent: Wednesday 24th February 2016 11:30
>> > > > To: [email protected]
>> > > > Subject: Limit number of pages per host/domain
>> > > >
>> > > > Hello,
>> > > >
>> > > > One can set generate.max.count to limit the number of URLs per domain
>> > > > or host in the next fetch step. But is there a way to limit the number
>> > > > of fetched URLs per domain/host over the whole crawl process? Suppose
>> > > > I run the generate/fetch/update cycle 6 times and want to limit the
>> > > > number of URLs per host to 100 pages and no more in the whole crawldb.
>> > > > How can I achieve that?
>> > > >
>> > > > Regards,
>> > > > Tomasz

