Yes, I'm sure the fetchers have the same amount of work. I've just doubled the fetchers to 128 and the load rose to 3.6, but that is still a huge change compared to 20.0 (with only 64 threads). There is one difference: I set up Nutch 1.11 from the compiled bin package downloaded from Apache, but for 1.12 I downloaded the source code and built it myself on the machine with ant, since there is no bin package available yet. Maybe that's what this is all about. I'm not sure, I don't know the JVM well.
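
(For reference, the fetcher thread count can be set with the fetcher.threads.fetch
property in nutch-site.xml, or passed per run with the -threads option of the
fetch step; the snippet below is only a sketch and the segment path is made up:)

<property>
  <name>fetcher.threads.fetch</name>
  <value>128</value>
</property>

# or per fetch run, e.g.:
bin/nutch fetch c1/segments/20160301120000 -threads 128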
readhostdb generated a few files. I merged them and saved the result as
conf/domainblacklist-urlfilter.txt, and the list seems to be reliable.
Sorry, but I'm not sure what you mean by asking to "confirm readhostdb
generated the output for the filter".

You're saying that -noFilter is to blame for domainblacklist-urlfilter.txt
not being used. That makes sense. URL filtering is active only during the
parsing step, which is done by the fetcher (I don't store the content);
parse.filter.urls = true, so the filtering is disabled on the generate and
update steps. What I'm going to do is run the generate step without
-noFilter and, for the time being, replace the regex-urlfilter file with
some light regex, since my original regex-urlfilter is too heavy (see the
sketch at the very bottom of this mail). Thanks Markus for all your hints :)

2016-03-01 20:32 GMT+01:00 Markus Jelsma <markus.jel...@openindex.io>:

> Hello, see inline.
>
> Regards,
> Markus
>
> -----Original message-----
> > From:Tomasz <polish.software.develo...@gmail.com>
> > Sent: Tuesday 1st March 2016 18:07
> > To: user@nutch.apache.org
> > Subject: Re: Limit number of pages per host/domain
> >
> > I've been running Nutch 1.12 for two days (btw. I noticed a significant
> > load drop during fetching compared to 1.11, it dropped from 20 to 1.5
> > with 64 fetchers running). Anyway, I tried to use the domainblacklist
> > plugin but it didn't work. This is what I did:
>
> That is odd, to my knowledge, nothing we did in Nutch should cause the
> load to drop so much. Are you sure your fetchers still have the same
> amount of work to do as before?
>
> > - I prepared the domain list with updatehostdb/readhostdb,
> > - copied domainblacklist-urlfilter.txt to the conf/ directory,
> > - enabled the plugin in nutch-site.xml
> >   (<name>plugin.includes</name><value>urlfilter-domainblacklist|protocol-httpclient[....])
> > - ran the generate command
> >   bin/nutch generate c1/crawldb c1/segments -topN 50000 -noFilter
> > - started a fetch step...
>
> Did you confirm readhostdb generated the output for the filter? Also, the
> -noFilter at the generate step disables filtering. Make sure you don't do
> urlfilter-regex and other heavy filters during generate and updatedb
> steps. Do it only on updatedb if you have changed the regex config file.
>
> > ...and Nutch is still fetching urls from the blacklist. Did I miss
> > something? Can the -noFilter option interfere with the domainblacklist
> > plugin? I guess -noFilter refers to regex-urlfilter, am I right? I can
> > only see in the log that the plugin was properly activated:
>
> -noFilter disables all filters. So without it, urlfilter-regex, if you
> have it configured, will run as well; it is too heavy.
>
> > INFO domainblacklist.DomainBlacklistURLFilter - Attribute "file" is
> > defined for plugin urlfilter-domainblacklist as
> > domainblacklist-urlfilter.txt
>
> This should be fine.
>
> > Tomasz
> >
> >
> > 2016-02-24 15:48 GMT+01:00 Tomasz <polish.software.develo...@gmail.com>:
> >
> > > Oh, great. Will try with 1.12, thanks.
> > >
> > > 2016-02-24 15:39 GMT+01:00 Markus Jelsma <markus.jel...@openindex.io>:
> > >
> > >> Hi - oh crap. I forgot I just committed it to 1.12-SNAPSHOT, it is
> > >> not in the 1.11 release. You can fetch trunk or NUTCH-1.12-SNAPSHOT
> > >> for that feature!
> > >> Markus
> > >>
> > >> -----Original message-----
> > >> > From:Tomasz <polish.software.develo...@gmail.com>
> > >> > Sent: Wednesday 24th February 2016 15:26
> > >> > To: user@nutch.apache.org
> > >> > Subject: Re: Limit number of pages per host/domain
> > >> >
> > >> > Thanks a lot Markus.
> > >> > Unfortunately I forgot to mention that I use Nutch 1.11, but
> > >> > hostdb works only with 2.x, I guess.
> > >> >
> > >> > Tomasz
> > >> >
> > >> > 2016-02-24 11:53 GMT+01:00 Markus Jelsma <markus.jel...@openindex.io>:
> > >> >
> > >> > > Hello - this is possible using the HostDB. If you updatehostdb
> > >> > > frequently you get statistics on the number of fetched, redirs,
> > >> > > 404's and unfetched for any given host. Using readhostdb and a
> > >> > > Jexl expression, you can then emit a blacklist of hosts that you
> > >> > > can use during generate.
> > >> > >
> > >> > > # Update the hostdb
> > >> > > bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb/
> > >> > >
> > >> > > # Get list of hosts that have over 100 records fetched or not modified
> > >> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100)'
> > >> > >
> > >> > > # Or get list of hosts that have over 100 records in total
> > >> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(numRecords >= 100)'
> > >> > >
> > >> > > List of fields that are expressible (line 93-104):
> > >> > >
> > >> > > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/hostdb/ReadHostDb.java?view=markup
> > >> > >
> > >> > > You now have a list of hostnames that you can use with the
> > >> > > domainblacklist-urlfilter at the generate step.
> > >> > >
> > >> > > Markus
> > >> > >
> > >> > > -----Original message-----
> > >> > > > From:Tomasz <polish.software.develo...@gmail.com>
> > >> > > > Sent: Wednesday 24th February 2016 11:30
> > >> > > > To: user@nutch.apache.org
> > >> > > > Subject: Limit number of pages per host/domain
> > >> > > >
> > >> > > > Hello,
> > >> > > >
> > >> > > > One can set generate.max.count to limit the number of urls per
> > >> > > > domain or host in the next fetch step. But is there a way to
> > >> > > > limit the number of fetched urls per domain/host over the whole
> > >> > > > crawl process? Supposing I run the generate/fetch/update cycle
> > >> > > > 6 times and want to limit the number of urls per host to 100
> > >> > > > urls (pages) and not more in the whole crawldb. How can I
> > >> > > > achieve that?
> > >> > > >
> > >> > > > Regards,
> > >> > > > Tomasz
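
PS: the "light" regex-urlfilter.txt mentioned at the top would be nothing
more than a minimal sketch along these lines (in this file the first
matching rule wins, '-' excludes and '+' includes; the extension list is
only an example):

# temporary lightweight filter for the generate step
# skip common static/binary resources
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|js|JS|zip|ZIP|gz|GZ|pdf|PDF)$
# accept everything else
+.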