Hello Tomasz - see inline.
Markus

-----Original message-----
> From: Tomasz <[email protected]>
> Sent: Tuesday 1st March 2016 22:23
> To: [email protected]
> Subject: Re: Limit number of pages per host/domain
>
> Yes, I'm sure the fetchers have the same amount of work. Now I've just
> doubled the fetchers to 128 and the load rose to 3.6, but that is still a
> huge change compared to 20.0 (with only 64 threads). There is one
> difference: I set up Nutch 1.11 from the compiled bin release downloaded
> from Apache, but for 1.12 I downloaded the source code and built it myself
> on the machine with ant, since no bin release is available yet. Maybe that
> is what it's all about. Not sure, I don't know the JVM well.
Curious. If I remember correctly, Nutch is still set to compile against Java
7, even if you have Java 8 on your machine. I don't know what would cause the
difference. I know there are some improvements in 8, but those are runtime
improvements and should not depend on the JVM you compile with. Any other
contributor or committer here who can answer this?

> readhostdb generated a few files. I merged them and saved the result as
> conf/domainblacklist-urlfilter.txt and the list seems to be reliable.
> Sorry, but I'm not sure what you mean by asking "confirm readhostdb
> generated the output for the filter".

You just confirmed it. A bunch of files were emitted and you merged them.
Seems fine as long as it contains hostnames (the -hostname switch).

> You're saying that -noFilter is to blame for domainblacklist-urlfilter.txt
> not being used. That makes sense. URL filtering is active only during the
> parsing step, which is done by the fetcher (I don't store the content).
> parse.filter.urls = true, so the filtering is disabled on the generate and
> update steps. What I'm going to do is run the generate step without
> -noFilter and temporarily replace the regex-urlfilter file with some light
> regexes, since my original regex-urlfilter is too heavy.

During each step you can use a -Dplugin.includes="..." override. We use this
at each phase to control exactly which urlfilter is active - see the example
further down.

> Thanks Markus for all your hints :)

Good luck!

> 2016-03-01 20:32 GMT+01:00 Markus Jelsma <[email protected]>:
>
> > Hello, see inline.
> >
> > Regards,
> > Markus
> >
> > -----Original message-----
> > > From: Tomasz <[email protected]>
> > > Sent: Tuesday 1st March 2016 18:07
> > > To: [email protected]
> > > Subject: Re: Limit number of pages per host/domain
> > >
> > > I've been running Nutch 1.12 for two days (btw, I noticed a significant
> > > load drop during fetching compared to 1.11: it dropped from 20 to 1.5
> > > with 64 fetchers running). Anyway, I tried to use the domainblacklist
> > > plugin but it didn't work. This is what I did:
> >
> > That is odd; to my knowledge, nothing we did in Nutch should cause the
> > load to drop so much. Are you sure your fetchers still have the same
> > amount of work to do as before?
> >
> > > - I prepared the domain list with updatehostdb/readhostdb,
> > > - cp domainblacklist-urlfilter.txt to the conf/ directory,
> > > - enabled the plugin in nutch-site.xml
> > >   (<name>plugin.includes</name><value>urlfilter-domainblacklist|protocol-httpclient[....])
> > > - ran the generate command
> > >   bin/nutch generate c1/crawldb c1/segments -topN 50000 -noFilter
> > > - started a fetch step...
> >
> > Did you confirm readhostdb generated the output for the filter? Also, the
> > -noFilter at the generate step disables filtering. Make sure you don't
> > run urlfilter-regex and other heavy filters during the generate and
> > updatedb steps. Do it only on updatedb if you have changed the regex
> > config file.
> >
> > > ...and Nutch is still fetching URLs from the blacklist. Did I miss
> > > something? Can the -noFilter option interfere with the domainblacklist
> > > plugin? I guess -noFilter refers to regex-urlfilter, am I right? I can
> > > only see in the log that the plugin was properly activated:
> >
> > -noFilter disables all filters. So without it, urlfilter-regex, if you
> > have it configured, will run as well, and it is too heavy.
> >
> > > INFO domainblacklist.DomainBlacklistURLFilter - Attribute "file" is
> > > defined for plugin urlfilter-domainblacklist as
> > > domainblacklist-urlfilter.txt
> >
> > This should be fine.
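
To make the plugin.includes point above concrete, something along these lines
is what I had in mind. The plugin list here is only a placeholder - take your
real list from nutch-site.xml and just swap the urlfilter entries per step -
and the c1/ paths are simply the ones from your commands. If I remember
correctly the -D overrides have to come right after the command name, and
updatedb only applies filters when you also pass -filter:

# generate: leave filtering on (no -noFilter), but activate only the cheap
# domainblacklist filter for this step
bin/nutch generate -Dplugin.includes='protocol-httpclient|urlfilter-domainblacklist|<your other plugins>' c1/crawldb c1/segments -topN 50000

# updatedb: pull in urlfilter-regex only when the regex config has changed
bin/nutch updatedb -Dplugin.includes='protocol-httpclient|urlfilter-(domainblacklist|regex)|<your other plugins>' c1/crawldb c1/segments/<segment> -filter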
> > > Tomasz
> > >
> > > 2016-02-24 15:48 GMT+01:00 Tomasz <[email protected]>:
> > > >
> > > > Oh, great. Will try with 1.12, thanks.
> > > >
> > > > 2016-02-24 15:39 GMT+01:00 Markus Jelsma <[email protected]>:
> > > >>
> > > >> Hi - oh crap. I forgot I just committed it to 1.12-SNAPSHOT, it is
> > > >> not in the 1.11 release. You can fetch trunk or NUTCH-1.12-SNAPSHOT
> > > >> for that feature!
> > > >> Markus
> > > >>
> > > >> -----Original message-----
> > > >> > From: Tomasz <[email protected]>
> > > >> > Sent: Wednesday 24th February 2016 15:26
> > > >> > To: [email protected]
> > > >> > Subject: Re: Limit number of pages per host/domain
> > > >> >
> > > >> > Thanks a lot Markus. Unfortunately I forgot to mention I use Nutch
> > > >> > 1.11, but hostdb works only with 2.x I guess.
> > > >> >
> > > >> > Tomasz
> > > >> >
> > > >> > 2016-02-24 11:53 GMT+01:00 Markus Jelsma <[email protected]>:
> > > >> > >
> > > >> > > Hello - this is possible using the HostDB. If you run
> > > >> > > updatehostdb frequently you get statistics on the number of
> > > >> > > fetched, redirected, 404 and unfetched records for any given
> > > >> > > host. Using readhostdb and a Jexl expression, you can then emit
> > > >> > > a blacklist of hosts that you can use during generate.
> > > >> > >
> > > >> > > # Update the hostdb
> > > >> > > bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb/
> > > >> > >
> > > >> > > # Get a list of hosts that have over 100 records fetched or not modified
> > > >> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100)'
> > > >> > >
> > > >> > > # Or get a list of hosts that have over 100 records in total
> > > >> > > bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(numRecords >= 100)'
> > > >> > >
> > > >> > > List of fields that are expressible (lines 93-104):
> > > >> > >
> > > >> > > http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/hostdb/ReadHostDb.java?view=markup
> > > >> > >
> > > >> > > You now have a list of hostnames that you can use with the
> > > >> > > domainblacklist-urlfilter at the generate step.
> > > >> > >
> > > >> > > Markus
> > > >> > >
> > > >> > > -----Original message-----
> > > >> > > > From: Tomasz <[email protected]>
> > > >> > > > Sent: Wednesday 24th February 2016 11:30
> > > >> > > > To: [email protected]
> > > >> > > > Subject: Limit number of pages per host/domain
> > > >> > > >
> > > >> > > > Hello,
> > > >> > > >
> > > >> > > > One can set generate.max.count to limit the number of URLs per
> > > >> > > > domain or host in the next fetch step. But is there a way to
> > > >> > > > limit the number of fetched URLs per domain/host in the whole
> > > >> > > > crawl process? Suppose I run the generate/fetch/update cycle 6
> > > >> > > > times and want to limit the number of URLs per host to 100
> > > >> > > > (pages) and no more in the whole crawldb. How can I achieve
> > > >> > > > that?
> > > >> > > >
> > > >> > > > Regards,
> > > >> > > > Tomasz
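
To wrap the thread up, here is a rough sketch of the whole cycle discussed
above. The c1/ paths, the 100-record threshold and the part-* merge are only
examples on my side (I'm assuming readhostdb wrote its usual part-* output
files, which is what you merged), so adjust them to your own setup:

# after each updatedb, refresh the hostdb with the latest counts
bin/nutch updatehostdb -hostdb c1/hostdb -crawldb c1/crawldb

# emit the hostnames that already have 100 or more records
bin/nutch readhostdb c1/hostdb c1/hostdb_out -dumpHostnames -expr '(numRecords >= 100)'

# merge the output into the file the domainblacklist filter reads
cat c1/hostdb_out/part-* > conf/domainblacklist-urlfilter.txt

# generate the next segment without -noFilter so the blacklist is applied;
# generate.max.count additionally caps what a single segment takes
# (counted per host by default, or per domain via generate.count.mode)
bin/nutch generate -Dgenerate.max.count=100 c1/crawldb c1/segments -topN 50000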

