If you were seeing poor performance with urlfilter-regex, switching
directly to urlfilter-automaton may or may not help. As Julien pointed
out, the slowdown may be caused by a few nasty urls that take a long time
to match. To check this, you can run the urlfilter-regex plugin standalone (
http://wiki.apache.org/nutch/bin/nutch%20plugin) and pass all the urls to
it. With a minor tweak you can dump the time taken for each url. If you are
sure that the low perf is not due to nasty urls, switching to
urlfilter-automaton is the best thing to do. Design the rules in
automaton-urlfilter.txt carefully, as the automaton syntax supports only a
subset of regex features.
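
For a rough first check outside Nutch, something like the sketch below can
time each url against your rules and list the slowest ones. It does not use
the Nutch plugin API. The class name UrlRegexTimer, the Timing helper and
the rule list are made up for illustration, and the single rule is only
modelled on the loop-breaking rule in the default regex-urlfilter.txt (an
assumption, so paste in your own rules).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: times regex matching per url.
class UrlRegexTimer {

    /** One measurement: a url and the nanoseconds spent matching it. */
    static final class Timing {
        final String url;
        final long nanos;

        Timing(String url, long nanos) {
            this.url = url;
            this.nanos = nanos;
        }
    }

    /** Runs find() for every pattern on the url and returns the total elapsed time. */
    static long timeNanos(List<Pattern> patterns, String url) {
        long start = System.nanoTime();
        for (Pattern p : patterns) {
            p.matcher(url).find();
        }
        return System.nanoTime() - start;
    }

    /** Times every url and returns the measurements, slowest first. */
    static List<Timing> slowest(List<Pattern> patterns, List<String> urls) {
        List<Timing> out = new ArrayList<>();
        for (String url : urls) {
            out.add(new Timing(url, timeNanos(patterns, url)));
        }
        out.sort(Comparator.comparingLong((Timing t) -> t.nanos).reversed());
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: illustrative rule only; replace with the regexes from
        // your own regex-urlfilter.txt (without the leading +/- sign).
        List<Pattern> patterns = Arrays.asList(
                Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/"));

        // Read one url per line from standard input.
        List<String> urls = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    urls.add(line.trim());
                }
            }
        }

        // Print the 20 slowest urls as "nanoseconds<TAB>url".
        List<Timing> timings = slowest(patterns, urls);
        for (Timing t : timings.subList(0, Math.min(20, timings.size()))) {
            System.out.println(t.nanos + "\t" + t.url);
        }
    }
}
```

Feed it a plain-text file with one url per line, e.g.
"java UrlRegexTimer < urls.txt" (urls.txt being a hypothetical dump of your
urls). A single run gives only coarse timings, but a url that sends the
regex into a spin should stand out by orders of magnitude.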

Crawlspace expansion could also be a reason, i.e. nutch suddenly found a
huge number of urls. This happened to me when nutch crawled sitemap pages
which had an enormous number of outlinks. You can check this by comparing
the fetched count of the earlier rounds with that of the recent round.

On top of everything, I agree with what Markus suggested, i.e. using the
-noFilter option for generate. It gives good perf. The update phase already
prevents unwanted urls from being added, so there is no need to filter
again in generate (unless you want to do a custom crawl of some specific
hosts or urls and get their data quickly).

thanks,
Tejas


On Mon, Nov 12, 2012 at 1:21 PM, Markus Jelsma
<[email protected]> wrote:

> You may need to change your expressions but it is performant. Not all
> features of traditional regex are supported.
> http://wiki.apache.org/nutch/RegexURLFiltersBenchs
>
>
>
> -----Original message-----
> > From:Mohammad wrk <[email protected]>
> > Sent: Mon 12-Nov-2012 22:17
> > To: [email protected]
> > Subject: Re: very slow generator step
> >
> >
> >
> > That's a good thinking. I have never used url-filter automation. Where
> can I find more info?
> >
> > Thanks,
> > Mohammad
> >
> > ________________________________
> >  From: Julien Nioche <[email protected]>
> > To: [email protected]; Mohammad wrk <[email protected]>
> > Sent: Monday, November 12, 2012 12:38:44 PM
> > Subject: Re: very slow generator step
> >
> > Could be that a particularly long and tricky URL got into your crawldb
> and
> > put the regex into a spin. I'd use the url-filter automaton instead as it
> > is much faster. Would be interesting to know what caused the regex to
> take
> > so much time, in case you fancy a bit of debugging ;-)
> >
> > Julien
> >
> > On 12 November 2012 20:29, Mohammad wrk <[email protected]> wrote:
> >
> > > Thanks for the tip. It went down to 2 minutes :-)
> > >
> > > What I don't understand is that how come everything was working fine
> with
> > > the default configuration for about 4 days and all of a sudden one
> crawl
> > > causes a jump of 100 minutes?
> > >
> > > Cheers,
> > > Mohammad
> > >
> > >
> > > ________________________________
> > >  From: Markus Jelsma <[email protected]>
> > > To: "[email protected]" <[email protected]>
> > > Sent: Monday, November 12, 2012 11:19:11 AM
> > > Subject: RE: very slow generator step
> > >
> > > Hi - Please use the -noFilter option. It is usually useless to filter
> in
> > > the generator because they've already been filtered in the parse step
> and
> > > or update step.
> > >
> > >
> > >
> > > -----Original message-----
> > > > From:Mohammad wrk <[email protected]>
> > > > Sent: Mon 12-Nov-2012 18:43
> > > > To: [email protected]
> > > > Subject: very slow generator step
> > > >
> > > > Hi,
> > > >
> > > > The generator time has gone from 8 minutes to 106 minutes few days
> ago
> > > and stayed there since then. AFAIK, I haven't made any configuration
> > > changes recently (attached you can find some of the configurations
> that I
> > > thought might be related).
> > > >
> > > > A quick CPU sampling shows that most of the time is spent on
> > > java.util.regex.Matcher.find(). Since I'm using default regex
> > > configurations and my crawldb has only 3,052,412 urls, I was wondering
> if
> > > this is a known issue with nutch-1.5.1 ?
> > > >
> > > > Here are some more information that might help:
> > > >
> > > > ===================== Generator logs
> > > > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: starting
> at
> > > 2012-11-09 03:14:50
> > > > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: Selecting
> > > best-scoring urls due for fetch.
> > > > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: filtering:
> > > true
> > > > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator:
> normalizing:
> > > true
> > > > 2012-11-09 03:14:50,921 INFO  crawl.Generator - Generator: topN: 3000
> > > > 2012-11-09 03:14:50,923 INFO  crawl.Generator - Generator:
> jobtracker is
> > > 'local', generating exactly one partition.
> > > > 2012-11-09 03:23:39,741 INFO  crawl.Generator - Generator:
> Partitioning
> > > selected urls for politeness.
> > > > 2012-11-09 03:23:40,743 INFO  crawl.Generator - Generator: segment:
> > > segments/20121109032340
> > > > 2012-11-09 03:23:47,860 INFO  crawl.Generator - Generator: finished
> at
> > > 2012-11-09 03:23:47, elapsed: 00:08:56
> > > > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: starting
> at
> > > 2012-11-09 05:35:14
> > > > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: Selecting
> > > best-scoring urls due for fetch.
> > > > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: filtering:
> > > true
> > > > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator:
> normalizing:
> > > true
> > > > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: topN: 3000
> > > > 2012-11-09 05:35:14,037 INFO  crawl.Generator - Generator:
> jobtracker is
> > > 'local', generating exactly one partition.
> > > > 2012-11-09 07:21:42,840 INFO  crawl.Generator - Generator:
> Partitioning
> > > selected urls for politeness.
> > > > 2012-11-09 07:21:43,841 INFO  crawl.Generator - Generator: segment:
> > > segments/20121109072143
> > > > 2012-11-09 07:21:51,004 INFO  crawl.Generator - Generator: finished
> at
> > > 2012-11-09 07:21:51, elapsed: 01:46:36
> > > >
> > > > ===================== CrawlDb statistics
> > > > CrawlDb statistics start: ./crawldb
> > > > Statistics for CrawlDb: ./crawldb
> > > > TOTAL urls:3052412
> > > > retry 0:3047404
> > > > retry 1:338
> > > > retry 2:1192
> > > > retry 3:822
> > > > retry 4:336
> > > > retry 5:2320
> > > > min score:0.0
> > > > avg score:0.015368268
> > > > max score:48.608
> > > > status 1 (db_unfetched):2813249
> > > > status 2 (db_fetched):196717
> > > > status 3 (db_gone):14204
> > > > status 4 (db_redir_temp):10679
> > > > status 5 (db_redir_perm):17563
> > > > CrawlDb statistics: done
> > > >
> > > > ===================== System info
> > > > Memory: 4 GB
> > > > CPUs: Intel® Core™ i3-2310M CPU @ 2.10GHz × 4
> > > > Available diskspace: 171.7 GB
> > > > OS: Release 12.10 (quantal) 64-bit
> > > >
> > > >
> > > > Thanks,
> > > > Mohammad
> > > >
> > >
> >
> >
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
>
