That's good thinking. I have never used the url-filter automaton. Where can I
find more info?

Thanks,
Mohammad

________________________________
 From: Julien Nioche <[email protected]>
To: [email protected]; Mohammad wrk <[email protected]> 
Sent: Monday, November 12, 2012 12:38:44 PM
Subject: Re: very slow generator step
 
Could be that a particularly long and tricky URL got into your crawldb and
put the regex into a spin. I'd use the url-filter automaton instead as it
is much faster. Would be interesting to know what caused the regex to take
so much time, in case you fancy a bit of debugging ;-)
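
To make the "spin" concrete, here is a minimal Java sketch for anyone who wants to try the debugging. It times java.util.regex.Matcher.find() on URLs of growing length. The class name RegexSpinDemo, the helpers matchesLoopRule and longUrl, and the example.com URLs are made up for illustration. The pattern has the same shape as the repeated-path-segment rule that, as far as I know, ships in the default regex-urlfilter.txt; that this rule is the one that stalled the crawl above is only a guess.

```java
import java.util.regex.Pattern;

public class RegexSpinDemo {
    // Assumption: same shape as the "segment repeats 3+ times" rule in
    // regex-urlfilter.txt. The leading ".*" plus backreferences force the
    // engine to retry many split points on URLs that do not match.
    static final Pattern LOOP_RULE =
        Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // True if the rule matches anywhere in the URL (find(), as in the CPU sample).
    public static boolean matchesLoopRule(String url) {
        return LOOP_RULE.matcher(url).find();
    }

    // Builds a hypothetical URL with the given number of distinct path segments.
    public static String longUrl(int segments) {
        StringBuilder sb = new StringBuilder("http://example.com");
        for (int i = 0; i < segments; i++) {
            sb.append("/seg").append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Time a single find() call per URL; the URLs never match, so the
        // engine has to exhaust its backtracking before giving up.
        for (int n : new int[] {50, 100, 200, 400}) {
            String url = longUrl(n);
            long start = System.nanoTime();
            boolean hit = matchesLoopRule(url);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(n + " segments, " + url.length()
                + " chars: match=" + hit + ", " + ms + " ms");
        }
    }
}
```

If the time grows much faster than the URL length, that is the backtracking cost showing up. Running the same timing loop over URLs dumped from the crawldb, one rule at a time, should point at the URL and rule that are responsible.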

Julien

On 12 November 2012 20:29, Mohammad wrk <[email protected]> wrote:

> Thanks for the tip. It went down to 2 minutes :-)
>
> What I don't understand is how everything could work fine with
> the default configuration for about 4 days and then, all of a sudden, one
> crawl causes a jump of 100 minutes?
>
> Cheers,
> Mohammad
>
>
> ________________________________
>  From: Markus Jelsma <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Monday, November 12, 2012 11:19:11 AM
> Subject: RE: very slow generator step
>
> Hi - Please use the -noFilter option. It is usually useless to filter in
> the generator because the URLs have already been filtered in the parse
> and/or update step.
>
>
>
> -----Original message-----
> > From:Mohammad wrk <[email protected]>
> > Sent: Mon 12-Nov-2012 18:43
> > To: [email protected]
> > Subject: very slow generator step
> >
> > Hi,
> >
> > The generator time went from 8 minutes to 106 minutes a few days ago and has stayed there since. AFAIK, I haven't made any configuration changes recently (attached you can find some of the configurations that I thought might be related).
> >
> > A quick CPU sampling shows that most of the time is spent in java.util.regex.Matcher.find(). Since I'm using the default regex configuration and my crawldb has only 3,052,412 URLs, I was wondering if this is a known issue with nutch-1.5.1?
> >
> > Here is some more information that might help:
> >
> > ===================== Generator logs
> > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: starting at 2012-11-09 03:14:50
> > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: filtering: true
> > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: normalizing: true
> > 2012-11-09 03:14:50,921 INFO  crawl.Generator - Generator: topN: 3000
> > 2012-11-09 03:14:50,923 INFO  crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
> > 2012-11-09 03:23:39,741 INFO  crawl.Generator - Generator: Partitioning selected urls for politeness.
> > 2012-11-09 03:23:40,743 INFO  crawl.Generator - Generator: segment: segments/20121109032340
> > 2012-11-09 03:23:47,860 INFO  crawl.Generator - Generator: finished at 2012-11-09 03:23:47, elapsed: 00:08:56
> > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: starting at 2012-11-09 05:35:14
> > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: filtering: true
> > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: normalizing: true
> > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: topN: 3000
> > 2012-11-09 05:35:14,037 INFO  crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
> > 2012-11-09 07:21:42,840 INFO  crawl.Generator - Generator: Partitioning selected urls for politeness.
> > 2012-11-09 07:21:43,841 INFO  crawl.Generator - Generator: segment: segments/20121109072143
> > 2012-11-09 07:21:51,004 INFO  crawl.Generator - Generator: finished at 2012-11-09 07:21:51, elapsed: 01:46:36
> >
> > ===================== CrawlDb statistics
> > CrawlDb statistics start: ./crawldb
> > Statistics for CrawlDb: ./crawldb
> > TOTAL urls:3052412
> > retry 0:3047404
> > retry 1:338
> > retry 2:1192
> > retry 3:822
> > retry 4:336
> > retry 5:2320
> > min score:0.0
> > avg score:0.015368268
> > max score:48.608
> > status 1 (db_unfetched):2813249
> > status 2 (db_fetched):196717
> > status 3 (db_gone):14204
> > status 4 (db_redir_temp):10679
> > status 5 (db_redir_perm):17563
> > CrawlDb statistics: done
> >
> > ===================== System info
> > Memory: 4 GB
> > CPUs: Intel® Core™ i3-2310M CPU @ 2.10GHz × 4
> > Available diskspace: 171.7 GB
> > OS: Release 12.10 (quantal) 64-bit
> >
> >
> > Thanks,
> > Mohammad
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
