The automaton filter (urlfilter-automaton) is fast, but you may need to rewrite
your expressions because it does not support every feature of traditional regex.
Benchmarks:
http://wiki.apache.org/nutch/RegexURLFiltersBenchs
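
For anyone who wants to do the debugging Julien mentions below, here is a
minimal Java sketch. It is not Nutch code. It assumes rules in the "+regex" /
"-regex" style of regex-urlfilter.txt, where the first pattern found in the URL
decides and a URL that matches nothing is rejected. It times java.util.regex
against each URL so that an outlier stands out. The class RegexRuleTimer, its
methods, and the sample rules and URLs are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): times regex rules against URLs. */
class RegexRuleTimer {

    /** One rule: accept is true for a '+' rule, false for a '-' rule. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    /** Parses "+regex" / "-regex" lines; skips blank lines and '#' comments. */
    static List<Rule> parse(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("#")) {
                continue;
            }
            char sign = t.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + t);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(t.substring(1))));
        }
        return rules;
    }

    /** The first rule whose pattern is found in the URL decides; no match rejects. */
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    /** Nanoseconds spent running all rules against a single URL. */
    static long timeNanos(List<Rule> rules, String url) {
        long start = System.nanoTime();
        accepts(rules, url);
        return System.nanoTime() - start;
    }

    /** The URL that took longest to filter, or null if the list is empty. */
    static String slowest(List<Rule> rules, List<String> urls) {
        String worst = null;
        long worstNanos = -1;
        for (String url : urls) {
            long t = timeNanos(rules, url);
            if (t > worstNanos) {
                worstNanos = t;
                worst = url;
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        // Illustrative rules and URLs only; substitute your own rule file and crawldb dump.
        List<Rule> rules = parse(Arrays.asList(
            "# reject URLs that look like queries",
            "-[?*!@=]",
            "# accept everything else",
            "+."));
        List<String> urls = Arrays.asList(
            "http://example.com/",
            "http://example.com/search?q=1");
        for (String url : urls) {
            System.out.println(accepts(rules, url) + "\t" + timeNanos(rules, url) + " ns\t" + url);
        }
        System.out.println("slowest: " + slowest(rules, urls));
    }
}
```

Single timings are noisy because of JIT warm-up, so only differences of several
orders of magnitude between URLs are meaningful here.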

 
 
-----Original message-----
> From:Mohammad wrk <[email protected]>
> Sent: Mon 12-Nov-2012 22:17
> To: [email protected]
> Subject: Re: very slow generator step
> 
> 
> 
> That's good thinking. I have never used the url-filter automaton. Where can I 
> find more info?
> 
> Thanks,
> Mohammad
> 
> ________________________________
>  From: Julien Nioche <[email protected]>
> To: [email protected]; Mohammad wrk <[email protected]> 
> Sent: Monday, November 12, 2012 12:38:44 PM
> Subject: Re: very slow generator step
>  
> Could be that a particularly long and tricky URL got into your crawldb and
> put the regex into a spin. I'd use the url-filter automaton instead as it
> is much faster. Would be interesting to know what caused the regex to take
> so much time, in case you fancy a bit of debugging ;-)
> 
> Julien
> 
> On 12 November 2012 20:29, Mohammad wrk <[email protected]> wrote:
> 
> > Thanks for the tip. It went down to 2 minutes :-)
> >
> > What I don't understand is how everything was working fine with the
> > default configuration for about 4 days and then, all of a sudden, one crawl
> > caused a jump of 100 minutes?
> >
> > Cheers,
> > Mohammad
> >
> >
> > ________________________________
> >  From: Markus Jelsma <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Sent: Monday, November 12, 2012 11:19:11 AM
> > Subject: RE: very slow generator step
> >
> > Hi - Please use the -noFilter option. Filtering in the generator is usually
> > unnecessary because the URLs have already been filtered in the parse step
> > and/or the update step.
> >
> >
> >
> > -----Original message-----
> > > From:Mohammad wrk <[email protected]>
> > > Sent: Mon 12-Nov-2012 18:43
> > > To: [email protected]
> > > Subject: very slow generator step
> > >
> > > Hi,
> > >
> > > > The generator time went from 8 minutes to 106 minutes a few days ago and
> > > > has stayed there since. AFAIK, I haven't made any configuration changes
> > > > recently (attached you can find some of the configurations that I thought
> > > > might be related).
> > >
> > > > A quick CPU sampling shows that most of the time is spent in
> > > > java.util.regex.Matcher.find(). Since I'm using the default regex
> > > > configuration and my crawldb has only 3,052,412 urls, I was wondering if
> > > > this is a known issue with nutch-1.5.1?
> > >
> > > > Here is some more information that might help:
> > >
> > > ===================== Generator logs
> > > > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: starting at 2012-11-09 03:14:50
> > > > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> > > > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: filtering: true
> > > > 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: normalizing: true
> > > > 2012-11-09 03:14:50,921 INFO  crawl.Generator - Generator: topN: 3000
> > > > 2012-11-09 03:14:50,923 INFO  crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
> > > > 2012-11-09 03:23:39,741 INFO  crawl.Generator - Generator: Partitioning selected urls for politeness.
> > > > 2012-11-09 03:23:40,743 INFO  crawl.Generator - Generator: segment: segments/20121109032340
> > > > 2012-11-09 03:23:47,860 INFO  crawl.Generator - Generator: finished at 2012-11-09 03:23:47, elapsed: 00:08:56
> > > > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: starting at 2012-11-09 05:35:14
> > > > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> > > > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: filtering: true
> > > > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: normalizing: true
> > > > 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: topN: 3000
> > > > 2012-11-09 05:35:14,037 INFO  crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
> > > > 2012-11-09 07:21:42,840 INFO  crawl.Generator - Generator: Partitioning selected urls for politeness.
> > > > 2012-11-09 07:21:43,841 INFO  crawl.Generator - Generator: segment: segments/20121109072143
> > > > 2012-11-09 07:21:51,004 INFO  crawl.Generator - Generator: finished at 2012-11-09 07:21:51, elapsed: 01:46:36
> > >
> > > ===================== CrawlDb statistics
> > > CrawlDb statistics start: ./crawldb
> > > Statistics for CrawlDb: ./crawldb
> > > TOTAL urls:3052412
> > > retry 0:3047404
> > > retry 1:338
> > > retry 2:1192
> > > retry 3:822
> > > retry 4:336
> > > retry 5:2320
> > > min score:0.0
> > > avg score:0.015368268
> > > max score:48.608
> > > status 1 (db_unfetched):2813249
> > > status 2 (db_fetched):196717
> > > status 3 (db_gone):14204
> > > status 4 (db_redir_temp):10679
> > > status 5 (db_redir_perm):17563
> > > CrawlDb statistics: done
> > >
> > > ===================== System info
> > > Memory: 4 GB
> > > CPUs: Intel® Core™ i3-2310M CPU @ 2.10GHz × 4
> > > Available diskspace: 171.7 GB
> > > OS: Release 12.10 (quantal) 64-bit
> > >
> > >
> > > Thanks,
> > > Mohammad
> > >
> >
> 
> 
> 
> -- 
> *
> *Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
