The automaton URL filter is fast, but you may need to change your expressions because it does not support all features of traditional regex. Benchmarks: http://wiki.apache.org/nutch/RegexURLFiltersBenchs
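
If you do want to find out which URL put the regex into a spin, a minimal Java sketch along these lines might help. The class name SlowUrlFinder and the sample URLs are made up for illustration. The pattern is written from memory to resemble the loop-breaking rule in the default regex-urlfilter.txt, so treat it as an assumption and substitute your own rules. As far as I know, the backreference (\1) it uses is one of the features the automaton filter does not support.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class SlowUrlFinder {
    // Assumption: resembles the rule that skips URLs with a repeating
    // slash-delimited segment; replace with the rules you actually use.
    static final Pattern LOOP_RULE =
            Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Runs Matcher.find() once per URL and records the elapsed nanoseconds,
    // keeping the URLs in their input order.
    static Map<String, Long> timeMatches(Pattern rule, List<String> urls) {
        Map<String, Long> nanos = new LinkedHashMap<>();
        for (String url : urls) {
            long start = System.nanoTime();
            rule.matcher(url).find();
            nanos.put(url, System.nanoTime() - start);
        }
        return nanos;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: one short URL and one long URL with
        // many distinct path segments.
        StringBuilder longUrl = new StringBuilder("http://example.com");
        for (int i = 0; i < 200; i++) {
            longUrl.append("/seg").append(i);
        }
        List<String> urls = new ArrayList<>();
        urls.add("http://example.com/a/b/c/");
        urls.add(longUrl.toString());

        // Print the time per URL, truncating long URLs for readability.
        for (Map.Entry<String, Long> e : timeMatches(LOOP_RULE, urls).entrySet()) {
            String url = e.getKey();
            String shown = url.length() > 60 ? url.substring(0, 60) + "..." : url;
            System.out.println(e.getValue() + " ns  " + shown);
        }
    }
}
```

To use it on real data you would feed it URLs dumped from the crawldb instead of the sample list and look for the entries with the largest times.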

-----Original message-----
> From:Mohammad wrk <[email protected]>
> Sent: Mon 12-Nov-2012 22:17
> To: [email protected]
> Subject: Re: very slow generator step
>
> That's a good thinking. I have never used url-filter automation. Where can I
> find more info?
>
> Thanks,
> Mohammad
>
> ________________________________
> From: Julien Nioche <[email protected]>
> To: [email protected]; Mohammad wrk <[email protected]>
> Sent: Monday, November 12, 2012 12:38:44 PM
> Subject: Re: very slow generator step
>
> Could be that a particularly long and tricky URL got into your crawldb and
> put the regex into a spin. I'd use the url-filter automaton instead as it
> is much faster. Would be interesting to know what caused the regex to take
> so much time, in case you fancy a bit of debugging ;-)
>
> Julien
>
> On 12 November 2012 20:29, Mohammad wrk <[email protected]> wrote:
>
> > Thanks for the tip. It went down to 2 minutes :-)
> >
> > What I don't understand is that how come everything was working fine with
> > the default configuration for about 4 days and all of a sudden one crawl
> > causes a jump of 100 minutes?
> >
> > Cheers,
> > Mohammad
> >
> > ________________________________
> > From: Markus Jelsma <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Sent: Monday, November 12, 2012 11:19:11 AM
> > Subject: RE: very slow generator step
> >
> > Hi - Please use the -noFilter option. It is usually useless to filter in
> > the generator because they've already been filtered in the parse step and
> > or update step.
> >
> > -----Original message-----
> > > From:Mohammad wrk <[email protected]>
> > > Sent: Mon 12-Nov-2012 18:43
> > > To: [email protected]
> > > Subject: very slow generator step
> > >
> > > Hi,
> > >
> > > The generator time has gone from 8 minutes to 106 minutes few days ago
> > > and stayed there since then. AFAIK, I haven't made any configuration
> > > changes recently (attached you can find some of the configurations that I
> > > thought might be related).
> > >
> > > A quick CPU sampling shows that most of the time is spent on
> > > java.util.regex.Matcher.find(). Since I'm using default regex
> > > configurations and my crawldb has only 3,052,412 urls, I was wondering if
> > > this is a known issue with nutch-1.5.1 ?
> > >
> > > Here are some more information that might help:
> > >
> > > ===================== Generator logs
> > > 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: starting at 2012-11-09 03:14:50
> > > 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> > > 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: filtering: true
> > > 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: normalizing: true
> > > 2012-11-09 03:14:50,921 INFO crawl.Generator - Generator: topN: 3000
> > > 2012-11-09 03:14:50,923 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
> > > 2012-11-09 03:23:39,741 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
> > > 2012-11-09 03:23:40,743 INFO crawl.Generator - Generator: segment: segments/20121109032340
> > > 2012-11-09 03:23:47,860 INFO crawl.Generator - Generator: finished at 2012-11-09 03:23:47, elapsed: 00:08:56
> > > 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: starting at 2012-11-09 05:35:14
> > > 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> > > 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: filtering: true
> > > 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: normalizing: true
> > > 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: topN: 3000
> > > 2012-11-09 05:35:14,037 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
> > > 2012-11-09 07:21:42,840 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
> > > 2012-11-09 07:21:43,841 INFO crawl.Generator - Generator: segment: segments/20121109072143
> > > 2012-11-09 07:21:51,004 INFO crawl.Generator - Generator: finished at 2012-11-09 07:21:51, elapsed: 01:46:36
> > >
> > > ===================== CrawlDb statistics
> > > CrawlDb statistics start: ./crawldb
> > > Statistics for CrawlDb: ./crawldb
> > > TOTAL urls:3052412
> > > retry 0:3047404
> > > retry 1:338
> > > retry 2:1192
> > > retry 3:822
> > > retry 4:336
> > > retry 5:2320
> > > min score:0.0
> > > avg score:0.015368268
> > > max score:48.608
> > > status 1 (db_unfetched):2813249
> > > status 2 (db_fetched):196717
> > > status 3 (db_gone):14204
> > > status 4 (db_redir_temp):10679
> > > status 5 (db_redir_perm):17563
> > > CrawlDb statistics: done
> > >
> > > ===================== System info
> > > Memory: 4 GB
> > > CPUs: Intel® Core™ i3-2310M CPU @ 2.10GHz × 4
> > > Available diskspace: 171.7 GB
> > > OS: Release 12.10 (quantal) 64-bit
> > >
> > > Thanks,
> > > Mohammad
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

