Hi - Please use the -noFilter option. Filtering in the generator is usually unnecessary because the URLs have already been filtered in the parse and/or update step.
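
As a rough sketch, the generate call with filtering disabled might look like the following. The crawldb and segments paths and the topN value are taken from your logs; running from the Nutch runtime directory via bin/nutch is an assumption, so adjust to your setup.

```shell
# Assumption: run from the Nutch runtime directory, with the crawldb at
# ./crawldb and segments under ./segments (as the logs and stats suggest).
# -noFilter skips URL filtering during generation; -topN 3000 matches the logs.
bin/nutch generate crawldb segments -topN 3000 -noFilter
```

With this flag the generator log should report "filtering: false" instead of "filtering: true".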

-----Original message-----
> From:Mohammad wrk <[email protected]>
> Sent: Mon 12-Nov-2012 18:43
> To: [email protected]
> Subject: very slow generator step
>
> Hi,
>
> The generator time has gone from 8 minutes to 106 minutes few days ago and
> stayed there since then. AFAIK, I haven't made any configuration changes
> recently (attached you can find some of the configurations that I thought
> might be related).
>
> A quick CPU sampling shows that most of the time is spent on
> java.util.regex.Matcher.find(). Since I'm using default regex configurations
> and my crawldb has only 3,052,412 urls, I was wondering if this is a known
> issue with nutch-1.5.1 ?
>
> Here are some more information that might help:
>
> ===================== Generator logs
> 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: starting at 2012-11-09 03:14:50
> 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: filtering: true
> 2012-11-09 03:14:50,920 INFO crawl.Generator - Generator: normalizing: true
> 2012-11-09 03:14:50,921 INFO crawl.Generator - Generator: topN: 3000
> 2012-11-09 03:14:50,923 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
> 2012-11-09 03:23:39,741 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
> 2012-11-09 03:23:40,743 INFO crawl.Generator - Generator: segment: segments/20121109032340
> 2012-11-09 03:23:47,860 INFO crawl.Generator - Generator: finished at 2012-11-09 03:23:47, elapsed: 00:08:56
> 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: starting at 2012-11-09 05:35:14
> 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: filtering: true
> 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: normalizing: true
> 2012-11-09 05:35:14,033 INFO crawl.Generator - Generator: topN: 3000
> 2012-11-09 05:35:14,037 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
> 2012-11-09 07:21:42,840 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
> 2012-11-09 07:21:43,841 INFO crawl.Generator - Generator: segment: segments/20121109072143
> 2012-11-09 07:21:51,004 INFO crawl.Generator - Generator: finished at 2012-11-09 07:21:51, elapsed: 01:46:36
>
> ===================== CrawlDb statistics
> CrawlDb statistics start: ./crawldb
> Statistics for CrawlDb: ./crawldb
> TOTAL urls:3052412
> retry 0:3047404
> retry 1:338
> retry 2:1192
> retry 3:822
> retry 4:336
> retry 5:2320
> min score:0.0
> avg score:0.015368268
> max score:48.608
> status 1 (db_unfetched):2813249
> status 2 (db_fetched):196717
> status 3 (db_gone):14204
> status 4 (db_redir_temp):10679
> status 5 (db_redir_perm):17563
> CrawlDb statistics: done
>
> ===================== System info
> Memory: 4 GB
> CPUs: Intel® Core™ i3-2310M CPU @ 2.10GHz × 4
> Available diskspace: 171.7 GB
> OS: Release 12.10 (quantal) 64-bit
>
>
> Thanks,
> Mohammad
>

