Hi - Please use the -noFilter option. Filtering in the generator is usually 
unnecessary because the URLs have already been filtered in the parse and/or 
update step.
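
As a rough check on where the time goes, the timestamps in the quoted Generator 
logs can be subtracted. The sketch below is only illustrative: the helper name 
elapsed_seconds is made up for this example, and the timestamps are copied from 
the log lines "starting at" and "Partitioning selected urls for politeness". It 
assumes the interval between those two lines roughly corresponds to the 
selection phase, which is where the generator applies the URL filters when 
filtering is enabled.

```python
from datetime import datetime

# Timestamp layout used in the quoted Nutch log lines.
FMT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# "Generator: starting" -> "Partitioning selected urls", fast run.
fast_select = elapsed_seconds("2012-11-09 03:14:50,920",
                              "2012-11-09 03:23:39,741")

# Same interval for the slow run.
slow_select = elapsed_seconds("2012-11-09 05:35:14,033",
                              "2012-11-09 07:21:42,840")

# Total reported by the slow run: "elapsed: 01:46:36".
slow_total = 1 * 3600 + 46 * 60 + 36

# Share of the slow run spent before partitioning starts.
select_share = slow_select / slow_total

# Average time per crawldb entry, using "TOTAL urls:3052412".
ms_per_url = slow_select / 3052412 * 1000

print(f"fast run, selection: {fast_select:.0f} s")
print(f"slow run, selection: {slow_select:.0f} s")
print(f"slow run, share of total: {select_share:.1%}")
print(f"slow run, per URL: {ms_per_url:.2f} ms")
```

In both runs almost all of the elapsed time falls before partitioning starts, 
which is consistent with the per-URL regex filtering seen in the CPU sampling.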

 
 
-----Original message-----
> From:Mohammad wrk <[email protected]>
> Sent: Mon 12-Nov-2012 18:43
> To: [email protected]
> Subject: very slow generator step
> 
> Hi,
> 
> The generator time went from 8 minutes to 106 minutes a few days ago and 
> has stayed there since. AFAIK, I haven't made any configuration changes 
> recently (attached you can find some of the configurations that I thought 
> might be related). 
> 
> A quick CPU sampling shows that most of the time is spent on 
> java.util.regex.Matcher.find(). Since I'm using default regex configurations 
> and my crawldb has only 3,052,412 urls, I was wondering if this is a known 
> issue with nutch-1.5.1 ?
> 
> Here is some more information that might help:
> 
> ===================== Generator logs
> 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: starting at 
> 2012-11-09 03:14:50
> 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: Selecting 
> best-scoring urls due for fetch.
> 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: filtering: true
> 2012-11-09 03:14:50,920 INFO  crawl.Generator - Generator: normalizing: true
> 2012-11-09 03:14:50,921 INFO  crawl.Generator - Generator: topN: 3000
> 2012-11-09 03:14:50,923 INFO  crawl.Generator - Generator: jobtracker is 
> 'local', generating exactly one partition.
> 2012-11-09 03:23:39,741 INFO  crawl.Generator - Generator: Partitioning 
> selected urls for politeness.
> 2012-11-09 03:23:40,743 INFO  crawl.Generator - Generator: segment: 
> segments/20121109032340
> 2012-11-09 03:23:47,860 INFO  crawl.Generator - Generator: finished at 
> 2012-11-09 03:23:47, elapsed: 00:08:56
> 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: starting at 
> 2012-11-09 05:35:14
> 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: Selecting 
> best-scoring urls due for fetch.
> 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: filtering: true
> 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: normalizing: true
> 2012-11-09 05:35:14,033 INFO  crawl.Generator - Generator: topN: 3000
> 2012-11-09 05:35:14,037 INFO  crawl.Generator - Generator: jobtracker is 
> 'local', generating exactly one partition.
> 2012-11-09 07:21:42,840 INFO  crawl.Generator - Generator: Partitioning 
> selected urls for politeness.
> 2012-11-09 07:21:43,841 INFO  crawl.Generator - Generator: segment: 
> segments/20121109072143
> 2012-11-09 07:21:51,004 INFO  crawl.Generator - Generator: finished at 
> 2012-11-09 07:21:51, elapsed: 01:46:36
> 
> ===================== CrawlDb statistics
> CrawlDb statistics start: ./crawldb
> Statistics for CrawlDb: ./crawldb
> TOTAL urls:3052412
> retry 0:3047404
> retry 1:338
> retry 2:1192
> retry 3:822
> retry 4:336
> retry 5:2320
> min score:0.0
> avg score:0.015368268
> max score:48.608
> status 1 (db_unfetched):2813249
> status 2 (db_fetched):196717
> status 3 (db_gone):14204
> status 4 (db_redir_temp):10679
> status 5 (db_redir_perm):17563
> CrawlDb statistics: done
> 
> ===================== System info
> Memory: 4 GB
> CPUs: Intel® Core™ i3-2310M CPU @ 2.10GHz × 4 
> Available diskspace: 171.7 GB
> OS: Release 12.10 (quantal) 64-bit
> 
> 
> Thanks,
> Mohammad
> 
