Hi all, I've crawled a site down to the last ~30 pages with status db_unfetched. Many of those are problem pages (bad URLs and such), but a few have valuable content.
Here are the current CrawlDb stats:

    $ nutch readdb crawldb/ -stats
    CrawlDb statistics start: crawldb/
    Statistics for CrawlDb: crawldb/
    TOTAL urls:                 10117
    retry 0:                    10081
    retry 1:                    29
    retry 2:                    2
    retry 3:                    5
    min score:                  0.0
    avg score:                  3.0750223E-4
    max score:                  1.092
    status 1 (db_unfetched):    31
    status 2 (db_fetched):      9688
    status 3 (db_gone):         159
    status 4 (db_redir_temp):   3
    status 5 (db_redir_perm):   82
    status 7 (db_duplicate):    154
    CrawlDb statistics: done

The problem: when I run another iteration of the crawl with the nutch/bin/crawl script that shipped with the Nutch install, it only crawls two pages at a time. I have tried setting the generator's -topN parameter to -1, and also removing the parameter altogether.

Here is the generate step from the crawl script's output:

    Generating a new segment
    nutch/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true ./crawldb ./segments -numFetchers 1 -noFilter
    Generator: starting at 2015-12-08 16:58:15
    Generator: Selecting best-scoring urls due for fetch.
    Generator: filtering: false
    Generator: normalizing: true
    Generator: Partitioning selected urls for politeness.
    Generator: segment: segments/20151208165817
    Generator: finished at 2015-12-08 16:58:18, elapsed: 00:00:03

And the fetch step:

    Operating on segment : 20151208165817
    Fetching : 20151208165817
    nutch/bin/nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=-1 ./segments/20151208165817 -noParsing -threads 50
    Fetcher: starting at 2015-12-08 16:58:19
    Fetcher: segment: segments/20151208165817
    Using queue mode : byHost
    Fetcher: threads: 50
    Fetcher: time-out divisor: 2
    QueueFeeder finished: total 2 records + hit by time limit :0
    Using queue mode : byHost

Note the "QueueFeeder finished: total 2 records" line: only two URLs make it into each segment. Does anyone know of another setting that would let me override this behavior and force Nutch to fetch the remaining ~30 URLs more than two at a time?
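In case it's useful, this is roughly how I've been pulling the unfetched URLs out of the CrawlDb to see which of them actually matter (dump_unfetched is just a scratch directory name I made up, and part-00000 assumes a local, single-reducer run):

    # Dump the whole CrawlDb as CSV, one record per line
    nutch/bin/nutch readdb ./crawldb -dump dump_unfetched -format csv

    # Filter for the unfetched entries
    grep db_unfetched dump_unfetched/part-00000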
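One detail in the generator log stands out to me: "Selecting best-scoring urls due for fetch." Given the retry counts in the stats, I wonder whether most of the leftover URLs have next-fetch times pushed into the future and simply aren't due yet. Would the generator's -adddays option be the right way to test that? Something like the sketch below (the 30 is an arbitrary value I picked, not a tested one):

    # Sketch: treat URLs as due even if their fetch time is up to 30 days away
    nutch/bin/nutch generate ./crawldb ./segments -numFetchers 1 -noFilter -adddays 30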
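For completeness, I also wanted to rule out a per-host cap on generation. My understanding is that generate.max.count (counted per generate.count.mode, "host" by default) defaults to -1, i.e. unlimited; this is how I'd check whether my config overrides it:

    # Look for a per-host/domain cap on generated URLs in my config
    grep -A 1 generate.max.count nutch/conf/nutch-site.xml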
Thanks for any help.

Scott