Hi,

> only crawls 2 URLs at a time

Sounds like the remaining pages of the site come from just two different
hosts (by URL). The fetcher queues URLs by host ("Using queue mode : byHost"
in your log) and by default fetches from each host queue with a single
thread, so two hosts means two fetches at a time. There are a couple of
properties to adjust the load put on a single host. Have a look at
conf/nutch-default.xml, at the property "fetcher.threads.per.queue" and the
properties nearby.
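For example, something like the following in conf/nutch-site.xml might help
(the property names are from nutch-default.xml, the values below are only an
illustration, pick whatever the site can tolerate):

  <property>
    <name>fetcher.threads.per.queue</name>
    <!-- default is 1: one fetch at a time per host queue -->
    <value>5</value>
  </property>

  <property>
    <name>fetcher.server.min.delay</name>
    <!-- delay in seconds between requests to the same host,
         used when fetcher.threads.per.queue is greater than 1 -->
    <value>1.0</value>
  </property>

To double-check that the unfetched URLs really belong to only two hosts, the
per-host breakdown from

  nutch readdb crawldb/ -stats -sort

is also worth a look.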
Cheers,
Sebastian

On 12/09/2015 01:32 AM, Jeffery, Scott wrote:
> Hi all
>
> I've crawled a site down to the last 30 pages with status (db_unfetched).
> Many of those are problem pages (bad URLs and such) but a few have
> valuable content.
>
> $ nutch readdb crawldb/ -stats
>
> CrawlDb statistics start: crawldb/
> Statistics for CrawlDb: crawldb/
> TOTAL urls: 10117
> retry 0: 10081
> retry 1: 29
> retry 2: 2
> retry 3: 5
> min score: 0.0
> avg score: 3.0750223E-4
> max score: 1.092
> status 1 (db_unfetched): 31
> status 2 (db_fetched): 9688
> status 3 (db_gone): 159
> status 4 (db_redir_temp): 3
> status 5 (db_redir_perm): 82
> status 7 (db_duplicate): 154
> CrawlDb statistics: done
>
> The problem is: when I run another iteration of crawling with the
> nutch/bin/crawl script (which was packaged with the Nutch install), it
> only crawls two pages at a time. I have tried setting the generate -topN
> parameter to -1 and removing that parameter altogether.
>
> Generating a new segment
> nutch/bin/nutch generate -D mapred.reduce.tasks=2 -D
> mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true ./crawldb ./segments -numFetchers 1
> -noFilter
> Generator: starting at 2015-12-08 16:58:15
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: false
> Generator: normalizing: true
> Generator: Partitioning selected urls for politeness.
> Generator: segment: segments/20151208165817
> Generator: finished at 2015-12-08 16:58:18, elapsed: 00:00:03
> Operating on segment : 20151208165817
> Fetching : 20151208165817
> nutch/bin/nutch fetch -D mapred.reduce.tasks=2 -D
> mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true -D fetcher.timelimit.mins=-1
> ./segments/20151208165817 -noParsing -threads 50
> Fetcher: starting at 2015-12-08 16:58:19
> Fetcher: segment: segments/20151208165817
> Using queue mode : byHost
> Fetcher: threads: 50
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 2 records + hit by time limit :0
> Using queue mode : byHost
>
> Anyone know of another setting that would allow me to override this
> behavior and force Nutch to fetch the last 30 URLs more than two at a
> time?
>
> Thanks for any help.
>
> Scott

