Thanks, Sebastian. I'll take a look at that setting. I had set up my Nutch
regex-urlfilter.txt file to limit crawling to URLs from a single site, i.e.:

+^http://wiki.apache.org/*

hoping to restrict fetching to pages from one host. I'll try the settings you
suggested.

Scott
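A minimal single-host regex-urlfilter.txt might look like the sketch below; the actual file used in this crawl isn't shown in the thread, so the escaped-dot pattern and the trailing catch-all reject are assumptions based on the stock filter syntax (rules are evaluated top to bottom, first match wins):

# hypothetical regex-urlfilter.txt restricting the crawl to one host
# accept anything under wiki.apache.org (note the escaped dots)
+^https?://wiki\.apache\.org/
# reject every other URL
-.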
On Wed, Dec 9, 2015 at 1:51 PM, Sebastian Nagel <[email protected]> wrote:
> Hi,
>
> > only crawls 2 URLs at a time
>
> Sounds like the site has pages from two different hosts
> (by URL). There are a couple of properties to adjust
> the load on a single host. Have a look at conf/nutch-default.xml,
> the property "fetcher.threads.per.queue" and the properties
> nearby.
>
> Cheers,
> Sebastian
>
>
> On 12/09/2015 01:32 AM, Jeffery, Scott wrote:
> > Hi all
> >
> > I've crawled a site down to the last 30 pages with status (db_unfetched).
> > Many of those are problem pages (bad URLs and such) but a few have
> > valuable content.
> >
> > $ nutch readdb crawldb/ -stats
> >
> > CrawlDb statistics start: crawldb/
> > Statistics for CrawlDb: crawldb/
> > TOTAL urls: 10117
> > retry 0: 10081
> > retry 1: 29
> > retry 2: 2
> > retry 3: 5
> > min score: 0.0
> > avg score: 3.0750223E-4
> > max score: 1.092
> > status 1 (db_unfetched): 31
> > status 2 (db_fetched): 9688
> > status 3 (db_gone): 159
> > status 4 (db_redir_temp): 3
> > status 5 (db_redir_perm): 82
> > status 7 (db_duplicate): 154
> > CrawlDb statistics: done
> >
> > The problem is: when I run another iteration of crawling with the
> > nutch/bin/crawl script (which was packaged with the nutch install) it
> > only crawls two pages at a time.. I have tried setting the generate
> > -topN parameter to -1 and removing that parameter all together.
> >
> > Generating a new segment
> > nutch/bin/nutch generate -D mapred.reduce.tasks=2 -D
> > mapred.child.java.opts=-Xmx1000m -D
> > mapred.reduce.tasks.speculative.execution=false -D
> > mapred.map.tasks.speculative.execution=false -D
> > mapred.compress.map.output=true ./crawldb ./segments -numFetchers 1
> > -noFilter
> > Generator: starting at 2015-12-08 16:58:15
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: false
> > Generator: normalizing: true
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: segments/20151208165817
> > Generator: finished at 2015-12-08 16:58:18, elapsed: 00:00:03
> > Operating on segment : 20151208165817
> > Fetching : 20151208165817
> > nutch/bin/nutch fetch -D mapred.reduce.tasks=2 -D
> > mapred.child.java.opts=-Xmx1000m -D
> > mapred.reduce.tasks.speculative.execution=false -D
> > mapred.map.tasks.speculative.execution=false -D
> > mapred.compress.map.output=true -D fetcher.timelimit.mins=-1
> > ./segments/20151208165817 -noParsing -threads 50
> > Fetcher: starting at 2015-12-08 16:58:19
> > Fetcher: segment: segments/20151208165817
> > Using queue mode : byHost
> > Fetcher: threads: 50
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 2 records + hit by time limit :0
> > Using queue mode : byHost
> >
> > Anyone know of another setting that would allow me to override this
> > behavior and force Nutch to fetch the last 30 URLs at more than two at
> > a time?
> >
> > Thanks for any help.
> >
> > Scott
> >
> >
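The property Sebastian mentions governs per-host politeness: by default Nutch fetches each host queue with a single thread and pauses fetcher.server.delay seconds between requests, so once a crawl has narrowed to one or two hosts the overall -threads 50 setting has little effect. A nutch-site.xml sketch of the kind of override he points at, with purely illustrative values, might be:

<!-- nutch-site.xml sketch: hypothetical per-host politeness overrides;
     the values below are illustrative, not recommendations -->
<configuration>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>5</value>
    <description>Threads allowed to fetch from the same host queue
    (default 1).</description>
  </property>
  <property>
    <name>fetcher.server.min.delay</name>
    <value>0.5</value>
    <description>Minimum delay in seconds between requests to one host
    when fetcher.threads.per.queue is greater than 1.</description>
  </property>
</configuration>

If the bin/crawl script doesn't expose these as options, nutch-site.xml is the usual place for the override; when running bin/nutch fetch by hand they can also be passed on the command line with -D (e.g. -D fetcher.threads.per.queue=5), the same way -D fetcher.timelimit.mins=-1 appears above. Raising them increases the load placed on the single remaining host, so modest values are safer.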

