Hi,

> only crawls 2 URLs at a time

Sounds like the remaining pages all come from two different hosts
(by URL). The fetcher queues URLs by host, and by default only one
thread may work on a queue at a time, so two hosts means at most
two fetches in parallel. There are a couple of properties to adjust
the load on a single host: have a look at conf/nutch-default.xml,
the property "fetcher.threads.per.queue" and the properties
nearby.
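
For example, to allow five threads per host, override the property
in conf/nutch-site.xml; the value 5 is only an illustration, pick
what the target server can tolerate:

  <property>
    <name>fetcher.threads.per.queue</name>
    <value>5</value>
  </property>

Note that, per the property's description in nutch-default.xml,
values above 1 make the fetcher wait fetcher.server.min.delay
between requests to the same host instead of fetcher.server.delay.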

Cheers,
Sebastian


On 12/09/2015 01:32 AM, Jeffery, Scott wrote:
> Hi all
> 
> I've crawled a site down to the last 30 pages with status db_unfetched.
> Many of those are problem pages (bad URLs and such), but a few have
> valuable content.
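> 
> (For reference, the remaining unfetched URLs can be listed with a
> crawldb dump; the -status filter is available in recent 1.x releases,
> and the output directory name here is just a placeholder:)
> 
> $ nutch readdb crawldb/ -dump unfetched/ -status db_unfetched
> $ cat unfetched/part-*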
> 
> $ nutch readdb crawldb/ -stats
> 
> CrawlDb statistics start: crawldb/
> Statistics for CrawlDb: crawldb/
> TOTAL urls:     10117
> retry 0:        10081
> retry 1:        29
> retry 2:        2
> retry 3:        5
> min score:      0.0
> avg score:      3.0750223E-4
> max score:      1.092
> status 1 (db_unfetched):        31
> status 2 (db_fetched):  9688
> status 3 (db_gone):     159
> status 4 (db_redir_temp):       3
> status 5 (db_redir_perm):       82
> status 7 (db_duplicate):        154
> CrawlDb statistics: done
> 
> 
> The problem is: when I run another iteration of crawling with the
> nutch/bin/crawl script (which was packaged with the Nutch install), it
> only crawls two pages at a time. I have tried setting the generate -topN
> parameter to -1 and removing that parameter altogether.
> 
> Generating a new segment
> nutch/bin/nutch generate -D mapred.reduce.tasks=2 -D
> mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true ./crawldb ./segments -numFetchers 1
> -noFilter
> Generator: starting at 2015-12-08 16:58:15
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: false
> Generator: normalizing: true
> Generator: Partitioning selected urls for politeness.
> Generator: segment: segments/20151208165817
> Generator: finished at 2015-12-08 16:58:18, elapsed: 00:00:03
> Operating on segment : 20151208165817
> Fetching : 20151208165817
> nutch/bin/nutch fetch -D mapred.reduce.tasks=2 -D
> mapred.child.java.opts=-Xmx1000m -D
> mapred.reduce.tasks.speculative.execution=false -D
> mapred.map.tasks.speculative.execution=false -D
> mapred.compress.map.output=true -D fetcher.timelimit.mins=-1
> ./segments/20151208165817 -noParsing -threads 50
> Fetcher: starting at 2015-12-08 16:58:19
> Fetcher: segment: segments/20151208165817
> Using queue mode : byHost
> Fetcher: threads: 50
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 2 records + hit by time limit :0
> Using queue mode : byHost
> 
> Does anyone know of another setting that would let me override this
> behavior and force Nutch to fetch the last 30 URLs more than two at a
> time?
> 
> Thanks for any help.
> 
> Scott
> 
