Thanks, Sebastian. I'll take a look at that setting. I had set up my
Nutch regex-urlfilter.txt file to limit the crawl to URLs from one site, i.e.:

+^http://wiki.apache.org/*

hoping to restrict fetching to pages from a single host.
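
For reference, I think the filter I actually wanted is closer to the
following (the escaped dots and the final reject-all rule are my guesses at
the usual regex-urlfilter.txt convention, so please correct me if I'm wrong):

# accept only pages on wiki.apache.org
+^http://wiki\.apache\.org/
# reject everything else
-.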

I'll try the settings you suggested.
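
In case it's useful to anyone else reading the archive, here is roughly what
I plan to put in conf/nutch-site.xml to raise the per-host limit (the
property names are from nutch-default.xml; the values are just guesses for
my setup, not recommendations):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fetcher.threads.per.queue</name>
    <!-- allow more than one concurrent fetch thread per host queue -->
    <value>5</value>
  </property>
  <property>
    <name>fetcher.server.min.delay</name>
    <!-- minimum delay in seconds between requests to the same host
         when more than one thread per queue is used -->
    <value>1.0</value>
  </property>
</configuration>

I'm not sure whether the bin/crawl script passes extra -D properties through
to the fetch job, so nutch-site.xml seems like the safer place to set them.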

Scott

On Wed, Dec 9, 2015 at 1:51 PM, Sebastian Nagel <[email protected]>
wrote:

> Hi,
>
> > only crawls 2 URLs at a time
>
> Sounds like the site has pages from two different hosts
> (by URL). There are a couple of properties to adjust
> the load on a single host. Have a look at conf/nutch-default.xml,
> the property "fetcher.threads.per.queue" and the properties
> nearby.
>
> Cheers,
> Sebastian
>
>
> On 12/09/2015 01:32 AM, Jeffery, Scott wrote:
> > Hi all
> >
> > I've crawled a site down to the last 30 pages with status (db_unfetched).
> > Many of those are problem pages (bad URLs and such) but a few have valuable
> > content.
> >
> > $ nutch readdb crawldb/ -stats
> >
> > CrawlDb statistics start: crawldb/
> > Statistics for CrawlDb: crawldb/
> > TOTAL urls:     10117
> > retry 0:        10081
> > retry 1:        29
> > retry 2:        2
> > retry 3:        5
> > min score:      0.0
> > avg score:      3.0750223E-4
> > max score:      1.092
> > status 1 (db_unfetched):        31
> > status 2 (db_fetched):  9688
> > status 3 (db_gone):     159
> > status 4 (db_redir_temp):       3
> > status 5 (db_redir_perm):       82
> > status 7 (db_duplicate):        154
> > CrawlDb statistics: done
> >
> >
> > The problem is: when I run another iteration of crawling with the
> > nutch/bin/crawl script (which was packaged with the nutch install) it only
> > crawls two pages at a time. I have tried setting the generate -topN
> > parameter to -1 and removing that parameter altogether.
> >
> > Generating a new segment
> > nutch/bin/nutch generate -D mapred.reduce.tasks=2 -D
> > mapred.child.java.opts=-Xmx1000m -D
> > mapred.reduce.tasks.speculative.execution=false -D
> > mapred.map.tasks.speculative.execution=false -D
> > mapred.compress.map.output=true ./crawldb ./segments -numFetchers 1
> > -noFilter
> > Generator: starting at 2015-12-08 16:58:15
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: false
> > Generator: normalizing: true
> > Generator: Partitioning selected urls for politeness.
> > Generator: segment: segments/20151208165817
> > Generator: finished at 2015-12-08 16:58:18, elapsed: 00:00:03
> > Operating on segment : 20151208165817
> > Fetching : 20151208165817
> > nutch/bin/nutch fetch -D mapred.reduce.tasks=2 -D
> > mapred.child.java.opts=-Xmx1000m -D
> > mapred.reduce.tasks.speculative.execution=false -D
> > mapred.map.tasks.speculative.execution=false -D
> > mapred.compress.map.output=true -D fetcher.timelimit.mins=-1
> > ./segments/20151208165817 -noParsing -threads 50
> > Fetcher: starting at 2015-12-08 16:58:19
> > Fetcher: segment: segments/20151208165817
> > Using queue mode : byHost
> > Fetcher: threads: 50
> > Fetcher: time-out divisor: 2
> > QueueFeeder finished: total 2 records + hit by time limit :0
> > Using queue mode : byHost
> >
> > Anyone know of another setting that would allow me to override this
> > behavior and force Nutch to fetch the last 30 URLs more than two at a time?
> >
> > Thanks for any help.
> >
> > Scott
> >
>
>
