Hi all

I've crawled a site down to the last 30 or so pages, which still have status
db_unfetched. Many of those are problem pages (bad URLs and such), but a few
have valuable content.

$ nutch readdb crawldb/ -stats

CrawlDb statistics start: crawldb/
Statistics for CrawlDb: crawldb/
TOTAL urls:     10117
retry 0:        10081
retry 1:        29
retry 2:        2
retry 3:        5
min score:      0.0
avg score:      3.0750223E-4
max score:      1.092
status 1 (db_unfetched):        31
status 2 (db_fetched):  9688
status 3 (db_gone):     159
status 4 (db_redir_temp):       3
status 5 (db_redir_perm):       82
status 7 (db_duplicate):        154
CrawlDb statistics: done
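
In case it's useful, this is roughly how I've been listing the unfetched
entries (a crawldb dump; I believe recent 1.x releases support a -status
filter on -dump, though I may be misremembering the exact option, and the
output directory name "unfetched_dump" is just an arbitrary choice):

$ nutch readdb crawldb/ -dump unfetched_dump -status db_unfetched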


The problem is that when I run another iteration of the crawl with the
nutch/bin/crawl script (the one packaged with the Nutch install), it only
fetches two pages at a time. I have tried setting the generate -topN
parameter to -1 and also removing that parameter altogether.
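
For reference, the -topN variant I tried was essentially the generate command
shown below with an explicit limit added (abbreviated here, without the -D
options):

$ nutch generate ./crawldb ./segments -topN -1 -numFetchers 1 -noFilter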

Generating a new segment
nutch/bin/nutch generate -D mapred.reduce.tasks=2 \
  -D mapred.child.java.opts=-Xmx1000m \
  -D mapred.reduce.tasks.speculative.execution=false \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.compress.map.output=true \
  ./crawldb ./segments -numFetchers 1 -noFilter
Generator: starting at 2015-12-08 16:58:15
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: Partitioning selected urls for politeness.
Generator: segment: segments/20151208165817
Generator: finished at 2015-12-08 16:58:18, elapsed: 00:00:03
Operating on segment : 20151208165817
Fetching : 20151208165817
nutch/bin/nutch fetch -D mapred.reduce.tasks=2 \
  -D mapred.child.java.opts=-Xmx1000m \
  -D mapred.reduce.tasks.speculative.execution=false \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.compress.map.output=true \
  -D fetcher.timelimit.mins=-1 \
  ./segments/20151208165817 -noParsing -threads 50
Fetcher: starting at 2015-12-08 16:58:19
Fetcher: segment: segments/20151208165817
Using queue mode : byHost
Fetcher: threads: 50
Fetcher: time-out divisor: 2
QueueFeeder finished: total 2 records + hit by time limit :0
Using queue mode : byHost

Does anyone know of another setting that would override this behavior and
let Nutch fetch the remaining 30 or so URLs more than two at a time?
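
The only generator-related properties I've found so far are the two below
(in nutch-site.xml / nutch-default.xml), but as far as I can tell they should
already default to "unlimited", so I'm not sure they are the cause; the
values shown are just what I'd expect the defaults to be:

<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>Maximum number of URLs in a single fetchlist; -1 means unlimited.</description>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Unit used when counting URLs against generate.max.count.</description>
</property>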

Thanks for any help.

Scott
