Greetings Nutchlings,
I would like to make my generate jobs go faster, and I see that the reducer 
spills a lot of records.
Here are the counters for a typical long-running reduce task of the 
generate-select job: 100 million spilled records, 255K input records, 90K 
output records, 13 GB of file bytes written, but only 3 GB of committed heap 
usage. mapreduce.reduce.java.opts is -Xmx8000m and mapreduce.reduce.memory.mb is 12000.
Do I need to increase mapreduce.reduce.java.opts and 
mapreduce.reduce.memory.mb? If so, how do I work out how large they should be? 
And are there other settings I should be changing?
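For the heap sizing, my understanding (please correct me if I have this wrong) 
is that -Xmx is usually set to roughly 75-80% of mapreduce.reduce.memory.mb, 
so with my current container size that would be something like:

  0.8 * 12000 MB = 9600 MB
  -D mapreduce.reduce.memory.mb=12000
  -D mapreduce.reduce.java.opts=-Xmx9600m

but I don't know whether that alone would do much about the spilling.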
My actual command line is:

  apache-nutch-1.12/runtime/deploy/bin/nutch generate \
    -D mapreduce.job.reduces=16 \
    -D mapreduce.input.fileinputformat.split.minsize=536870912 \
    -D mapreduce.reduce.memory.mb=12000 \
    -D mapreduce.reduce.java.opts=-Xmx8000m \
    -D db.fetch.interval.default=5184000 \
    -D db.fetch.schedule.adaptive.min_interval=3888000 \
    -D generate.update.crawldb=true \
    -D generate.max.count=25 \
    /crawls/popular/data/crawldb /crawls/popular/data/segments/ \
    -topN 60000 -numFetchers 2 -noFilter -maxNumSegments 24
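
Since the spilling happens on the reduce side, I was also wondering whether 
the shuffle/merge buffers are worth tuning. Something along these lines is 
what I had in mind; the values are only guesses on my part, not something 
I have tested:

  # merge more streams per pass, so fewer intermediate merge rounds on disk
  -D mapreduce.task.io.sort.factor=100 \
  # give more of the reducer heap to the shuffle buffer
  -D mapreduce.reduce.shuffle.input.buffer.percent=0.8 \
  # keep part of the shuffled map output in memory during the reduce phase
  -D mapreduce.reduce.input.buffer.percent=0.5

Would any of these make a difference here, or are the spills mostly driven by 
the on-disk merge regardless?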


