Hi James - use the -noFilter and -noNorm switches on the generate job and you'll get your
first big performance improvement.
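Concretely, assuming the generate invocation quoted below, that just means appending the two switches (they skip URL filtering and normalization during generation, which are typically the expensive steps):

```shell
bin/nutch generate \
  -D mapred.child.java.opts=-Xmx1000m \
  -D mapred.map.tasks.speculative=false \
  -D mapred.reduce.tasks.speculative=false \
  -D mapred.map.output.compress=true \
  -Dgenerate.max.count=10000 \
  -D mapred.reduce.tasks=100 \
  crawls-blargh/crawldb crawls-blargh/segments \
  -numFetchers 19 -noFilter -noNorm
```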

M.
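P.S. On your "data:x+y" question: nothing magic. That is the task's MapReduce input split, printed as path:startOffset+length, both in bytes. A quick illustrative sketch (the path and the helper name here are made up for illustration, not part of Nutch or Hadoop):

```python
# Parse a MapReduce task status string of the form "path:start+length",
# e.g. "hdfs://host/part-00000/data:268435456+134217728".
def parse_split(status):
    # The offsets follow the last ':' (the one in "hdfs://" comes earlier).
    path, _, span = status.rpartition(":")
    start, _, length = span.partition("+")
    return path, int(start), int(length)

path, start, length = parse_split(
    "hdfs://host/part-00000/data:268435456+134217728"
)
# start = 268435456 bytes = 256 MB offset; length = 134217728 bytes = 128 MB,
# i.e. this task is reading the third 128 MB chunk of that file.
```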

 
 
-----Original message-----
> From: James Mardell <[email protected]>
> Sent: Wednesday 22nd June 2016 17:18
> To: [email protected]
> Subject: Nutch generate slowdown
> 
> We currently run a Nutch instance across a 7-node Hadoop cluster (280
> threads). Our generate job used to take about an hour; it now takes
> ~3 hours with no configuration changes.
> 
> When the generate job is run, 350 out of 400 tasks take 10–20 minutes
> to complete. The remaining 50 then take >90 minutes. Inspecting the
> tasks, there are no blatant exceptions or suchlike, however:
> * The "File System Counters" for these 50 tasks show a count of zero
> for "FILE: Number of bytes read" unlike the other 350 tasks which have
> normal looking counts.
> * The status for these long tasks reads
> "hdfs://production/user/ubuntu/crawls-blargh/crawldb/current/part-{{n}}/data:268435456+134217728"
> where {{n}} is 0–49. What are these "data:x+y" numbers? Offsets?
> Magic?
> 
> Any advice on how to further diagnose this slowdown would be appreciated.
> 
> Our generate command is:
> 
> bin/nutch generate \
>   -D mapred.child.java.opts=-Xmx1000m \
>   -D mapred.map.tasks.speculative=false \
>   -D mapred.reduce.tasks.speculative=false \
>   -D mapred.map.output.compress=true \
>   -Dgenerate.max.count=10000 \
>   -D mapred.reduce.tasks=100 \
>   crawls-blargh/crawldb crawls-blargh/segments \
>   -numFetchers 19
> 
> Thanks,
> 
> Dr. James Mardell
> Developer
> [email protected]
> 
> Arachnys — Instant worldwide due diligence
> 
