Hi James - use the -noFilter and -noNorm switches and you'll get your first massive performance improvement.
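For context, a minimal sketch of the amended invocation, assuming the same cluster settings as in the original command; the flag names (`-noFilter`, `-noNorm`) follow the Nutch 1.x Generator usage string, which skips URL filtering and normalization during the selection pass:

```shell
# Same generate job, with per-URL filtering and normalization disabled
# during selection (both are re-applied later stages anyway in most setups)
bin/nutch generate \
  -D mapred.child.java.opts=-Xmx1000m \
  -D mapred.map.tasks.speculative=false \
  -D mapred.reduce.tasks.speculative=false \
  -D mapred.map.output.compress=true \
  -D generate.max.count=10000 \
  -D mapred.reduce.tasks=100 \
  crawls-blargh/crawldb crawls-blargh/segments \
  -numFetchers 19 \
  -noFilter -noNorm
```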
M.

-----Original message-----
> From: James Mardell <[email protected]>
> Sent: Wednesday 22nd June 2016 17:18
> To: [email protected]
> Subject: Nutch generate slowdown
>
> We currently run a Nutch instance across a 7 node Hadoop cluster (280
> threads). Our generate job used to take an hour to run; now it takes
> ~3 hours with no configuration changes.
>
> When the generate job is run, 350 out of 400 tasks take 10–20 minutes
> to complete. The remaining 50 then take >90 minutes. Inspecting the
> tasks, there are no blatant exceptions or suchlike; however:
> * The "File System Counters" for these 50 tasks show a count of zero
> for "FILE: Number of bytes read", unlike the other 350 tasks, which have
> normal-looking counts.
> * The status for these long tasks reads
> "hdfs://production/user/ubuntu/crawls-blargh/crawldb/current/part-{{n}}/data:268435456+134217728"
> where {{n}} is 0–49. What are these "data:x+y" numbers? Offsets?
> Magic?
>
> Any advice on how to further diagnose this slowdown would be appreciated.
>
> Our generate command is:
> bin/nutch generate -D mapred.child.java.opts=-Xmx1000m -D
> mapred.map.tasks.speculative=false -D
> mapred.reduce.tasks.speculative=false -D
> mapred.map.output.compress=true -Dgenerate.max.count=10000 -D
> mapred.reduce.tasks=100 crawls-blargh/crawldb crawls-blargh/segments
> -numFetchers 19
>
> Thanks,
>
> Dr. James Mardell
> Developer
> [email protected]
>
> Arachnys — Instant worldwide due diligence
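On the "data:x+y" question: those status strings have the shape of Hadoop's `FileSplit.toString()` output, `path:start+length`, i.e. a byte offset into the file plus the split's length in bytes. A quick sketch parsing the quoted status string (the `parse_split` helper is ours, just for illustration):

```python
def parse_split(status):
    """Split a Hadoop-style 'path:start+length' descriptor into its parts."""
    # rpartition so the colon in the hdfs:// scheme is not mistaken
    # for the offset separator
    path, _, rest = status.rpartition(":")
    start, _, length = rest.partition("+")
    return path, int(start), int(length)

status = ("hdfs://production/user/ubuntu/crawls-blargh/crawldb/current/"
          "part-00000/data:268435456+134217728")
path, start, length = parse_split(status)
# 268435456 bytes = 256 MiB offset, 134217728 bytes = 128 MiB split length
print(start // 2**20, length // 2**20)  # -> 256 128
```

So each of those 50 tasks is reading a 128 MiB block of a `part-{{n}}/data` file, starting 256 MiB in; the numbers are offsets, not magic.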

