We currently run a Nutch instance across a 7-node Hadoop cluster (280
threads). Our generate job used to take an hour to run; it now takes
~3 hours, with no configuration changes.

When the generate job is run, 350 out of 400 tasks take 10–20 minutes
to complete. The remaining 50 then take >90 minutes. Inspecting the
tasks shows no blatant exceptions or similar errors, however:
* The "File System Counters" for these 50 tasks show a count of zero
for "FILE: Number of bytes read", unlike the other 350 tasks, which
have normal-looking counts.
* The status for these long tasks reads
"hdfs://production/user/ubuntu/crawls-blargh/crawldb/current/part-{{n}}/data:268435456+134217728"
where {{n}} is 0–49. What are these "data:x+y" numbers? Offsets?
Magic?
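If they are byte offsets, the two numbers would decode as a 128 MB split starting 256 MB into each part file (this is a guess based on the `path:start+length` pattern; `decode_split` below is just a throwaway helper, not a Hadoop API):

```python
def decode_split(status):
    """Split a 'path:start+length' status string into its parts."""
    path, _, span = status.rpartition(":")   # last ':' separates path from numbers
    start, _, length = span.partition("+")
    return path, int(start), int(length)

path, start, length = decode_split(
    "hdfs://production/user/ubuntu/crawls-blargh/crawldb/current/"
    "part-00000/data:268435456+134217728"
)
print(start // 2**20, "MB offset,", length // 2**20, "MB length")
# → 256 MB offset, 128 MB length
```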

Any advice on how to further diagnose this slowdown would be appreciated.

Our generate command is:
bin/nutch generate \
  -D mapred.child.java.opts=-Xmx1000m \
  -D mapred.map.tasks.speculative.execution=false \
  -D mapred.reduce.tasks.speculative.execution=false \
  -D mapred.compress.map.output=true \
  -D generate.max.count=10000 \
  -D mapred.reduce.tasks=100 \
  crawls-blargh/crawldb crawls-blargh/segments -numFetchers 19

Thanks,

Dr. James Mardell
Developer
[email protected]

Arachnys — Instant worldwide due diligence
