Hi guys,
I'm having trouble finishing jobs: they run smoothly on a smaller
dataset, but always fail at 99% when I run the job on the whole
set.
I can see a few killed map and a few killed reduce tasks, but also
quite a lot of failed reduce tasks that all show the same exception at
the end.
Here is what I have in the logs:
2010-12-08 08:44:56,127 INFO org.apache.hadoop.mapred.ReduceTask:
Ignoring obsolete output of KILLED map-task:
'attempt_201012080810_0003_m_000009_1'
2010-12-08 08:45:08,152 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201012080810_0003_r_000000_0: Got 1 new map-outputs
2010-12-08 08:45:13,103 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201012080810_0003_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup
hosts)
2010-12-08 08:45:13,241 INFO org.apache.hadoop.mapred.ReduceTask: header:
attempt_201012080810_0003_m_000003_0, compressed len: 3488519, decompressed
len: 3488515
2010-12-08 08:45:13,241 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling
3488515 bytes (3488519 raw bytes) into RAM from
attempt_201012080810_0003_m_000003_0
2010-12-08 08:45:13,348 INFO org.apache.pig.impl.util.SpillableMemoryManager:
low memory handler called (Collection threshold exceeded) init = 5439488(5312K)
used = 78403496(76565K) committed = 101908480(99520K) max = 139853824(136576K)
2010-12-08 08:45:13,404 INFO org.apache.hadoop.mapred.ReduceTask: Read 3488515
bytes from map-output for attempt_201012080810_0003_m_000003_0
2010-12-08 08:45:13,405 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from
attempt_201012080810_0003_m_000003_0 -> (142, 21) from
ip-10-98-71-195.ec2.internal
2010-12-08 08:45:14,241 INFO org.apache.hadoop.mapred.ReduceTask:
GetMapEventsThread exiting
2010-12-08 08:45:14,241 INFO org.apache.hadoop.mapred.ReduceTask:
getMapsEventsThread joined.
2010-12-08 08:45:14,242 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram
manager
2010-12-08 08:45:14,253 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved
on-disk merge complete: 2 files left.
2010-12-08 08:45:14,254 INFO org.apache.hadoop.mapred.ReduceTask: In-memory
merge complete: 64 files left.
2010-12-08 08:45:14,312 INFO org.apache.hadoop.mapred.Merger: Merging 64 sorted
segments
2010-12-08 08:45:14,313 INFO org.apache.hadoop.mapred.Merger: Down to the last
merge-pass, with 64 segments left of total size: 82947024 bytes
2010-12-08 08:45:15,389 INFO org.apache.hadoop.mapred.ReduceTask: Merged 64
segments, 82947024 bytes to disk to satisfy reduce memory limit
2010-12-08 08:45:15,390 INFO org.apache.hadoop.mapred.ReduceTask: Merging 3
files, 214514578 bytes from disk
2010-12-08 08:45:15,392 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0
segments, 0 bytes from memory into reduce
2010-12-08 08:45:15,392 INFO org.apache.hadoop.mapred.Merger: Merging 3 sorted
segments
2010-12-08 08:45:15,397 INFO org.apache.hadoop.mapred.Merger: Down to the last
merge-pass, with 3 segments left of total size: 214514566 bytes
2010-12-08 08:45:15,489 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader:
Loaded native gpl library
2010-12-08 08:45:15,522 INFO com.hadoop.compression.lzo.LzoCodec: Successfully
loaded & initialized native-lzo library [hadoop-lzo rev
3e7c9dcf0ea0acbde146cb22b236978b344c5525]
2010-12-08 08:45:15,530 INFO
com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader
created.
2010-12-08 08:45:15,534 INFO
com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader
created.
2010-12-08 08:45:15,544 INFO
com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader
created.
2010-12-08 08:45:15,562 INFO
com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader
created.
2010-12-08 08:45:15,564 INFO
com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader
created.
2010-12-08 08:45:15,568 INFO
com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader
created.
2010-12-08 08:45:37,233 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink
10.98.99.197:50010
2010-12-08 08:45:37,235 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_8615551403563164366_3938
2010-12-08 08:45:43,251 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink
10.98.99.197:50010
2010-12-08 08:45:43,251 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_4074920756844442310_4023
2010-12-08 08:45:49,282 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink
10.100.226.63:50010
2010-12-08 08:45:49,282 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_-681320892856427804_4034
2010-12-08 08:45:55,292 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink
10.99.26.80:50010
2010-12-08 08:45:55,292 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_-6999793088579291779_4039
2010-12-08 08:46:01,294 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
Exception: java.io.IOException: Unable to create new block.
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2812)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2076)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2262)
2010-12-08 08:46:01,294 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
for block blk_-6999793088579291779_4039 bad datanode[1] nodes == null
2010-12-08 08:46:01,294 WARN org.apache.hadoop.hdfs.DFSClient: Could not get
block locations. Source file
"/tmp/temp664356070/tmp-973959386/_temporary/_attempt_201012080810_0003_r_000000_0/winrar/output/extlink/2010-04/2010-04-00000"
- Aborting...
2010-12-08 08:46:01,656 WARN org.apache.hadoop.mapred.TaskTracker: Error
running child
org.apache.pig.backend.executionengine.ExecException: ERROR 2135: Received
error from store function.Bad connect ack with firstBadLink 10.99.26.80:50010
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:140)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.runPipeline(POSplit.java:254)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.processPlan(POSplit.java:236)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.getNext(POSplit.java:228)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.runPipeline(POSplit.java:254)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.processPlan(POSplit.java:236)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.getNext(POSplit.java:228)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:423)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:391)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:371)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:239)
at
org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.IOException: Bad connect ack with firstBadLink
10.99.26.80:50010
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2870)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2793)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2076)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2262)
2010-12-08 08:46:01,659 INFO org.apache.hadoop.mapred.TaskRunner: Runnning
cleanup for the task
Any idea what this problem is and how to make Hadoop actually finish
on larger datasets?
(I have to admit I have a lot of rather small files (<30 MB) as input,
but this appears to happen when the reducer tries to write its
result.)
Regards,
Johannes
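For what it's worth, here is the kind of connectivity check I can run from a task node against one of the datanodes the log blames (I'm assuming the default HDFS data-transfer port 50010, which matches the `firstBadLink` addresses in the log; exit status 0 would mean the port accepted a TCP connection):

```shell
# Probe the HDFS data-transfer port on the datanode the log names as
# firstBadLink; a refusal or timeout here would point at a firewall /
# EC2 security-group problem rather than at Pig itself.
nc -z -w 5 10.99.26.80 50010
echo "exit status: $?"
```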