Hi guys,
I'm having trouble finishing jobs: they run smoothly on a smaller
dataset, but always fail at 99% when I run the job on the whole
set.
I can see a few killed map and a few killed reduce tasks, but also
quite a lot of failed reduce tasks that all show the same exception at
the end.
Here is what I have in the logs:
2010-12-08 08:44:56,127 INFO org.apache.hadoop.mapred.ReduceTask:
Ignoring obsolete output of KILLED map-task:
'attempt_201012080810_0003_m_000009_1'
2010-12-08 08:45:08,152 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201012080810_0003_r_000000_0: Got 1 new map-outputs
2010-12-08 08:45:13,103 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201012080810_0003_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup
hosts)
2010-12-08 08:45:13,241 INFO org.apache.hadoop.mapred.ReduceTask: header:
attempt_201012080810_0003_m_000003_0, compressed len: 3488519, decompressed
len: 3488515
2010-12-08 08:45:13,241 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling
3488515 bytes (3488519 raw bytes) into RAM from
attempt_201012080810_0003_m_000003_0
2010-12-08 08:45:13,348 INFO org.apache.pig.impl.util.SpillableMemoryManager:
low memory handler called (Collection threshold exceeded) init = 5439488(5312K)
used = 78403496(76565K) committed = 101908480(99520K) max = 139853824(136576K)
2010-12-08 08:45:13,404 INFO org.apache.hadoop.mapred.ReduceTask: Read 3488515
bytes from map-output for attempt_201012080810_0003_m_000003_0
2010-12-08 08:45:13,405 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from
attempt_201012080810_0003_m_000003_0 -> (142, 21) from
ip-10-98-71-195.ec2.internal
2010-12-08 08:45:14,241 INFO org.apache.hadoop.mapred.ReduceTask:
GetMapEventsThread exiting
2010-12-08 08:45:14,241 INFO org.apache.hadoop.mapred.ReduceTask:
getMapsEventsThread joined.
2010-12-08 08:45:14,242 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram
manager
2010-12-08 08:45:14,253 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved
on-disk merge complete: 2 files left.
2010-12-08 08:45:14,254 INFO org.apache.hadoop.mapred.ReduceTask: In-memory
merge complete: 64 files left.
2010-12-08 08:45:14,312 INFO org.apache.hadoop.mapred.Merger: Merging 64 sorted
segments
2010-12-08 08:45:14,313 INFO org.apache.hadoop.mapred.Merger: Down to the last
merge-pass, with 64 segments left of total size: 82947024 bytes
2010-12-08 08:45:15,389 INFO org.apache.hadoop.mapred.ReduceTask: Merged 64
segments, 82947024 bytes to disk to satisfy reduce memory limit
2010-12-08 08:45:15,390 INFO org.apache.hadoop.mapred.ReduceTask: Merging 3
files, 214514578 bytes from disk
2010-12-08 08:45:15,392 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0
segments, 0 bytes from memory into reduce
2010-12-08 08:45:15,392 INFO org.apache.hadoop.mapred.Merger: Merging 3 sorted
segments
2010-12-08 08:45:15,397 INFO org.apache.hadoop.mapred.Merger: Down to the last
merge-pass, with 3 segments left of total size: 214514566 bytes
2010-12-08 08:45:15,489 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader:
Loaded native gpl library
2010-12-08 08:45:15,522 INFO com.hadoop.compression.lzo.LzoCodec: Successfully
loaded & initialized native-lzo library [hadoop-lzo rev
3e7c9dcf0ea0acbde146cb22b236978b344c5525]
2010-12-08 08:45:15,530 INFO
com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader
created.
2010-12-08 08:45:15,534 INFO
com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader
created.
2010-12-08 08:45:15,544 INFO
com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader
created.
2010-12-08 08:45:15,562 INFO
com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader
created.
2010-12-08 08:45:15,564 INFO
com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader
created.
2010-12-08 08:45:15,568 INFO
com.twitter.elephantbird.pig.load.LzoBaseRegexLoader: LzoBaseRegexLoader
created.
2010-12-08 08:45:37,233 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink
10.98.99.197:50010
2010-12-08 08:45:37,235 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_8615551403563164366_3938
2010-12-08 08:45:43,251 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink
10.98.99.197:50010
2010-12-08 08:45:43,251 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_4074920756844442310_4023
2010-12-08 08:45:49,282 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink
10.100.226.63:50010
2010-12-08 08:45:49,282 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_-681320892856427804_4034
2010-12-08 08:45:55,292 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink
10.99.26.80:50010
2010-12-08 08:45:55,292 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block
blk_-6999793088579291779_4039
2010-12-08 08:46:01,294 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
Exception: java.io.IOException: Unable to create new block.
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2812)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2076)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2262)
2010-12-08 08:46:01,294 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
for block blk_-6999793088579291779_4039 bad datanode[1] nodes == null
2010-12-08 08:46:01,294 WARN org.apache.hadoop.hdfs.DFSClient: Could not get
block locations. Source file
"/tmp/temp664356070/tmp-973959386/_temporary/_attempt_201012080810_0003_r_000000_0/winrar/output/extlink/2010-04/2010-04-00000"
- Aborting...
2010-12-08 08:46:01,656 WARN org.apache.hadoop.mapred.TaskTracker: Error
running child
org.apache.pig.backend.executionengine.ExecException: ERROR 2135: Received
error from store function.Bad connect ack with firstBadLink 10.99.26.80:50010
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getNext(POStore.java:140)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.runPipeline(POSplit.java:254)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.processPlan(POSplit.java:236)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.getNext(POSplit.java:228)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.runPipeline(POSplit.java:254)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.processPlan(POSplit.java:236)
at
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSplit.getNext(POSplit.java:228)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:423)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:391)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:371)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:239)
at
org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.IOException: Bad connect ack with firstBadLink
10.99.26.80:50010
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2870)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2793)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2076)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2262)
2010-12-08 08:46:01,659 INFO org.apache.hadoop.mapred.TaskRunner: Runnning
cleanup for the task
Any idea what this problem is and how to make Hadoop actually finish
on larger datasets?
(I have to admit I have a lot of rather small files (<30 MB) as input,
but this appears to happen when the reducer tries to write its
result.)
Regards,
Johannes
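For what it's worth, here is the kind of connectivity check I can run from a task node against one of the datanodes the log blames (I'm assuming the default HDFS data-transfer port 50010, which matches the `firstBadLink` addresses in the log; exit status 0 would mean the port accepted a TCP connection):

```shell
# Probe the HDFS data-transfer port on the datanode the log names as
# firstBadLink; a refusal or timeout here would point at a firewall /
# EC2 security-group problem rather than at Pig itself.
nc -z -w 5 10.99.26.80 50010
echo "exit status: $?"
```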