The job I am trying to run performs some projections and aggregations. I see
that the map tasks consistently fail with an OOM, with the following stack trace:

Error: java.lang.OutOfMemoryError: Java heap space
        at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:69)
        at org.apache.pig.data.BinSedesTuple.<init>(BinSedesTuple.java:82)
        at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
        at org.apache.pig.data.BinInterSedes.readTuple(BinInterSedes.java:109)
        at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:270)
        at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:251)
        at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:556)
        at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:64)
        at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
        at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:141)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:238)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:171)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:162)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1265)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)


An analysis of the heap dump showed that, apart from the map-side sort buffer,
the remaining memory was being consumed almost entirely by
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux
(predominantly by an ArrayList, and a POForeach).

Should the combiner usage be causing this high memory consumption? Is there
any way to make the combiner run more frequently and aggregate the data more
aggressively? My data shrinks by a factor of at least 10 after the combiner
step, and it is already partitioned so as to maximize the combiner's
effectiveness.
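
For reference, these are the spill/combiner-related knobs I have been looking
at so far (property names assume Hadoop 1.x-era mapred configuration and may
differ on newer releases; please correct me if I have any of them wrong):

```properties
# Map-side sort buffer size in MB; the combiner runs on each spill of
# this buffer.
io.sort.mb=100
# Fraction of the sort buffer that, once filled, triggers a spill (and
# hence a combine pass).
io.sort.spill.percent=0.80
# Run the combiner again during the merge of spill files once there are
# at least this many spills.
min.num.spills.for.combine=3
# Pig-specific: disable the combiner entirely (useful for comparison runs).
pig.exec.nocombiner=false
```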

Thanks,
Shubham.
