The job I am trying to run performs some projections and aggregations. I see
that map tasks consistently fail with an OutOfMemoryError, with the following
stack trace:
Error: java.lang.OutOfMemoryError: Java heap space
        at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:69)
        at org.apache.pig.data.BinSedesTuple.<init>(BinSedesTuple.java:82)
        at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
        at org.apache.pig.data.BinInterSedes.readTuple(BinInterSedes.java:109)
        at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:270)
        at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:251)
        at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:556)
        at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:64)
        at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
        at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:141)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:238)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:171)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:162)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1265)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)
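For reference, the script is essentially of the following shape (relation and field names here are placeholders, not the actual job; the real script projects a few columns and then aggregates per key with algebraic functions, so the combiner is invoked):

```
-- sketch only: project a subset of columns, then aggregate per key
raw = LOAD 'input' AS (key:chararray, a:long, b:long, extra:chararray);
projected = FOREACH raw GENERATE key, a, b;   -- projection drops unused columns
grouped = GROUP projected BY key;
-- SUM is algebraic, so Pig plans a combiner for this aggregation
aggregated = FOREACH grouped GENERATE group, SUM(projected.a), SUM(projected.b);
STORE aggregated INTO 'output';
```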
An analysis of the heap dump showed that, apart from the io.sort buffer, the
remaining memory was being consumed almost in its entirety by
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux
(predominantly by an ArrayList and a POForeach).
Should combiner usage be causing this high memory consumption? Is there any
way to make the combiner run more frequently and aggregate the data more
aggressively? The data I am using shrinks by a factor of at least 10 after the
combiner step, and it is partitioned so as to maximize the combiner's
effectiveness.
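For context, these are the knobs I have been experimenting with so far; the values below are just what I am currently trying, not recommendations:

```
# map-side sort/spill and Pig memory settings (Hadoop 1.x / Pig property names)
io.sort.mb=256                   # size of the map output sort buffer
io.sort.spill.percent=0.80       # buffer fill threshold that triggers a spill
pig.cachedbag.memusage=0.1       # fraction of heap Pig may use for cached bags
mapred.child.java.opts=-Xmx1024m # heap given to each map/reduce child JVM
```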
Thanks,
Shubham.