It looks like there is a bug that needs fixing. Can you open a JIRA with the details? Please include the io.sort.mb setting, the -Xmx value for the map tasks, and information about any UDFs you are using.

As a workaround, you can spawn more map tasks - turn split combination off and specify a smaller split size.
For example -
-Dpig.splitCombination=false -Dmapred.max.split.size=33554432
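A full invocation might look like this (the script name is a placeholder):

```shell
# Hypothetical example - "myscript.pig" stands in for your actual script.
# Disables Pig's input-split combination and caps each split at 32 MB
# (33554432 = 32 * 1024 * 1024), so more, smaller map tasks are spawned.
pig -Dpig.splitCombination=false \
    -Dmapred.max.split.size=33554432 \
    myscript.pig
```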
Let me know if the workaround works for you.

-Thejas


On 10/11/11 7:47 AM, Shubham Chopra wrote:
Hi Thejas,

I am using 0.9. What I see is that the POForeach member and the myPlans ArrayList of PODemux seem to be keeping two deep copies of the same set of databags.

Thanks,
Shubham.

On Mon, Oct 10, 2011 at 6:57 PM, Thejas Nair <[email protected]> wrote:

What version of Pig are you using? You might want to try 0.9.1.
This sounds like the issue described in https://issues.apache.org/jira/browse/PIG-1815 .

Thanks,
Thejas


On 10/10/11 2:22 PM, Shubham Chopra wrote:

The job I am trying to run performs some projections and aggregations. I see that the maps continuously fail with an OOM with the following stack trace:

Error: java.lang.OutOfMemoryError: Java heap space
        at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:69)
        at org.apache.pig.data.BinSedesTuple.<init>(BinSedesTuple.java:82)
        at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
        at org.apache.pig.data.BinInterSedes.readTuple(BinInterSedes.java:109)
        at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:270)
        at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:251)
        at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:556)
        at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:64)
        at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
        at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:141)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:238)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:171)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:162)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1265)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)


An analysis of the heap dump showed that, apart from the io.sort buffer, the remaining memory was being consumed almost in its entirety by org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux (predominantly by an ArrayList, and a POForeach).

Should the combiner usage be causing this high memory consumption? Is there any way to make the combiner run more frequently and aggregate the data more aggressively? The data I am using reduces by a factor of at least 1:10 after the combiner step and is neatly partitioned to maximize the effectiveness of the combiner.
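The 1:10 reduction described above can be illustrated with a toy partial-aggregation loop. This is plain Java, not Pig or Hadoop internals; the class and method names are made up for the sketch:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of what a MapReduce combiner does: partially aggregate
// map output per key before it is spilled and shipped to the reducers.
// (Illustrative only; real combiners run inside the MapTask spill path.)
public class CombinerSketch {

    static Map<String, Long> combine(List<Map.Entry<String, Long>> mapOutput) {
        Map<String, Long> partial = new HashMap<>();
        for (Map.Entry<String, Long> rec : mapOutput) {
            // Sum values that share a key, collapsing many records into one.
            partial.merge(rec.getKey(), rec.getValue(), Long::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        // 1000 map-output records spread over only 10 distinct keys.
        List<Map.Entry<String, Long>> mapOutput = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            mapOutput.add(Map.entry("key" + (i % 10), 1L));
        }
        Map<String, Long> combined = combine(mapOutput);
        // 1000 records collapse to 10 partial sums - the kind of
        // 1:10-or-better reduction described in the email above.
        System.out.println(mapOutput.size() + " -> " + combined.size());
    }
}
```

The point of the sketch is that well-partitioned keys let the combiner shrink each spill dramatically; the memory question in the email is about when that combining happens, not whether it works.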

Thanks,
Shubham.




