Hi Thejas, I am using 0.9. What I see is that the POForeach member and the myPlans ArrayList of PODemux seem to be keeping two deep copies of the same set of databags.
Thanks,
Shubham.

On Mon, Oct 10, 2011 at 6:57 PM, Thejas Nair <[email protected]> wrote:

> What version of pig are you using? You might want to try 0.9.1. This
> sounds like the issue described in
> https://issues.apache.org/jira/browse/PIG-1815 .
>
> Thanks,
> Thejas
>
> On 10/10/11 2:22 PM, Shubham Chopra wrote:
>
>> The job I am trying to run performs some projections and aggregations.
>> I see that maps continuously fail with an OOM with the following stack
>> trace:
>>
>> Error: java.lang.OutOfMemoryError: Java heap space
>>     at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:69)
>>     at org.apache.pig.data.BinSedesTuple.<init>(BinSedesTuple.java:82)
>>     at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
>>     at org.apache.pig.data.BinInterSedes.readTuple(BinInterSedes.java:109)
>>     at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:270)
>>     at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:251)
>>     at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:556)
>>     at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:64)
>>     at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
>>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>>     at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
>>     at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)
>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:141)
>>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMultiQueryPackage.getNext(POMultiQueryPackage.java:238)
>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:171)
>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:162)
>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
>>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>>     at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1265)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)
>>
>> An analysis of the heap dump showed that, apart from the io sort
>> buffer, the remaining memory was being consumed almost in its entirety
>> by org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.PODemux
>> (predominantly by an ArrayList and a POForeach).
>>
>> Should the combiner usage be causing this high memory consumption? Is
>> there any way to make the combiner run more frequently and aggregate
>> the data more aggressively? The data I am using reduces by a factor of
>> at least 1:10 after the combiner step and is neatly partitioned to
>> maximize the effectiveness of the combiner.
>>
>> Thanks,
>> Shubham.
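[Editor's note: for readers hitting the same error, the two knobs discussed in this thread can be exercised from the command line. This is a sketch only, assuming Pig 0.9.x on a Hadoop 0.20-era cluster; `myscript.pig` is a placeholder for your own script, and you should verify the property names against the Pig and Hadoop versions you actually run.]

```shell
# First, confirm the combiner is the culprit by turning it off entirely
# (pig.exec.nocombiner is a standard Pig property):
pig -Dpig.exec.nocombiner=true myscript.pig

# Alternatively, keep the combiner but make map-side spills smaller and
# more frequent, so each combiner invocation handles less data
# (io.sort.mb / io.sort.spill.percent / min.num.spills.for.combine are
# Hadoop 0.20/1.x map-output buffer settings):
pig -Dio.sort.mb=100 \
    -Dio.sort.spill.percent=0.60 \
    -Dmin.num.spills.for.combine=1 \
    myscript.pig
```

Shrinking the sort buffer or lowering the spill threshold trades extra spill/merge I/O for a smaller per-combiner working set, which is usually the right trade when the combiner itself is what runs out of heap.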
