Hi, Houssam:
can you try making your HDFS block size smaller, and also run 'SET
pig.noSplitCombination true;' in Pig, so that the number of mappers will
be equal to the number of file blocks?
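
For example (a rough sketch; the 32 MB block size and file names are just
illustrative values, not tuned ones):

hadoop fs -D dfs.block.size=33554432 -put random.txt random_small.txt

and then at the top of your Pig script:

SET pig.noSplitCombination true;
A = load 'random_small.txt' using PigStorage(':');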

The OOM seems to happen in the COR function when it tries to combine
different data chunks together in the mapper, so more mappers may help. I
will try it when I get a cluster to play with.

Johnny


On Fri, Feb 22, 2013 at 2:18 PM, Johnny Zhang <[email protected]> wrote:

> Hi, Houssam:
> What's the error in your pig log file? I was trying to reproduce it with
> 1000 rows and 500 columns:
> A = load 'random.txt' using PigStorage(':') as
> (f1:double,f2:double,.........,f500:double);
> B = group A all;
> D = foreach B generate group,COR(A.$0,A.$1,A.$2,A.$3,.......A.$499);
> dump D;
>
> The exception in pig log file is
> Backend error message
> ---------------------
> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
> at java.lang.Double.valueOf(Double.java:492)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>  at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>  at
> org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
> at org.apache.pig.builtin.COR.combine(COR.java:258)
>  at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
> at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
>  at org.apache.pig.backend.hadoop.executionengine.physi
>
> Backend error message
> ---------------------
> Error: java.lang.OutOfMemoryError: Java heap space
>  at java.lang.Double.valueOf(Double.java:492)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>  at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> at
> org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
>  at org.apache.pig.builtin.COR.combine(COR.java:258)
> at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
>  at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.ex
>
> Backend error message
> ---------------------
> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
> at java.util.ArrayList.<init>(ArrayList.java:112)
>  at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:67)
> at org.apache.pig.data.BinSedesTuple.<init>(BinSedesTuple.java:67)
>  at
> org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
> at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:142)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>  at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> at
> org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
>  at org.apache.pig.builtin.COR.combine(COR.java:258)
> at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
>  at org.apache.pig.builtin.COR$Inte
>
> Backend error message
> ---------------------
> Error: java.lang.OutOfMemoryError: Java heap space
>  at
> org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
> at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:142)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>  at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> at
> org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
>  at org.apache.pig.builtin.COR.combine(COR.java:258)
> at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
>  at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
>  at org.apache.pig.backend.hadoop.executionengin
>
> Error message from task (map) task_201302211102_0561_m_000000
> -------------------------------------------------------------
>  ERROR 6016: Out of memory.
>
> org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of
> memory.
> at java.lang.Double.valueOf(Double.java:492)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>  at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>  at
> org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
> at org.apache.pig.builtin.COR.combine(COR.java:258)
>  at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
> at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> ... 13 more
>
> ================================================================================
> Error message from task (map) task_201302211102_0561_m_000000
> -------------------------------------------------------------
> ERROR 6016: Out of memory.
>
> org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of
> memory.
>  at java.lang.Double.valueOf(Double.java:492)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>  at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> at
> org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
>  at org.apache.pig.builtin.COR.combine(COR.java:258)
> at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
>  at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
> Caused by: java.lang.OutOfMemoryError: Java heap space
> ... 13 more
>
> ================================================================================
> Error message from task (map) task_201302211102_0561_m_000000
> -------------------------------------------------------------
>  ERROR 6016: Out of memory.
>
> org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of
> memory.
> at java.util.ArrayList.<init>(ArrayList.java:112)
>  at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:67)
> at org.apache.pig.data.BinSedesTuple.<init>(BinSedesTuple.java:67)
>  at
> org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
> at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:142)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>  at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> at
> org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
>  at org.apache.pig.builtin.COR.combine(COR.java:258)
> at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> ... 13 more
>
> ================================================================================
> Error message from task (map) task_201302211102_0561_m_000000
> -------------------------------------------------------------
> ERROR 6016: Out of memory.
>
> org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of
> memory.
>  at
> org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
> at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:142)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>  at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> at
> org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
>  at org.apache.pig.builtin.COR.combine(COR.java:258)
> at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
>  at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
> at
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
> Caused by: java.lang.OutOfMemoryError: Java heap space
> ... 12 more
>
> ================================================================================
> Pig Stack Trace
> ---------------
> ERROR 6016: Out of memory.
>
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to
> open iterator for alias D. Backend error : Out of memory.
>  at org.apache.pig.PigServer.openIterator(PigServer.java:826)
> at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
>  at
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
> at
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
>  at
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
> at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
>  at org.apache.pig.Main.run(Main.java:538)
> at org.apache.pig.Main.main(Main.java:157)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR
> 6016: Out of memory.
> at java.lang.Double.valueOf(Double.java:492)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
>  at
> org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
>  at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> at
> org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
>  at org.apache.pig.builtin.COR.combine(COR.java:258)
> at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
>  at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> ... 13 more
>
> ================================================================================
>
>
>
> "GC overhead limit exceeded" means too much percentage of the time is
> spent on GC, and too less percentage is recovered. This feature is designed
> to prevent applications from running an extended period of time while
> making little or no progress because the heap is too small.
>
> I tried to disable this in Java with "export
> PIG_OPTS=-XX:-UseGCOverheadLimit" to avoid "GC overhead limit exceeded".
> It gets further, but still fails in the end, and the error still gets
> thrown in one place. I will see if I can profile the memory usage. No clue
> so far.
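>
> Note that PIG_OPTS only affects the Pig client JVM, while these OOMs are
> thrown inside the map tasks; to reach the task JVMs, the flag probably
> needs to go through the job configuration instead, something like this (a
> sketch; the 1 GB heap is just an example value):
>
> SET mapred.child.java.opts '-Xmx1024m -XX:-UseGCOverheadLimit';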
>
> Johnny
>
>
>
>
> On Thu, Feb 21, 2013 at 11:39 AM, Houssam H. <[email protected]> wrote:
>
>> Hi,
>>
>> I have a file with a few hundred columns of doubles, and I am
>> interested in creating a correlation matrix for the columns:
>>
>> A = load 'myData' using PigStorage(':');
>> B = group A all;
>> D = foreach B generate group,COR(A.$0,A.$1,A.$2);
>>
>> For N parameters, the COR function will generate N(N-1)/2 correlations.
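>> (With N = 500, for example, that is already 500*499/2 = 124,750
>> correlations.)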
>> This is fine as long as N is less than 100: COR(A.$0,A.$1, .... A.$100);
>> however, once N is more than 100 or 200, I get an out-of-memory error (the
>> exact threshold of course depends on the amount of RAM you have):
>>
>> 883 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR
>> 6016: Out of memory.
>> 893 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map
>> reduce job(s) failed!
>>
>> My file is less than 50 MB, so Pig always runs with only one mapper.
>>
>> The behavior was the same whether I ran the script locally (pig -x
>> local) or on Amazon Elastic MapReduce with multiple instances assigned
>> to the job.
>>
>> Is there a way to run the correlation function for a large number of
>> parameters?
>>
>> Thank you in advance!
>>
>> -Houssam
>>
>
>
