Hi, Houssam:
What's the error in your Pig log file? I was trying to reproduce it with
1,000 rows and 500 columns:
A = load 'random.txt' using PigStorage(':') as
(f1:double,f2:double,.........,f500:double);
B = group A all;
D = foreach B generate group,COR(A.$0,A.$1,A.$2,A.$3,.......A.$499);
dump D;

The exception in the pig log file is:

Backend error message
---------------------
Error: java.lang.OutOfMemoryError: *GC overhead limit exceeded*
    at java.lang.Double.valueOf(Double.java:492)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
    at org.apache.pig.builtin.COR.combine(COR.java:258)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
    at org.apache.pig.backend.hadoop.executionengine.physi

Backend error message
---------------------
Error: java.lang.OutOfMemoryError: Java heap space
    at java.lang.Double.valueOf(Double.java:492)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
    at org.apache.pig.builtin.COR.combine(COR.java:258)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.ex

Backend error message
---------------------
Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.ArrayList.<init>(ArrayList.java:112)
    at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:67)
    at org.apache.pig.data.BinSedesTuple.<init>(BinSedesTuple.java:67)
    at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:142)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
    at org.apache.pig.builtin.COR.combine(COR.java:258)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
    at org.apache.pig.builtin.COR$Inte

Backend error message
---------------------
Error: java.lang.OutOfMemoryError: Java heap space
    at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:142)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
    at org.apache.pig.builtin.COR.combine(COR.java:258)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
    at org.apache.pig.backend.hadoop.executionengin

Error message from task (map) task_201302211102_0561_m_000000
-------------------------------------------------------------
ERROR 6016: Out of memory.

org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of memory.
    at java.lang.Double.valueOf(Double.java:492)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
    at org.apache.pig.builtin.COR.combine(COR.java:258)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    ... 13 more
================================================================================

Error message from task (map) task_201302211102_0561_m_000000
-------------------------------------------------------------
ERROR 6016: Out of memory.

org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of memory.
    at java.lang.Double.valueOf(Double.java:492)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
    at org.apache.pig.builtin.COR.combine(COR.java:258)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
Caused by: java.lang.OutOfMemoryError: Java heap space
    ... 13 more
================================================================================

Error message from task (map) task_201302211102_0561_m_000000
-------------------------------------------------------------
ERROR 6016: Out of memory.

org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of memory.
    at java.util.ArrayList.<init>(ArrayList.java:112)
    at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:67)
    at org.apache.pig.data.BinSedesTuple.<init>(BinSedesTuple.java:67)
    at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:142)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
    at org.apache.pig.builtin.COR.combine(COR.java:258)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    ... 13 more
================================================================================

Error message from task (map) task_201302211102_0561_m_000000
-------------------------------------------------------------
ERROR 6016: Out of memory.

org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of memory.
    at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:142)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
    at org.apache.pig.builtin.COR.combine(COR.java:258)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
Caused by: java.lang.OutOfMemoryError: Java heap space
    ... 12 more
================================================================================

Pig Stack Trace
---------------
ERROR 6016: Out of memory.

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias D. Backend error : Out of memory.
    at org.apache.pig.PigServer.openIterator(PigServer.java:826)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
    at org.apache.pig.Main.run(Main.java:538)
    at org.apache.pig.Main.main(Main.java:157)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of memory.
    at java.lang.Double.valueOf(Double.java:492)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
    at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
    at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
    at org.apache.pig.builtin.COR.combine(COR.java:258)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
    at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    ... 13 more
================================================================================

"GC overhead limit exceeded" means the JVM is spending too large a percentage of its time on garbage collection while recovering too little memory. The check is designed to keep an application from running for an extended period while making little or no progress because the heap is too small. I tried to disable it with "export PIG_OPTS=-D-XX:-UseGCOverheadLimit" to avoid the "GC overhead limit exceeded" error. That gets a bit further, but the job still fails in the end, and the error is still thrown in one place. I will see if I can profile the memory usage; no clue so far. (A sketch of the task-side settings I mean is below the quoted message.)

Johnny

On Thu, Feb 21, 2013 at 11:39 AM, Houssam H. <[email protected]> wrote:

> Hi,
>
> I have a file with a few hundred columns of doubles, and I am interested
> in creating a correlation matrix for the columns:
>
> A = load 'myData' using PigStorage(':');
> B = group A all;
> D = foreach B generate group,COR(A.$0,A.$1,A.$2);
>
> For N parameters, the COR function will generate N(N-1)/2 correlations.
> This is fine as long as N is less than 100: COR(A.$0,A.$1, .... A.$100);
> However, once N is more than 100 or 200 I get an out-of-memory error (of
> course this depends on the amount of RAM you have):
>
> 883 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 6016: Out of memory.
> 893 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
>
> My file is less than 50 MB, so Pig is running the whole time with only
> one mapper.
>
> The behavior is the same whether I run the script locally (pig -x local)
> or on Amazon Elastic MapReduce with multiple instances assigned to the job.
>
> Is there a way to run the correlation function for a large number of
> parameters?
>
> Thank you in advance!
>
> -Houssam
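Since the failures above come from the map tasks ("Error message from task (map) ..."), and as far as I know PIG_OPTS only reaches the JVM that runs the Pig client, the task JVMs would need their flags from the job configuration instead. A minimal sketch of what could be tried from the script itself, assuming a Hadoop 1.x-style cluster (the -Xmx value and the choice of 0.1 are guesses, not verified fixes):

-- sketch only: give each task JVM more heap and relax the GC overhead check
set mapred.child.java.opts '-Xmx2048m -XX:-UseGCOverheadLimit';
-- sketch only: let Pig's cached bags (InternalCachedBag shows up in the traces) spill to disk sooner
set pig.cachedbag.memusage '0.1';
-- ... then the same load / group all / COR script as above ...

Whether that is enough for 500 columns is another question, since with N columns COR tracks on the order of N(N-1)/2 pairs, which is roughly 125,000 pairs at N = 500.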

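For the original question of correlating a few hundred columns, one direction that might keep each COR call small (an untested sketch, not something confirmed in this thread, building on the observation that around 100 columns per call still works) is to split the columns into blocks and issue one COR call per pair of blocks:

B = group A all;
-- blocks of 50 columns; each call covers two blocks, i.e. at most 100 columns
C_0_1 = foreach B generate group, COR(A.$0, A.$1, ..... A.$99);                      -- block 0 with block 1
C_0_2 = foreach B generate group, COR(A.$0, ..... A.$49, A.$100, ..... A.$149);      -- block 0 with block 2
-- ... and so on for every pair of blocks; within-block pairs appear in several calls
-- and would need de-duplicating when stitching the full matrix back together
store C_0_1 into 'cor_block_0_1';
store C_0_2 into 'cor_block_0_2';

This multiplies the number of statements (for 500 columns, ten blocks give 45 block pairs), so it is only worth it if the memory settings alone do not get the single big COR call through.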