Hi Johnny,

Thank you for your help.
Yes indeed, setting mapred.min.split.size to 1 or 10 MB greatly increased
the number of mappers, and that made the job complete successfully.
For the reducers, however, we can only have as many reducers as running
machines (by setting default_parallel), and this is a huge bottleneck.
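
For anyone hitting the same issue later, the top of our script now looks
roughly like this (a sketch rather than our exact script; the values are
simply what worked for us on a 3-node cluster):

SET mapred.min.split.size 1;      -- the split-size setting discussed above
SET pig.noSplitCombination true;  -- one mapper per input split (Johnny's tip below)
SET default_parallel 3;           -- reducers; capped by our 3 machines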
As a comparison benchmark: the correlation matrix for 300 columns and 10k
rows on 3 AWS high-memory extra-large instances was computed in 9 minutes.
The same calculation was done using MATLAB on a laptop in 0.1 seconds.
I know that is an unfair comparison, since correlation calculation lends
itself to vectorization and MATLAB was reading its data from RAM, but it
goes to show that Hadoop is not a solution for every problem ;)
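
(For the curious: once the columns are standardized to zero mean and unit
variance into an n-row matrix Z, the whole correlation matrix is a single
matrix product, R = Z'Z / (n - 1), which is exactly the kind of operation
MATLAB hands straight to optimized BLAS routines.)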

-Houssam.

On Sat, Feb 23, 2013 at 4:10 AM, Johnny Zhang <[email protected]> wrote:

> Hi, Houssam:
> I think the workaround above works: increase the number of mappers (the
> two steps mentioned in my last email). I just verified it by running the
> same query against 1 mapper, with 500 columns but only a few rows, and it
> passed. I take that to mean that if you can increase the number of mappers
> enough that each mapper takes fewer rows, the mappers can survive the COR
> calculation for a huge number of columns.
>
> In other words, if each mapper doesn't get many rows, it can survive a
> huge number of columns. There may be a point where the column count is so
> huge that even with one row of data per mapper it would still crash. I
> haven't tested that limit yet, but I think it is much larger than 500.
> Hope this is helpful.
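>
> As a back-of-the-envelope sketch (the numbers here are made up for
> illustration): with 1000 rows split across 50 mappers, each mapper only
> buffers the intermediate tuples for about 20 rows instead of all 1000,
> so the memory pressure in COR's combine step drops by the same factor.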
>
> Johnny
>
>
> On Fri, Feb 22, 2013 at 3:04 PM, Johnny Zhang <[email protected]> wrote:
>
> > Hi, Houssam:
> > can you try making your HDFS block size smaller and also 'SET
> > pig.noSplitCombination true;' in Pig? (so that the number of mappers
> > will be equal to the number of file blocks)
> >
> > The OOM seems to happen in the COR function when it is trying to
> > combine different data chunks together in the mapper. So more mappers
> > may help. I will try it when I get a cluster to play with.
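> >
> > For the block size, something like this should rewrite the file with
> > smaller blocks (just a sketch, assuming the dfs.block.size job property
> > is honored when the store writes the file; 10485760 is an arbitrary
> > ~10 MB example):
> >
> > SET dfs.block.size 10485760;  -- ~10 MB blocks for files this job writes
> > A = load 'random.txt' using PigStorage(':');
> > store A into 'random_smallblocks' using PigStorage(':');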
> >
> > Johnny
> >
> >
> > On Fri, Feb 22, 2013 at 2:18 PM, Johnny Zhang <[email protected]> wrote:
> >
> >> Hi, Houssam:
> >> What's the error in your pig log file? I was trying to reproduce it
> >> with 1000 rows, 500 columns.
> >> A = load 'random.txt' using PigStorage(':') as
> >> (f1:double,f2:double,.........,f500:double);
> >> B = group A all;
> >> D = foreach B generate group,COR(A.$0,A.$1,A.$2,A.$3,.......A.$499);
> >> dump D;
> >>
> >> The exception in the pig log file is:
> >> Backend error message
> >> ---------------------
> >> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >>   at java.lang.Double.valueOf(Double.java:492)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
> >>   at org.apache.pig.builtin.COR.combine(COR.java:258)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
> >>   at org.apache.pig.backend.hadoop.executionengine.physi
> >>
> >> Backend error message
> >> ---------------------
> >> Error: java.lang.OutOfMemoryError: Java heap space
> >>   at java.lang.Double.valueOf(Double.java:492)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
> >>   at org.apache.pig.builtin.COR.combine(COR.java:258)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
> >>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.ex
> >>
> >> Backend error message
> >> ---------------------
> >> Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >>   at java.util.ArrayList.<init>(ArrayList.java:112)
> >>   at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:67)
> >>   at org.apache.pig.data.BinSedesTuple.<init>(BinSedesTuple.java:67)
> >>   at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:142)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
> >>   at org.apache.pig.builtin.COR.combine(COR.java:258)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
> >>   at org.apache.pig.builtin.COR$Inte
> >>
> >> Backend error message
> >> ---------------------
> >> Error: java.lang.OutOfMemoryError: Java heap space
> >>   at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:142)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
> >>   at org.apache.pig.builtin.COR.combine(COR.java:258)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
> >>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
> >>   at org.apache.pig.backend.hadoop.executionengin
> >>
> >> Error message from task (map) task_201302211102_0561_m_000000
> >> -------------------------------------------------------------
> >> ERROR 6016: Out of memory.
> >>
> >> org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of memory.
> >>   at java.lang.Double.valueOf(Double.java:492)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
> >>   at org.apache.pig.builtin.COR.combine(COR.java:258)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
> >> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >>   ... 13 more
> >>
> >> ================================================================================
> >> Error message from task (map) task_201302211102_0561_m_000000
> >> -------------------------------------------------------------
> >> ERROR 6016: Out of memory.
> >>
> >> org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of memory.
> >>   at java.lang.Double.valueOf(Double.java:492)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
> >>   at org.apache.pig.builtin.COR.combine(COR.java:258)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
> >> Caused by: java.lang.OutOfMemoryError: Java heap space
> >>   ... 13 more
> >>
> >> ================================================================================
> >> Error message from task (map) task_201302211102_0561_m_000000
> >> -------------------------------------------------------------
> >> ERROR 6016: Out of memory.
> >>
> >> org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of memory.
> >>   at java.util.ArrayList.<init>(ArrayList.java:112)
> >>   at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:67)
> >>   at org.apache.pig.data.BinSedesTuple.<init>(BinSedesTuple.java:67)
> >>   at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:142)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
> >>   at org.apache.pig.builtin.COR.combine(COR.java:258)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
> >> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >>   ... 13 more
> >>
> >> ================================================================================
> >> Error message from task (map) task_201302211102_0561_m_000000
> >> -------------------------------------------------------------
> >> ERROR 6016: Out of memory.
> >>
> >> org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of memory.
> >>   at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:38)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:142)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
> >>   at org.apache.pig.builtin.COR.combine(COR.java:258)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
> >>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
> >> Caused by: java.lang.OutOfMemoryError: Java heap space
> >>   ... 12 more
> >>
> >> ================================================================================
> >> Pig Stack Trace
> >> ---------------
> >> ERROR 6016: Out of memory.
> >>
> >> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias D. Backend error : Out of memory.
> >>   at org.apache.pig.PigServer.openIterator(PigServer.java:826)
> >>   at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:696)
> >>   at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:320)
> >>   at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:194)
> >>   at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:170)
> >>   at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
> >>   at org.apache.pig.Main.run(Main.java:538)
> >>   at org.apache.pig.Main.main(Main.java:157)
> >>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>   at java.lang.reflect.Method.invoke(Method.java:597)
> >>   at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
> >> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6016: Out of memory.
> >>   at java.lang.Double.valueOf(Double.java:492)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:390)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.utils.SedesHelper.readGenericTuple(SedesHelper.java:144)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:344)
> >>   at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:313)
> >>   at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:208)
> >>   at org.apache.pig.builtin.COR.combine(COR.java:258)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:171)
> >>   at org.apache.pig.builtin.COR$Intermed.exec(COR.java:164)
> >> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >>   ... 13 more
> >>
> >> ================================================================================
> >>
> >> "GC overhead limit exceeded" means that too high a percentage of time
> >> is being spent on GC while too little memory is being recovered. This
> >> feature is designed to prevent applications from running for an
> >> extended period of time while making little or no progress because the
> >> heap is too small.
> >>
> >> I tried to disable this in Java with "export
> >> PIG_OPTS=-XX:-UseGCOverheadLimit" to avoid the "GC overhead limit
> >> exceeded" error. Things got better, but the job still fails in the end,
> >> and I can still see the error thrown in one place. I will see if I can
> >> profile the memory usage. No clue so far.
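> >>
> >> Another knob worth trying before disabling the GC limit entirely is
> >> giving each map task a bigger heap from inside the Pig script (just a
> >> sketch; the 2 GB value is arbitrary and has to fit your instance type):
> >>
> >> SET mapred.child.java.opts '-Xmx2048m -XX:-UseGCOverheadLimit';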
> >>
> >> Johnny
> >>
> >>
> >>
> >>
> >> On Thu, Feb 21, 2013 at 11:39 AM, Houssam H. <[email protected]> wrote:
> >>
> >>> Hi,
> >>>
> >>> I have a file with a few hundred columns of doubles, and I am
> >>> interested in creating a correlation matrix of the columns:
> >>>
> >>> A = load 'myData' using PigStorage(':');
> >>> B = group A all;
> >>> D = foreach B generate group,COR(A.$0,A.$1,A.$2);
> >>>
> >>> For N parameters, the COR function will generate N(N-1)/2 correlations.
> >>> This is fine as long as N is less than 100: COR(A.$0,A.$1, .... A.$100);
> >>> However, once N is more than 100 or 200, I get an out-of-memory error
> >>> (of course this depends on the amount of RAM you have):
> >>>
> >>> 883 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 6016: Out of memory.
> >>> 893 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
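> >>>
> >>> (To put numbers on it: N = 100 means 100*99/2 = 4,950 correlations,
> >>> while N = 500 means 500*499/2 = 124,750, so the intermediate state
> >>> COR carries grows quadratically with the number of columns.)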
> >>>
> >>> My file is less than 50 MB, so Pig always runs with only one mapper.
> >>>
> >>> This behavior was the same whether I ran the script locally (pig -x
> >>> local) or on Amazon Elastic MapReduce with multiple instances assigned
> >>> to the job.
> >>>
> >>> Is there a solution that would make it possible to run the correlation
> >>> function for a large number of parameters?
> >>>
> >>> Thank you in advance!
> >>>
> >>> -Houssam
> >>>
> >>
> >>
> >
>
