Hi Adam,
This is an interesting problem. Increasing the heap size is not
necessarily going to solve the issue. The error you have:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
is caused by too much CPU time being spent in GC, as opposed to not
enough heap allocation. Decreasing your heap allocation may in fact help,
as GC can be more efficient on a smaller heap. You may have to consider
GC tuning.
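
To check whether GC really is eating the CPU before you start tuning, something like the rough sketch below may help. It uses only the standard java.lang.management API; the class name GcOverheadProbe and its helper methods are made up for illustration and are not part of Mahout or Hadoop. It sums the time reported by each collector and compares it with JVM uptime. As far as I know, HotSpot raises this particular error when roughly 98% of the time goes to GC while less than 2% of the heap is recovered, so a ratio creeping toward that range would support the diagnosis.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration only; not part of Mahout or Hadoop.
public class GcOverheadProbe {

    /** Approximate total milliseconds spent in GC so far, summed over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // returns -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Fraction of elapsed time spent in GC; 0.0 when no time has elapsed. */
    static double gcFraction(long gcMillis, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        return (double) gcMillis / elapsedMillis;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gcMillis = totalGcMillis();
        System.out.printf("GC time: %d ms of %d ms uptime (%.1f%%)%n",
                gcMillis, uptime, 100.0 * gcFraction(gcMillis, uptime));
    }
}
```

You would have to call totalGcMillis() from inside the JVM that is failing, for example by logging it periodically while the data loads. If you cannot change the code, running with -verbose:gc should give you similar information from the outside.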
On 12/20/2012 08:32 PM, Adam Baron wrote:
I'm trying to run the org.apache.mahout.classifier.df.BreimanExample on a
custom set of data that is ~4GB which has 500 Numerical Columns, 1
Categorical Column with two possible label values and ~4 million rows. I
already ran the org.apache.mahout.classifier.df.tools.Describe to generate
the dataset *.info file. However, despite bumping my
mapred.child.java.opts up to -Xmx12288m, I still get the memory error
below:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1222)
at java.lang.Double.parseDouble(Double.java:510)
at org.apache.mahout.classifier.df.data.DataConverter.convert(DataConverter.java:64)
at org.apache.mahout.classifier.df.data.DataLoader.loadData(DataLoader.java:130)
at org.apache.mahout.classifier.df.BreimanExample.run(BreimanExample.java:187)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.classifier.df.BreimanExample.main(BreimanExample.java:125)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
I'm running on a pretty significant Hadoop cluster which has no problem
running other sizable Mahout jobs such as K-Means Clustering on 100s of GB of
n-gram TF/IDF files, so I'm thinking this is more of a configuration/code
issue than a hardware issue. The small glass.data example from the website
(https://cwiki.apache.org/MAHOUT/breiman-example.html) worked flawlessly.
I realize that if I decide to pursue Random Forest classification further,
I'll need to write my own code to classify through a DecisionForest on a
go-forward basis (after the training set) since the BreimanExample is an
example, not a tool. However, for this initial foray I merely want to see
what type of Test Error numbers my custom set of data would yield,
preferably without writing any custom code.
Any suggestions?
Thanks,
Adam