Thanks, that worked out perfectly on my 4GB of data!
Is there a way to run the Random Forest Partial Implementation on sparse
vectors instead of a tabular CSV file? I'd like to classify based on
TF/IDF vectors, which could have hundreds of thousands of columns if
translated into a CSV format. Ideally I'd use the output of the
seq2sparse command as classification input, similar to the Naïve Bayes
classify-20newsgroups.sh example.

If running the Random Forest Partial Implementation on sparse vectors is
not possible, are there any tools to go from the TF/IDF output of
seq2sparse to a CSV format? Conceptually this doesn't seem like a hard
coding task, but I always prefer to reuse someone else's work when that
option is available.
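
In case it helps frame the question, below is roughly the conversion I'd
otherwise write myself. This is an untested sketch: it assumes the
tfidf-vectors output of seq2sparse is a SequenceFile of Text keys and
VectorWritable values (my reading of the docs), and the class name and
paths are just placeholders. Note that it densifies every row, which is
exactly the blow-up I'd like to avoid:

import java.io.PrintWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class TfidfVectorsToCsv {
  public static void main(String[] args) throws Exception {
    // args[0]: a part file under seq2sparse's tfidf-vectors output
    // args[1]: local CSV file to write
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    FileSystem fs = input.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    PrintWriter csv = new PrintWriter(args[1]);
    try {
      Text docId = new Text();
      VectorWritable vw = new VectorWritable();
      while (reader.next(docId, vw)) {
        Vector v = vw.get();
        StringBuilder row = new StringBuilder();
        // Writing all v.size() fields densifies the sparse vector,
        // which is the 100,000s-of-columns explosion mentioned above.
        for (int i = 0; i < v.size(); i++) {
          if (i > 0) row.append(',');
          row.append(v.get(i));
        }
        csv.println(row);
      }
    } finally {
      reader.close();
      csv.close();
    }
  }
}
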
Regards,
Adam
On Thu, Dec 20, 2012 at 11:19 PM, deneche abdelhakim <[email protected]> wrote:
> Hi Adam,
>
> The BreimanExample is just meant as a test and an example; it doesn't even
> use MapReduce. Take a look at the following instead:
>
> https://cwiki.apache.org/MAHOUT/partial-implementation.html
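>
> If I remember correctly, the invocation on that page looks roughly like
> the following (the dataset names and flag values are from the wiki's
> NSL-KDD example, so adjust them to your data):
>
> hadoop jar mahout-examples-job.jar \
>     org.apache.mahout.classifier.df.mapreduce.BuildForest \
>     -Dmapred.max.split.size=1874231 \
>     -d KDDTrain+.arff -ds KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest
>
> and then TestForest to measure the error on held-out data:
>
> hadoop jar mahout-examples-job.jar \
>     org.apache.mahout.classifier.df.mapreduce.TestForest \
>     -i KDDTest+.arff -ds KDDTrain+.info -m nsl-forest -a -mr -o predictions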
>
> On Fri, Dec 21, 2012 at 2:59 AM, Marty Kube <[email protected]> wrote:
>
> > Hi Adam,
> >
> > This is an interesting problem. Increasing the heap size is not
> > necessarily going to solve the issue. The error you have:
> >
> > Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
> > exceeded
> >
> > is due to too much CPU time spent in GC, as opposed to not enough heap
> > allocation. Decreasing your heap allocation may in fact help, as GC is
> > more efficient on a smaller heap. You may have to consider GC tuning.
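> >
> > As a starting point, something along these lines in
> > mapred.child.java.opts would confirm the diagnosis with GC logs before
> > you commit to a heap size (these flags are illustrative, not a
> > recommendation):
> >
> > -Xmx4096m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> >
> > There is also -XX:-UseGCOverheadLimit, which disables that particular
> > check, but it usually just trades this error for a plain
> > java.lang.OutOfMemoryError later.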
> >
> > On 12/20/2012 08:32 PM, Adam Baron wrote:
> >
> >> I'm trying to run org.apache.mahout.classifier.df.BreimanExample on a
> >> custom data set of roughly 4GB: 500 numerical columns, one categorical
> >> label column with two possible values, and ~4 million rows. I already
> >> ran org.apache.mahout.classifier.df.tools.Describe to generate the
> >> dataset's *.info file.
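> >>
> >> For reference, my Describe invocation was along the lines of the
> >> following (paths are placeholders; the descriptor declares 500
> >> numerical attributes followed by the label):
> >>
> >> hadoop jar mahout-examples-job.jar \
> >>     org.apache.mahout.classifier.df.tools.Describe \
> >>     -p /path/to/data.csv -f /path/to/data.info -d 500 N L
> >>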
> >> However, despite bumping my mapred.child.java.opts up to -Xmx12288m, I
> >> still get the memory error below:
> >>
> >> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
> >>         at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1222)
> >>         at java.lang.Double.parseDouble(Double.java:510)
> >>         at org.apache.mahout.classifier.df.data.DataConverter.convert(DataConverter.java:64)
> >>         at org.apache.mahout.classifier.df.data.DataLoader.loadData(DataLoader.java:130)
> >>         at org.apache.mahout.classifier.df.BreimanExample.run(BreimanExample.java:187)
> >>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>         at org.apache.mahout.classifier.df.BreimanExample.main(BreimanExample.java:125)
> >>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>         at java.lang.reflect.Method.invoke(Method.java:597)
> >>         at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> >>
> >> I'm running on a pretty significant Hadoop cluster which has no
> >> problem running other sizable Mahout jobs, such as K-Means clustering
> >> on hundreds of GB of n-gram TF/IDF files, so I'm thinking this is more
> >> of a configuration/code issue than a hardware issue. The small
> >> glass.data example from the website
> >> (https://cwiki.apache.org/MAHOUT/breiman-example.html) worked
> >> flawlessly.
> >>
> >> I realize that if I decide to pursue Random Forest classification
> >> further, I'll need to write my own code to classify through a
> >> DecisionForest going forward (beyond the training set), since the
> >> BreimanExample is an example, not a tool. However, for this initial
> >> foray I merely want to see what kind of test error numbers my custom
> >> data set would yield, preferably without writing any custom code.
> >>
> >> Any suggestions?
> >>
> >> Thanks,
> >> Adam
> >>
> >>
> >
>