Re: Use Naïve Bayes on a large CSV

Kevin Moulart Thu, 20 Feb 2014 04:52:07 -0800

Hi, and thanks!

What about the command line? Is there a way to do that using the existing
command line?





2014-02-20 12:02 GMT+01:00 Suneel Marthi <[email protected]>:

> To convert the input CSV to vectors, you can either:
>
> a) use CSVIterator
> b) use InputDriver
>
> Either of the above should generate vectors from the input CSV that can then
> be fed into Mahout classifier/clustering jobs.
>
>
>
>
>
> On Thursday, February 20, 2014 5:57 AM, Kevin Moulart <
> [email protected]> wrote:
>
> Hi, I'm trying to apply a Naive Bayes classifier to a large CSV file from
> the command line.
>
> I know I have to feed the classifier a sequence file, so I tried to put my
> CSV into one using the seqdirectory command, but even when I try with a
> really small CSV (less than 100 MB) I immediately get an OutOfMemoryError
> (Java heap space):
>
> > mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
> > MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> > Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
> > and HADOOP_CONF_DIR=/etc/hadoop/conf
> > MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
> > 14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments:
> > {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647],
> > --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter],
> > --input=[/user/cacf/Echant/testSeq], --keyPrefix=[],
> > --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0],
> > --tempDir=[temp]}
> > 14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
> > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> >     at java.util.Arrays.copyOf(Arrays.java:2367)
> >     at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
> >     at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
> >     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
> >     at java.lang.StringBuilder.append(StringBuilder.java:132)
> >     at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
> >     at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
> >     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
> >     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
> >     at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> >     at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >     at java.lang.reflect.Method.invoke(Method.java:606)
> >     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> >     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> >     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >     at java.lang.reflect.Method.invoke(Method.java:606)
> >     at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>
>
> Do you have an idea, or a simple way to use Naive Bayes on my large CSV?
>
> Thanks in advance!
> --
> Kévin Moulart
> GSM France : +33 7 81 06 10 10
> GSM Belgique : +32 473 85 23 85
> Téléphone fixe : +32 2 771 88 45
>
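
For illustration, the CSV-to-vector step discussed in the quoted messages can be sketched in plain Java. The stack trace fails inside StringBuilder.append called from PrefixAdditionFilter.process, which suggests the file content was being accumulated into one large string; the sketch below instead handles one row at a time. This is not Mahout code: the class name CsvRowStreamer, the LabeledRow type, and the column layout (label first, numeric features after, no quoted commas) are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.function.Consumer;

// Hypothetical helper, not part of Mahout: streams a CSV row by row.
public class CsvRowStreamer {

    /** One parsed CSV row: a class label plus its numeric features. */
    public static final class LabeledRow {
        public final String label;
        public final double[] features;

        LabeledRow(String label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    /**
     * Parses a single CSV line. Assumption: the first column is the label,
     * the remaining columns are numeric, and no field contains a comma.
     */
    public static LabeledRow parseLine(String line) {
        String[] cols = line.split(",");
        double[] features = new double[cols.length - 1];
        for (int i = 1; i < cols.length; i++) {
            features[i - 1] = Double.parseDouble(cols[i].trim());
        }
        return new LabeledRow(cols[0].trim(), features);
    }

    /**
     * Reads the input one line at a time and hands each parsed row to the
     * callback, so only the current line is held as text. Blank lines are
     * skipped. Returns the number of rows processed.
     */
    public static long forEachRow(BufferedReader reader, Consumer<LabeledRow> sink)
            throws IOException {
        long count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue;
            }
            sink.accept(parseLine(line));
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Tiny in-memory sample standing in for a real CSV file.
        String sample = "spam,1.0,0.5,3\nham,0.0,2.5,1\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            long n = forEachRow(reader, row ->
                    System.out.println(row.label + " -> " + row.features.length + " features"));
            System.out.println(n + " rows");
        }
    }
}
```

In an actual pipeline the callback would presumably turn each row into a vector and append it to a sequence file; those calls are left out here because the thread does not show them.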



