Re: Use Naïve Bayes on a large CSV

Kevin Moulart Mon, 24 Feb 2014 02:53:26 -0800

Hi again,
I have finally decided to go through Java to build a sequence file for
Naive Bayes, but I still can't find anything that states exactly what the
sequence file should contain for Mahout to process it with Naive Bayes.


I have tried virtually every piece of code I could find on this subject,
with no luck.

My CSV file looks like this:
Label that I want to predict, feature 1, feature 2, ..., feature 1628
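
To make that layout concrete, here is a minimal sketch in plain Java of how one such row could be split into a label and a feature array before any Mahout-specific conversion. The class name CsvRow and the sample row are hypothetical, and the sketch assumes unquoted, comma-separated fields with numeric features; it says nothing about what the Naive Bayes trainer itself expects.

```java
import java.util.Arrays;

/** Hypothetical holder for one CSV row: the label plus its numeric features. */
final class CsvRow {
    final String label;
    final double[] features;

    CsvRow(String label, double[] features) {
        this.label = label;
        this.features = features;
    }

    /**
     * Parses a line of the form "label, f1, f2, ..., fN".
     * Assumes no quoted fields and that every feature is numeric.
     */
    static CsvRow parse(String line) {
        String[] cols = line.split(",");
        if (cols.length < 2) {
            throw new IllegalArgumentException("Expected a label and at least one feature: " + line);
        }
        double[] features = new double[cols.length - 1];
        for (int i = 1; i < cols.length; i++) {
            features[i - 1] = Double.parseDouble(cols[i].trim());
        }
        return new CsvRow(cols[0].trim(), features);
    }

    public static void main(String[] args) {
        // Made-up row with three features instead of 1628.
        CsvRow row = parse("spam, 0.5, 2, 0");
        System.out.println(row.label + " -> " + Arrays.toString(row.features));
        // prints: spam -> [0.5, 2.0, 0.0]
    }
}
```

Reading the file line by line in this way keeps only one row in memory at a time, which matters for a large CSV.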

Could someone tell me exactly what the Naive Bayes training procedure expects?


2014-02-20 13:56 GMT+01:00 Jay Vyas <[email protected]>:

> This relates to a previous question of mine: does Mahout have a concept of
> adapters that let us read CSV-style data with filters to create the exact
> format for its various inputs (e.g. the Recommender three-column format)?
> If not, is it worth a JIRA?
>
>
> > On Feb 20, 2014, at 7:50 AM, Kevin Moulart <[email protected]> wrote:
> >
> > Hi, and thanks!
> >
> > What about the command line? Is there a way to do that using the existing
> > command-line tools?
> >
> >
> >
> >
> > 2014-02-20 12:02 GMT+01:00 Suneel Marthi <[email protected]>:
> >
> >> To convert the input CSV to vectors, you can either:
> >>
> >> a) use CSVIterator
> >> b) use InputDriver
> >>
> >> Either of the above should generate vectors from the input CSV that
> >> could then be fed into Mahout classifier/clustering jobs.
> >>
> >>
> >>
> >>
> >>
> >> On Thursday, February 20, 2014 5:57 AM, Kevin Moulart <
> >> [email protected]> wrote:
> >>
> >> Hi, I'm trying to apply a Naive Bayes classifier to a large CSV file
> >> from the command line.
> >>
> >> I know I have to feed the classifier a sequence file, so I tried to put
> >> my CSV into one using the seqdirectory command, but even when I try with
> >> a really small CSV (less than 100 MB) I instantly get an
> >> OutOfMemoryError (Java heap space):
> >>
> >>> mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
> >>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> >>> Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
> >>> and HADOOP_CONF_DIR=/etc/hadoop/conf
> >>> MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
> >>> 14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments:
> >>> {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647],
> >>> --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter],
> >>> --input=[/user/cacf/Echant/testSeq], --keyPrefix=[],
> >>> --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0],
> >>> --tempDir=[temp]}
> >>> 14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
> >>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> >>>     at java.util.Arrays.copyOf(Arrays.java:2367)
> >>>     at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
> >>>     at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
> >>>     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
> >>>     at java.lang.StringBuilder.append(StringBuilder.java:132)
> >>>     at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
> >>>     at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
> >>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
> >>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
> >>>     at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
> >>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> >>>     at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
> >>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>>     at java.lang.reflect.Method.invoke(Method.java:606)
> >>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> >>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> >>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
> >>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>>     at java.lang.reflect.Method.invoke(Method.java:606)
> >>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
> >>
> >>
> >> Do you have an idea or a simple way to use Naive Bayes against my
> >> large CSV?
> >>
> >> Thanks in advance!
> >> --
> >> Kévin Moulart
> >> GSM France : +33 7 81 06 10 10
> >> GSM Belgique : +32 473 85 23 85
> >> Téléphone fixe : +32 2 771 88 45
> >
>



-- 
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45
