Hi again, I have finally decided to go through Java to build a sequence file for Naive Bayes, but I still can't find anywhere that states exactly what the sequence file must contain for Mahout to process it with Naive Bayes.
I tried virtually every piece of code I found related to this subject, with no luck.

My CSV file looks like this:

Label that I want to predict, feature 1, feature 2, ..., feature 1628

Could someone tell me exactly what the Naive Bayes training procedure expects?

2014-02-20 13:56 GMT+01:00 Jay Vyas <[email protected]>:

> This relates to a previous question I have: Does mahout have a concept of
> adapters which allow us to read data csv style data with filters to create
> exact format for its various inputs (i.e. Recommender three column
> format).? If not is it worth a jira?
>
> > On Feb 20, 2014, at 7:50 AM, Kevin Moulart <[email protected]> wrote:
> >
> > Hi and thanks !
> >
> > What about the command line, is there a way to do that using the existing
> > command line ?
> >
> > 2014-02-20 12:02 GMT+01:00 Suneel Marthi <[email protected]>:
> >
> >> To convert input CSV to vectors, u can either:
> >>
> >> a) Use CSVIterator
> >> b) use InputDriver
> >>
> >> Either of the above should generate vectors from input CSV that could then
> >> be fed into Mahout classifier/clustering jobs.
> >>
> >> On Thursday, February 20, 2014 5:57 AM, Kevin Moulart <[email protected]> wrote:
> >>
> >> Hi I'm trying to apply a Naive Bayes Classifier to a large CSV file from
> >> the command line.
> >>
> >> I know I have to feed the classifier with a seq file, so I tried to put my
> >> csv into one using the command seqdirectory, but even when I try with a
> >> really small csv (less than 100 MB) I instantly get an OutOfMemoryError
> >> from java heap space :
> >>
> >>> mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
> >>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> >>> Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
> >>> MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
> >>> 14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/user/cacf/Echant/testSeq], --keyPrefix=[], --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
> >>> 14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
> >>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> >>>     at java.util.Arrays.copyOf(Arrays.java:2367)
> >>>     at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
> >>>     at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
> >>>     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
> >>>     at java.lang.StringBuilder.append(StringBuilder.java:132)
> >>>     at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
> >>>     at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
> >>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
> >>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
> >>>     at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
> >>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> >>>     at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
> >>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>>     at java.lang.reflect.Method.invoke(Method.java:606)
> >>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> >>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> >>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
> >>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>>     at java.lang.reflect.Method.invoke(Method.java:606)
> >>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
> >>
> >> Do you have an idea or a simple way to use Naive Bayes against my large CSV ?
> >>
> >> Thanks in advance !
> >> --
> >> Kévin Moulart
> >> GSM France : +33 7 81 06 10 10
> >> GSM Belgique : +32 473 85 23 85
> >> Téléphone fixe : +32 2 771 88 45
> >
> > --
> > Kévin Moulart

--
Kévin Moulart
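
A note on the expected input, offered as a best guess and not as a confirmed answer from this thread: as far as I can tell from the Mahout source, trainnb reads a SequenceFile<Text, VectorWritable> in which each key looks like /label/id, and it takes the label from the text between the first two slashes. Under that assumption, each CSV row would become one record whose key is built from the first column and whose vector holds the remaining columns. The plain-Java sketch below only shows that mapping. The class NbCsvRow and its methods are hypothetical names made up for illustration, and the actual write to a sequence file (with Hadoop and Mahout on the classpath) is deliberately left out.

```java
import java.util.Arrays;

/**
 * Hypothetical helper (not part of Mahout): maps one CSV row of the form
 * "label, feature 1, ..., feature N" to a key and a feature array.
 * Assumption: the training job derives the label from a key shaped "/label/id".
 */
final class NbCsvRow {
    final String key;        // assumed key layout: "/<label>/<rowId>"
    final double[] features; // values that would go into the vector

    NbCsvRow(String key, double[] features) {
        this.key = key;
        this.features = features;
    }

    /** Parses one CSV line; the first column is the label, the rest are numeric features. */
    static NbCsvRow parse(String csvLine, long rowId) {
        String[] cols = csvLine.split(",");
        if (cols.length < 2) {
            throw new IllegalArgumentException("need a label and at least one feature");
        }
        String label = cols[0].trim();
        if (label.isEmpty() || label.contains("/")) {
            // A slash inside the label would break the "/label/id" convention.
            throw new IllegalArgumentException("label must be non-empty and contain no '/'");
        }
        double[] features = new double[cols.length - 1];
        for (int i = 1; i < cols.length; i++) {
            features[i - 1] = Double.parseDouble(cols[i].trim());
        }
        return new NbCsvRow("/" + label + "/" + rowId, features);
    }

    /** Recovers the label the same way the key is assumed to be read: split on '/' and take element 1. */
    static String labelOf(String key) {
        return key.split("/")[1];
    }

    public static void main(String[] args) {
        NbCsvRow row = parse("spam, 1.0, 0, 3.5", 42);
        System.out.println(row.key + " -> " + Arrays.toString(row.features));
        System.out.println("label = " + labelOf(row.key));
    }
}
```

In a real job, the key would be wrapped in a Hadoop Text and the feature array in a Mahout VectorWritable before being appended to the sequence file; reading the CSV line by line this way also avoids loading the whole file into memory.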
