Thanks, that's about the clearest answer I've got so far :)
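
For anyone finding this thread later, here is a minimal sketch of the first step implied by the answer quoted below: splitting one CSV row of the form "label, feature 1, ..., feature n" into the label (which would become the Text key) and a dense array of features (which would be wrapped in a VectorWritable value, for example via a DenseVector). The class and method names (LabeledRowParser, LabeledRow, parse) are hypothetical and only for illustration. The sketch assumes unquoted, comma-separated fields and purely numeric features; writing the pairs out with a SequenceFile.Writer is deliberately left out, since that part depends on the Hadoop and Mahout versions in use.

```java
/** Hypothetical helper: splits one CSV row into a label and a dense feature array. */
final class LabeledRowParser {

    /** Simple holder for one parsed row (illustrative only, not a Mahout type). */
    static final class LabeledRow {
        final String label;      // would become the Text key of the SequenceFile
        final double[] features; // would be wrapped in a VectorWritable value

        LabeledRow(String label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    /**
     * Parses a row of the form "label,f1,f2,...,fn".
     * Assumption: no quoted fields, and every feature is a plain number.
     */
    static LabeledRow parse(String line) {
        String[] cols = line.split(",");
        if (cols.length < 2) {
            throw new IllegalArgumentException(
                "Expected a label and at least one feature: " + line);
        }
        double[] features = new double[cols.length - 1];
        for (int i = 1; i < cols.length; i++) {
            // Column 0 is the label, so feature i - 1 comes from column i.
            features[i - 1] = Double.parseDouble(cols[i].trim());
        }
        return new LabeledRow(cols[0].trim(), features);
    }
}
```

Each parsed row then maps to one record of the sequence file: the label as the key and the feature array as the value.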

2014-02-24 11:59 GMT+01:00 Sebastian Schelter <[email protected]>:

> NaiveBayes expects a SequenceFile as input. The key is the class label as
> Text, the value are the features as VectorWritable.
>
> --sebastian
>
>
> On 02/24/2014 11:51 AM, Kevin Moulart wrote:
>
>> Hi again,
>> I finally set my mind on going through java to make a sequence file for
>> the naive bayes, but I still can't manage to find anyplace stating
>> exactly what should be in the sequence file for mahout to process it
>> with Naive Bayes.
>>
>> I tried virtually every piece of code i found related to this subject,
>> with no luck.
>>
>> My CSV file is like this :
>> Label that I want to predict, feature 1, feature 2, ..., feature 1628
>>
>> Could someone tell me exactly what Naive Bayes training procedure
>> expects ?
>>
>>
>> 2014-02-20 13:56 GMT+01:00 Jay Vyas <[email protected]>:
>>
>>> This relates to a previous question I have: Does mahout have a concept
>>> of adapters which allow us to read data csv style data with filters to
>>> create exact format for its various inputs (i.e. Recommender three
>>> column format).? If not is it worth a jira?
>>>
>>>
>>> On Feb 20, 2014, at 7:50 AM, Kevin Moulart <[email protected]> wrote:
>>>
>>>> Hi and thanks !
>>>>
>>>> What about the command line, is there a way to do that using the
>>>> existing command line ?
>>>>
>>>>
>>>> 2014-02-20 12:02 GMT+01:00 Suneel Marthi <[email protected]>:
>>>>
>>>>> To convert input CSV to vectors, u can either:
>>>>>
>>>>> a) Use CSVIterator
>>>>> b) use InputDriver
>>>>>
>>>>> Either of the above should generate vectors from input CSV that could
>>>>> then be fed into Mahout classifier/clustering jobs.
>>>>>
>>>>>
>>>>> On Thursday, February 20, 2014 5:57 AM, Kevin Moulart
>>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi I'm trying to apply a Naive Bayes Classifier to a large CSV file
>>>>> from the command line.
>>>>>
>>>>> I know I have to feed the classifier with a seq file, so I tried to
>>>>> put my csv into one using the command seqdirectory, but even when I
>>>>> try with a really small csv (less than 100Mo) I instantly get an
>>>>> outOfMemoryException from java heap space :
>>>>>
>>>>> mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
>>>>>
>>>>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>>>>> Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
>>>>>> and HADOOP_CONF_DIR=/etc/hadoop/conf
>>>>>> MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
>>>>>> 14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments:
>>>>>> {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647],
>>>>>> --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter],
>>>>>> --input=[/user/cacf/Echant/testSeq], --keyPrefix=[],
>>>>>> --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0],
>>>>>> --tempDir=[temp]}
>>>>>> 14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
>>>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>>>> at java.util.Arrays.copyOf(Arrays.java:2367)
>>>>>> at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
>>>>>> at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
>>>>>> at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
>>>>>> at java.lang.StringBuilder.append(StringBuilder.java:132)
>>>>>> at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
>>>>>> at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
>>>>>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
>>>>>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
>>>>>> at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>>>> at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
>>>>>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
>>>>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>>>
>>>>> Do you have an idea or a simple way to use Naive Bayes against my
>>>>> large CSV ?
>>>>>
>>>>> Thanks in advance !
>>>>> --
>>>>> Kévin Moulart
>>>>> GSM France : +33 7 81 06 10 10
>>>>> GSM Belgique : +32 473 85 23 85
>>>>> Téléphone fixe : +32 2 771 88 45
>>>>
>>>>
>>>> --
>>>> Kévin Moulart
>>>> GSM France : +33 7 81 06 10 10
>>>> GSM Belgique : +32 473 85 23 85
>>>> Téléphone fixe : +32 2 771 88 45
>>>
>>
>

--
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45
