I finally managed to make it run. I had to format the class label in the input file with a "/" in the name, so I put Yes/1 or No/0 instead of just 1 or 0.
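
My reading of why the slash is needed, which is an assumption and not something confirmed in this thread: the label seems to be taken from the SequenceFile key by splitting it on "/" and keeping the second piece, so a bare key like 1 has no second piece to take. Here is a plain-Java sketch of that behaviour. The class and method names are made up for illustration and are not Mahout's.

```java
// Hypothetical illustration only: mimics "split the key on '/' and keep piece 1".
class LabelKeySketch {

    // Assumption: the label is the second "/"-separated piece of the key.
    static String extractLabel(String key) {
        String[] parts = key.split("/");
        if (parts.length < 2) {
            // A bare "1" or "0" ends up here: there is no index 1 to read.
            throw new IllegalArgumentException("key has no '/' separator: " + key);
        }
        return parts[1];
    }

    public static void main(String[] args) {
        String[] keys = {"Yes/1", "No/0", "/1/row42"};
        for (String key : keys) {
            System.out.println(key + " -> " + extractLabel(key));
        }
    }
}
```

Under that assumption, what comes before the first slash is ignored, so Yes/1 and No/0 yield the labels 1 and 0.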
But then I noticed when testing the model that it doesn't classify all the data:

14/02/25 16:16:30 INFO mapred.JobClient: Map-Reduce Framework
14/02/25 16:16:30 INFO mapred.JobClient: Map input records=*300000*
14/02/25 16:16:30 INFO mapred.JobClient: Map output records=300000
14/02/25 16:16:30 INFO mapred.JobClient: Input split bytes=476
14/02/25 16:16:30 INFO mapred.JobClient: Spilled Records=0
14/02/25 16:16:30 INFO mapred.JobClient: CPU time spent (ms)=32000
14/02/25 16:16:30 INFO mapred.JobClient: Physical memory (bytes) snapshot=834502656
14/02/25 16:16:30 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3738030080
14/02/25 16:16:30 INFO mapred.JobClient: Total committed heap usage (bytes)=918552576
14/02/25 16:16:31 INFO test.TestNaiveBayesDriver: Standard NB Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances   :  36078   91.3552%
Incorrectly Classified Instances :   3414    8.6448%
Total Classified Instances       : *39492*
=======================================================
Confusion Matrix
-------------------------------------------------------
a      b      <--Classified as
34445  2114   |  36559  a = 0
1300   1633   |  2933   b = 1

I ran testnb on the exact same file I used to train the model.

Any idea?

2014-02-25 11:33 GMT+01:00 Kevin Moulart:

> All right, I've managed to narrow it down to the LabelIndex. I went to see
> the code, but it isn't really clear to me at all. What exactly should I
> provide as the label index?
>
> As a reminder, lines of my original file look like:
> 0, 0.3222, 0, 1.543, ...
> 1, 0, 1.42, 1.12, ...
>
> With the 0, 1 being the labels I'm trying to learn and the rest being the
> data.
>
> For now I have the previously mentioned Java code that creates the
> SequenceFile from my CSV, but when I then try to run trainnb on it, it
> tries to create a LabelIndex and fails with an
> ArrayIndexOutOfBoundsException: 1.
>
> Could someone tell me how to create the index, even manually at this point?
>
> Thanks in advance!
>
> 2014-02-24 15:41 GMT+01:00 Kevin Moulart:
>
>> I'll do that as soon as I manage to make it work ^^', that's a great idea!
>>
>> I'm stuck with this for now:
>>
>>     public static void main(String[] args) throws IOException,
>>             InterruptedException, ClassNotFoundException {
>>         Configuration conf = new Configuration(true);
>>         FileSystem fs = FileSystem.get(conf);
>>         BufferedReader reader = new BufferedReader(new FileReader(args[1]));
>>         Path filePath = new Path(args[2]);
>>         if (fs.exists(filePath))
>>             fs.delete(filePath, true);
>>         SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
>>                 filePath, Text.class, VectorWritable.class);
>>         try {
>>             String line;
>>             while ((line = reader.readLine()) != null) {
>>                 String[] c = line.split(args[3]);
>>                 if (c.length > 1) {
>>                     double[] d = new double[c.length];
>>                     for (int i = 1; i < c.length; i++)
>>                         d[i] = Double.parseDouble(c[i]);
>>                     Vector vec = new RandomAccessSparseVector(c.length);
>>                     vec.assign(d);
>>                     VectorWritable writable = new VectorWritable();
>>                     writable.set(vec);
>>                     writer.append(new Text(c[0]), writable);
>>                 }
>>             }
>>             writer.close();
>>         } catch (Throwable t) {
>>             t.printStackTrace();
>>         }
>>         reader.close();
>>     }
>>
>> Which produces a sequence file, but Mahout's trainnb doesn't seem to like
>> it that much, so I'm working on it for the moment.
>>
>> 2014-02-24 15:37 GMT+01:00 Ted Dunning:
>>
>>> Kevin,
>>>
>>> While this is fresh in your mind, can you prepare a javadoc patch that
>>> would have helped you out? And suggest other doc patches as well?
>>>
>>> On Mon, Feb 24, 2014 at 3:00 AM, Kevin Moulart wrote:
>>>
>>>> Thanks, that's about the clearest answer I got so far :)
>>>>
>>>> 2014-02-24 11:59 GMT+01:00 Sebastian Schelter:
>>>>
>>>>> NaiveBayes expects a SequenceFile as input. The key is the class label
>>>>> as Text, the value is the features as VectorWritable.
>>>>>
>>>>> --sebastian
>>>>>
>>>>> On 02/24/2014 11:51 AM, Kevin Moulart wrote:
>>>>>
>>>>>> Hi again,
>>>>>> I finally set my mind on going through Java to make a sequence file
>>>>>> for Naive Bayes, but I still can't find anyplace stating exactly what
>>>>>> should be in the sequence file for Mahout to process it with Naive
>>>>>> Bayes.
>>>>>>
>>>>>> I tried virtually every piece of code I found related to this subject,
>>>>>> with no luck.
>>>>>>
>>>>>> My CSV file is like this:
>>>>>> Label that I want to predict, feature 1, feature 2, ..., feature 1628
>>>>>>
>>>>>> Could someone tell me exactly what the Naive Bayes training procedure
>>>>>> expects?
>>>>>>
>>>>>> 2014-02-20 13:56 GMT+01:00 Jay Vyas:
>>>>>>
>>>>>>> This relates to a previous question I have: does Mahout have a
>>>>>>> concept of adapters which allow us to read CSV-style data with
>>>>>>> filters to create the exact format for its various inputs (i.e. the
>>>>>>> Recommender three-column format)? If not, is it worth a JIRA?
>>>>>>>
>>>>>>> On Feb 20, 2014, at 7:50 AM, Kevin Moulart wrote:
>>>>>>>
>>>>>>>> Hi and thanks!
>>>>>>>>
>>>>>>>> What about the command line, is there a way to do that using the
>>>>>>>> existing command line?
>>>>>>>>
>>>>>>>> 2014-02-20 12:02 GMT+01:00 Suneel Marthi:
>>>>>>>>
>>>>>>>>> To convert input CSV to vectors, you can either:
>>>>>>>>>
>>>>>>>>> a) Use CSVIterator
>>>>>>>>> b) Use InputDriver
>>>>>>>>>
>>>>>>>>> Either of the above should generate vectors from input CSV that
>>>>>>>>> could then be fed into Mahout classifier/clustering jobs.
>>>>>>>>>
>>>>>>>>> On Thursday, February 20, 2014 5:57 AM, Kevin Moulart wrote:
>>>>>>>>>
>>>>>>>>>> Hi, I'm trying to apply a Naive Bayes classifier to a large CSV
>>>>>>>>>> file from the command line.
>>>>>>>>>>
>>>>>>>>>> I know I have to feed the classifier with a seq file, so I tried
>>>>>>>>>> to put my CSV into one using the seqdirectory command, but even
>>>>>>>>>> when I try with a really small CSV (less than 100 MB) I instantly
>>>>>>>>>> get an OutOfMemoryError from Java heap space:
>>>>>>>>>>
>>>>>>>>>> mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
>>>>>>>>>>
>>>>>>>>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>>>>>>>>> Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
>>>>>>>>>> MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
>>>>>>>>>> 14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments:
>>>>>>>>>> {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647],
>>>>>>>>>> --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter],
>>>>>>>>>> --input=[/user/cacf/Echant/testSeq], --keyPrefix=[],
>>>>>>>>>> --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0],
>>>>>>>>>> --tempDir=[temp]}
>>>>>>>>>> 14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
>>>>>>>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>>   at java.util.Arrays.copyOf(Arrays.java:2367)
>>>>>>>>>>   at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
>>>>>>>>>>   at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
>>>>>>>>>>   at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
>>>>>>>>>>   at java.lang.StringBuilder.append(StringBuilder.java:132)
>>>>>>>>>>   at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
>>>>>>>>>>   at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
>>>>>>>>>>   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
>>>>>>>>>>   at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
>>>>>>>>>>   at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
>>>>>>>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>>>>>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>>>>>>>>   at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>>>>>   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
>>>>>>>>>>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
>>>>>>>>>>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>>>>>   at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>>>>>>>>
>>>>>>>>>> Do you have an idea or a simple way to use Naive Bayes against my
>>>>>>>>>> large CSV?
>>>>>>>>>>
>>>>>>>>>> Thanks in advance!
>>>>>>>>>> --
>>>>>>>>>> Kévin Moulart
>>>>>>>>>> Mobile (France): +33 7 81 06 10 10
>>>>>>>>>> Mobile (Belgium): +32 473 85 23 85
>>>>>>>>>> Landline: +32 2 771 88 45

--
Kévin Moulart
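
A note on the conversion code quoted earlier in the thread: it sizes the array as c.length and fills it from index 1, so slot 0 (where the label column was) stays at 0.0 and every vector carries one extra, always-zero feature. Whether that affects the results shown at the top is not established here. Below is a minimal plain-Java sketch of the row parsing with the label kept out of the feature array. It deliberately avoids the Mahout and Hadoop classes, and the class name CsvRowSketch is hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper: splits one CSV row into a label and its features.
class CsvRowSketch {
    final String label;
    final double[] features;

    CsvRowSketch(String label, double[] features) {
        this.label = label;
        this.features = features;
    }

    // Column 0 is the label; columns 1..n-1 become features 0..n-2.
    static CsvRowSketch parse(String line, String separator) {
        String[] cols = line.split(separator);
        if (cols.length < 2) {
            throw new IllegalArgumentException("expected a label and at least one feature: " + line);
        }
        double[] features = new double[cols.length - 1];
        for (int i = 1; i < cols.length; i++) {
            // Shift by one so no slot is reserved for the label.
            features[i - 1] = Double.parseDouble(cols[i].trim());
        }
        return new CsvRowSketch(cols[0].trim(), features);
    }

    public static void main(String[] args) {
        CsvRowSketch row = parse("1, 0, 1.42, 1.12", ",");
        System.out.println("label=" + row.label
                + " features=" + Arrays.toString(row.features));
    }
}
```

With this layout the label would go into the Text key and only the features array into the vector, which matches Sebastian's description of the expected input.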
