All right, I've managed to narrow it down to the LabelIndex. I went to read the code, but it really isn't clear to me at all. What exactly should I provide as the label index?
As a reminder, one line of my original file looks like:

0, 0.3222, 0, 1.543, ...
1, 0, 1.42, 1.12, ...

with the 0 and 1 being the labels I'm trying to learn and the rest being the data. For now I have the previously mentioned Java code that creates the SequenceFile from my CSV, but when I then try to run trainnb on it, it tries to create a label index and fails with an ArrayIndexOutOfBoundsException: 1. Could someone tell me how to create the index, even manually at this point?

Thanks in advance!

2014-02-24 15:41 GMT+01:00 Kevin Moulart <[email protected]>:

> I'll do that as soon as I manage to make it work ^^', that's a great idea!
>
> I'm stuck with this for now:
>
>> public static void main(String[] args) throws IOException,
>>         InterruptedException, ClassNotFoundException {
>>     Configuration conf = new Configuration(true);
>>     FileSystem fs = FileSystem.get(conf);
>>     BufferedReader reader = new BufferedReader(new FileReader(args[1]));
>>     Path filePath = new Path(args[2]);
>>     if (fs.exists(filePath))
>>         fs.delete(filePath, true);
>>     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
>>             filePath, Text.class, VectorWritable.class);
>>     try {
>>         String line;
>>         while ((line = reader.readLine()) != null) {
>>             String[] c = line.split(args[3]);
>>             if (c.length > 1) {
>>                 double[] d = new double[c.length];
>>                 for (int i = 1; i < c.length; i++)
>>                     d[i] = Double.parseDouble(c[i]);
>>                 Vector vec = new RandomAccessSparseVector(c.length);
>>                 vec.assign(d);
>>                 VectorWritable writable = new VectorWritable();
>>                 writable.set(vec);
>>                 writer.append(new Text(c[0]), writable);
>>             }
>>         }
>>         writer.close();
>>     } catch (Throwable t) {
>>         t.printStackTrace();
>>     }
>>     reader.close();
>> }
>
> which produces a sequence file, but Mahout's trainnb doesn't seem to like
> it that much, so I'm working on it for the moment.
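[Editor's note] On the label-index question itself: in Mahout 0.7/0.8, the trainnb job can build the label index for you when the class labels are the Text keys of the input SequenceFile. A sketch, with placeholder paths; the exact flag spellings should be double-checked against `mahout trainnb --help` for your version:

```shell
# Hypothetical paths. -el tells trainnb to extract the label index from
# the Text keys of the input SequenceFile; -li is where the generated
# index is written, so no manual index file should be needed.
mahout trainnb \
  -i /user/cacf/vectors \
  -o /user/cacf/model \
  -li /user/cacf/labelindex \
  -el -ow
```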
2014-02-24 15:37 GMT+01:00 Ted Dunning <[email protected]>:

> Kevin,
>
> While this is fresh in your mind, can you prepare a javadoc patch that
> would have helped you out? And suggest other doc patches as well?
>
> On Mon, Feb 24, 2014 at 3:00 AM, Kevin Moulart <[email protected]> wrote:
>
>> Thanks, that's about the clearest answer I got so far :)
>>
>> 2014-02-24 11:59 GMT+01:00 Sebastian Schelter <[email protected]>:
>>
>>> NaiveBayes expects a SequenceFile as input. The key is the class label
>>> as Text, the value is the features as VectorWritable.
>>>
>>> --sebastian
>>>
>>> On 02/24/2014 11:51 AM, Kevin Moulart wrote:
>>>
>>>> Hi again,
>>>> I finally set my mind on going through Java to make a sequence file
>>>> for the naive Bayes, but I still can't find anywhere that states
>>>> exactly what should be in the sequence file for Mahout to process it
>>>> with Naive Bayes.
>>>>
>>>> I tried virtually every piece of code I found related to this
>>>> subject, with no luck.
>>>>
>>>> My CSV file is like this:
>>>> label that I want to predict, feature 1, feature 2, ..., feature 1628
>>>>
>>>> Could someone tell me exactly what the Naive Bayes training procedure
>>>> expects?
>>>>
>>>> 2014-02-20 13:56 GMT+01:00 Jay Vyas <[email protected]>:
>>>>
>>>>> This relates to a previous question I have: does Mahout have a
>>>>> concept of adapters which would allow us to read CSV-style data with
>>>>> filters to create the exact format for its various inputs (i.e. the
>>>>> recommender's three-column format)? If not, is it worth a JIRA?
>>>>>
>>>>> On Feb 20, 2014, at 7:50 AM, Kevin Moulart <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi and thanks!
>>>>>> What about the command line, is there a way to do that using the
>>>>>> existing command line?
>>>>>>
>>>>>> 2014-02-20 12:02 GMT+01:00 Suneel Marthi <[email protected]>:
>>>>>>
>>>>>>> To convert the input CSV to vectors, you can either:
>>>>>>>
>>>>>>> a) use CSVIterator
>>>>>>> b) use InputDriver
>>>>>>>
>>>>>>> Either of the above should generate vectors from the input CSV that
>>>>>>> could then be fed into Mahout classifier/clustering jobs.
>>>>>>>
>>>>>>> On Thursday, February 20, 2014 5:57 AM, Kevin Moulart <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>> Hi, I'm trying to apply a naive Bayes classifier to a large CSV
>>>>>>> file from the command line.
>>>>>>>
>>>>>>> I know I have to feed the classifier a seq file, so I tried to put
>>>>>>> my CSV into one using the seqdirectory command, but even with a
>>>>>>> really small CSV (less than 100 MB) I instantly get an
>>>>>>> OutOfMemoryError from Java heap space:
>>>>>>>
>>>>>>>> mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o
>>>>>>>> "/user/cacf/resSeq" -ow
>>>>>>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>>>>>>> Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
>>>>>>>> and HADOOP_CONF_DIR=/etc/hadoop/conf
>>>>>>>> MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
>>>>>>>> 14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments:
>>>>>>>> {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647],
>>>>>>>> --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter],
>>>>>>>> --input=[/user/cacf/Echant/testSeq], --keyPrefix=[],
>>>>>>>> --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0],
>>>>>>>> --tempDir=[temp]}
>>>>>>>> 14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
>>>>>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>>>>>>     at java.util.Arrays.copyOf(Arrays.java:2367)
>>>>>>>>     at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
>>>>>>>>     at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
>>>>>>>>     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
>>>>>>>>     at java.lang.StringBuilder.append(StringBuilder.java:132)
>>>>>>>>     at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
>>>>>>>>     at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
>>>>>>>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
>>>>>>>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
>>>>>>>>     at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
>>>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>>>>>>     at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
>>>>>>>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
>>>>>>>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>>>>>
>>>>>>> Do you have an idea or a simple way to use naive Bayes against my
>>>>>>> large CSV?
>>>>>>>
>>>>>>> Thanks in advance!
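[Editor's note] On the heap-space error above: the OutOfMemoryError occurs in the client-side driver (the `main` thread), and the stock `bin/mahout` launcher script reads the `MAHOUT_HEAPSIZE` environment variable (a value in MB) for the client JVM, so raising it is the usual first step. A sketch, assuming the stock launcher; 4096 is an arbitrary example value:

```shell
# MAHOUT_HEAPSIZE (in MB) is honored by the stock bin/mahout launcher
# script; it sizes the client-side JVM where the OOM above occurred.
export MAHOUT_HEAPSIZE=4096
mahout seqdirectory -i /user/cacf/Echant/testSeq -o /user/cacf/resSeq -ow
```

Note also that `seqdirectory` is aimed at directories of text documents, not numeric CSV feature matrices, so (as the replies above suggest) a custom SequenceFile writer or InputDriver is likely the better route regardless of heap size.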
--
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45
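[Editor's note] To recap Sebastian's answer in runnable form: each CSV row maps to a Text key (the class label in column 0) and the remaining columns become the vector payload. The plain-Java sketch below (class and method names are made up for illustration; the Hadoop/Mahout writer classes are left out) also sidesteps an off-by-one in the writer code earlier in the thread, where `d[0]` is never assigned and the vector carries a spurious leading zero slot:

```java
import java.util.Arrays;

public class CsvRow {

    // Column 0 of a CSV row is the class label (the SequenceFile's Text key).
    static String label(String line, String sep) {
        return line.split(sep)[0].trim();
    }

    // Columns 1..n-1 are the numeric features (the VectorWritable payload).
    // The array has length c.length - 1, so no phantom zero column is left
    // at index 0 of the vector.
    static double[] features(String line, String sep) {
        String[] c = line.split(sep);
        double[] d = new double[c.length - 1];
        for (int i = 1; i < c.length; i++) {
            d[i - 1] = Double.parseDouble(c[i].trim());
        }
        return d;
    }

    public static void main(String[] args) {
        String row = "1, 0.3222, 0, 1.543";
        System.out.println(label(row, ","));                     // 1
        System.out.println(Arrays.toString(features(row, ","))); // [0.3222, 0.0, 1.543]
    }
}
```

With rows parsed this way, the key/value pair handed to the SequenceFile writer matches the layout NaiveBayes expects: `new Text(label(...))` as the key and the features wrapped in a VectorWritable as the value.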
