All right, I've managed to narrow it down to the LabelIndex. I went to read the code, but it really isn't clear to me at all. What exactly should I provide as the label index?
As a reminder, one line of my original file looks like:

0, 0.3222, 0, 1.543, ...
1, 0, 1.42, 1.12, ...

with the 0 and 1 being the labels I'm trying to learn and the rest being the data. For now I have the previously mentioned Java code that creates the SequenceFile from my CSV, but when I then try to run trainnb on it, it tries to create a label index and fails with an ArrayIndexOutOfBoundsException: 1. Could someone tell me how to create the index, even manually at this point?

Thanks in advance!

2014-02-24 15:41 GMT+01:00 Kevin Moulart <[email protected]>:

> I'll do that as soon as I manage to make it work ^^', that's a great idea!
>
> I'm stuck with this for now:
>
>> public static void main(String[] args) throws IOException,
>>         InterruptedException, ClassNotFoundException {
>>     Configuration conf = new Configuration(true);
>>     FileSystem fs = FileSystem.get(conf);
>>     BufferedReader reader = new BufferedReader(new FileReader(args[1]));
>>     Path filePath = new Path(args[2]);
>>     if (fs.exists(filePath))
>>         fs.delete(filePath, true);
>>     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
>>             filePath, Text.class, VectorWritable.class);
>>     try {
>>         String line;
>>         while ((line = reader.readLine()) != null) {
>>             String[] c = line.split(args[3]);
>>             if (c.length > 1) {
>>                 double[] d = new double[c.length];
>>                 for (int i = 1; i < c.length; i++)
>>                     d[i] = Double.parseDouble(c[i]);
>>                 Vector vec = new RandomAccessSparseVector(c.length);
>>                 vec.assign(d);
>>                 VectorWritable writable = new VectorWritable();
>>                 writable.set(vec);
>>                 writer.append(new Text(c[0]), writable);
>>             }
>>         }
>>         writer.close();
>>     } catch (Throwable t) {
>>         t.printStackTrace();
>>     }
>>     reader.close();
>> }
>
> which produces a sequence file, but Mahout's trainnb doesn't seem to like
> it that much, so I'm working on it for the moment.
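[Editor's note] On the label-index question itself: in Mahout 0.7/0.8, the trainnb job can build the label index for you when the class labels are the Text keys of the input SequenceFile. A sketch, with placeholder paths; the exact flag spellings should be double-checked against `mahout trainnb --help` for your version:

```shell
# Hypothetical paths. -el tells trainnb to extract the label index from
# the Text keys of the input SequenceFile; -li is where the generated
# index is written, so no manual index file should be needed.
mahout trainnb \
  -i /user/cacf/vectors \
  -o /user/cacf/model \
  -li /user/cacf/labelindex \
  -el -ow
```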
2014-02-24 15:37 GMT+01:00 Ted Dunning <[email protected]>:

> Kevin,
>
> While this is fresh in your mind, can you prepare a javadoc patch that
> would have helped you out? And suggest other doc patches as well?
>
> On Mon, Feb 24, 2014 at 3:00 AM, Kevin Moulart <[email protected]> wrote:
>
>> Thanks, that's about the clearest answer I got so far :)
>>
>> 2014-02-24 11:59 GMT+01:00 Sebastian Schelter <[email protected]>:
>>
>>> NaiveBayes expects a SequenceFile as input. The key is the class label
>>> as Text, the value is the features as VectorWritable.
>>>
>>> --sebastian
>>>
>>> On 02/24/2014 11:51 AM, Kevin Moulart wrote:
>>>
>>>> Hi again,
>>>> I finally set my mind on going through Java to make a sequence file
>>>> for the naive Bayes, but I still can't find anywhere that states
>>>> exactly what should be in the sequence file for Mahout to process it
>>>> with Naive Bayes.
>>>>
>>>> I tried virtually every piece of code I found related to this
>>>> subject, with no luck.
>>>>
>>>> My CSV file is like this:
>>>> label that I want to predict, feature 1, feature 2, ..., feature 1628
>>>>
>>>> Could someone tell me exactly what the Naive Bayes training procedure
>>>> expects?
>>>>
>>>> 2014-02-20 13:56 GMT+01:00 Jay Vyas <[email protected]>:
>>>>
>>>>> This relates to a previous question I have: does Mahout have a
>>>>> concept of adapters which would allow us to read CSV-style data with
>>>>> filters to create the exact format for its various inputs (i.e. the
>>>>> recommender's three-column format)? If not, is it worth a JIRA?
>>>>>
>>>>> On Feb 20, 2014, at 7:50 AM, Kevin Moulart <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi and thanks!
>>>>>> What about the command line, is there a way to do that using the
>>>>>> existing command line?
>>>>>>
>>>>>> 2014-02-20 12:02 GMT+01:00 Suneel Marthi <[email protected]>:
>>>>>>
>>>>>>> To convert the input CSV to vectors, you can either:
>>>>>>>
>>>>>>> a) use CSVIterator
>>>>>>> b) use InputDriver
>>>>>>>
>>>>>>> Either of the above should generate vectors from the input CSV that
>>>>>>> could then be fed into Mahout classifier/clustering jobs.
>>>>>>>
>>>>>>> On Thursday, February 20, 2014 5:57 AM, Kevin Moulart <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>> Hi, I'm trying to apply a naive Bayes classifier to a large CSV
>>>>>>> file from the command line.
>>>>>>>
>>>>>>> I know I have to feed the classifier a seq file, so I tried to put
>>>>>>> my CSV into one using the seqdirectory command, but even with a
>>>>>>> really small CSV (less than 100 MB) I instantly get an
>>>>>>> OutOfMemoryError from Java heap space:
>>>>>>>
>>>>>>>> mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o
>>>>>>>> "/user/cacf/resSeq" -ow
>>>>>>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>>>>>>> Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
>>>>>>>> and HADOOP_CONF_DIR=/etc/hadoop/conf
>>>>>>>> MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
>>>>>>>> 14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments:
>>>>>>>> {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647],
>>>>>>>> --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter],
>>>>>>>> --input=[/user/cacf/Echant/testSeq], --keyPrefix=[],
>>>>>>>> --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0],
>>>>>>>> --tempDir=[temp]}
>>>>>>>> 14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
>>>>>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>>>>>>     at java.util.Arrays.copyOf(Arrays.java:2367)
>>>>>>>>     at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
>>>>>>>>     at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
>>>>>>>>     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
>>>>>>>>     at java.lang.StringBuilder.append(StringBuilder.java:132)
>>>>>>>>     at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
>>>>>>>>     at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
>>>>>>>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
>>>>>>>>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
>>>>>>>>     at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
>>>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>>>>>>     at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
>>>>>>>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
>>>>>>>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>>>>>
>>>>>>> Do you have an idea or a simple way to use naive Bayes against my
>>>>>>> large CSV?
>>>>>>>
>>>>>>> Thanks in advance!
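[Editor's note] On the heap-space error above: the OutOfMemoryError occurs in the client-side driver (the `main` thread), and the stock `bin/mahout` launcher script reads the `MAHOUT_HEAPSIZE` environment variable (a value in MB) for the client JVM, so raising it is the usual first step. A sketch, assuming the stock launcher; 4096 is an arbitrary example value:

```shell
# MAHOUT_HEAPSIZE (in MB) is honored by the stock bin/mahout launcher
# script; it sizes the client-side JVM where the OOM above occurred.
export MAHOUT_HEAPSIZE=4096
mahout seqdirectory -i /user/cacf/Echant/testSeq -o /user/cacf/resSeq -ow
```

Note also that `seqdirectory` is aimed at directories of text documents, not numeric CSV feature matrices, so (as the replies above suggest) a custom SequenceFile writer or InputDriver is likely the better route regardless of heap size.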
--
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45
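[Editor's note] To recap Sebastian's answer in runnable form: each CSV row maps to a Text key (the class label in column 0) and the remaining columns become the vector payload. The plain-Java sketch below (class and method names are made up for illustration; the Hadoop/Mahout writer classes are left out) also sidesteps an off-by-one in the writer code earlier in the thread, where `d[0]` is never assigned and the vector carries a spurious leading zero slot:

```java
import java.util.Arrays;

public class CsvRow {

    // Column 0 of a CSV row is the class label (the SequenceFile's Text key).
    static String label(String line, String sep) {
        return line.split(sep)[0].trim();
    }

    // Columns 1..n-1 are the numeric features (the VectorWritable payload).
    // The array has length c.length - 1, so no phantom zero column is left
    // at index 0 of the vector.
    static double[] features(String line, String sep) {
        String[] c = line.split(sep);
        double[] d = new double[c.length - 1];
        for (int i = 1; i < c.length; i++) {
            d[i - 1] = Double.parseDouble(c[i].trim());
        }
        return d;
    }

    public static void main(String[] args) {
        String row = "1, 0.3222, 0, 1.543";
        System.out.println(label(row, ","));                     // 1
        System.out.println(Arrays.toString(features(row, ","))); // [0.3222, 0.0, 1.543]
    }
}
```

With rows parsed this way, the key/value pair handed to the SequenceFile writer matches the layout NaiveBayes expects: `new Text(label(...))` as the key and the features wrapped in a VectorWritable as the value.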
