Re: Use Naïve Bayes on a large CSV

Kevin Moulart Mon, 24 Feb 2014 03:01:34 -0800

Thanks, that's about the clearest answer I got so far :)
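
For anyone landing on this thread with the same CSV layout (label first, then numeric features), here is a minimal Java sketch of the parsing half of the conversion. It is only an illustration under assumptions: the class name LabeledRow is made up, fields are assumed to be unquoted and comma-separated, and the actual write to the SequenceFile is only indicated in comments rather than performed, since that part depends on the Hadoop and Mahout versions on the cluster.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Mahout): one CSV row split into label and features. */
final class LabeledRow {
    final String label;       // would become the Text key of the SequenceFile record
    final double[] features;  // would become the VectorWritable value

    LabeledRow(String label, double[] features) {
        this.label = label;
        this.features = features;
    }

    /** Parses "label,f1,f2,...,fn"; assumes no quoted fields or embedded commas. */
    static LabeledRow parse(String line) {
        String[] cols = line.split(",");
        if (cols.length < 2) {
            throw new IllegalArgumentException(
                "Expected a label and at least one feature: " + line);
        }
        double[] features = new double[cols.length - 1];
        for (int i = 1; i < cols.length; i++) {
            features[i - 1] = Double.parseDouble(cols[i].trim());
        }
        return new LabeledRow(cols[0].trim(), features);
    }

    public static void main(String[] args) {
        LabeledRow row = parse("spam, 0.5, 1, 3.25");
        // With Hadoop and Mahout on the classpath, each row would presumably be
        // appended through a SequenceFile.Writer as
        //   new Text(row.label), new VectorWritable(new DenseVector(row.features))
        // Assumption to verify: some Mahout releases may expect the key in a
        // "/label/id" shape rather than the bare label.
        System.out.println(row.label + " -> " + Arrays.toString(row.features));
    }
}
```

Reading the CSV line by line and parsing each row this way keeps memory use flat, whatever the file size.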


2014-02-24 11:59 GMT+01:00 Sebastian Schelter <[email protected]>:

> NaiveBayes expects a SequenceFile as input. The key is the class label as
> a Text; the value is the feature vector as a VectorWritable.
>
> --sebastian
>
>
> On 02/24/2014 11:51 AM, Kevin Moulart wrote:
>
>> Hi again,
>> I finally decided to go through Java to build a sequence file for Naive
>> Bayes, but I still can't find anywhere that states exactly what the
>> sequence file should contain for Mahout to process it with Naive Bayes.
>>
>> I tried virtually every piece of code I found related to this subject,
>> with no luck.
>>
>> My CSV file looks like this:
>> Label that I want to predict, feature 1, feature 2, ..., feature 1628
>>
>> Could someone tell me exactly what the Naive Bayes training procedure
>> expects?
>>
>>
>> 2014-02-20 13:56 GMT+01:00 Jay Vyas <[email protected]>:
>>
>>> This relates to a previous question I have: does Mahout have a concept
>>> of adapters that let us read CSV-style data with filters to create the
>>> exact format for its various inputs (e.g. the Recommender three-column
>>> format)? If not, is it worth a JIRA?
>>>
>>>
>>> On Feb 20, 2014, at 7:50 AM, Kevin Moulart <[email protected]> wrote:
>>>
>>>>
>>>> Hi and thanks!
>>>>
>>>> What about the command line: is there a way to do that using the
>>>> existing command line?
>>>>
>>>>
>>>>
>>>>
>>>> 2014-02-20 12:02 GMT+01:00 Suneel Marthi <[email protected]>:
>>>>
>>>>> To convert input CSV to vectors, you can either:
>>>>>
>>>>> a) Use CSVIterator
>>>>> b) Use InputDriver
>>>>>
>>>>> Either of the above should generate vectors from the input CSV that could
>>>>> then be fed into Mahout classifier/clustering jobs.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thursday, February 20, 2014 5:57 AM, Kevin Moulart <[email protected]> wrote:
>>>>>
>>>>> Hi, I'm trying to apply a Naive Bayes classifier to a large CSV file
>>>>> from the command line.
>>>>>
>>>>> I know I have to feed the classifier a sequence file, so I tried to
>>>>> convert my CSV into one with the seqdirectory command, but even with a
>>>>> really small CSV (less than 100 MB) I instantly get an OutOfMemoryError
>>>>> (Java heap space):
>>>>>
>>>>> mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
>>>>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>>>>> Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
>>>>>> MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
>>>>>> 14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments:
>>>>>> {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647],
>>>>>> --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter],
>>>>>> --input=[/user/cacf/Echant/testSeq], --keyPrefix=[],
>>>>>> --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0],
>>>>>> --tempDir=[temp]}
>>>>>> 14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
>>>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>>>> at java.util.Arrays.copyOf(Arrays.java:2367)
>>>>>> at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
>>>>>> at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
>>>>>> at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
>>>>>> at java.lang.StringBuilder.append(StringBuilder.java:132)
>>>>>> at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
>>>>>> at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
>>>>>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
>>>>>> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
>>>>>> at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>>>>>> at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
>>>>>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
>>>>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>>>>>>
>>>>>
>>>>>
>>>>> Do you have an idea or a simple way to use Naive Bayes on my large CSV?
>>>>>
>>>>> Thanks in advance !
>>>>> --
>>>>> Kévin Moulart
>>>>> GSM France : +33 7 81 06 10 10
>>>>> GSM Belgique : +32 473 85 23 85
>>>>> Téléphone fixe : +32 2 771 88 45
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>


-- 
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45
