Re: Use Naïve Bayes on a large CSV

Sebastian Schelter Mon, 24 Feb 2014 03:00:07 -0800

NaiveBayes expects a SequenceFile as input. The key is the class label as Text, and the value is the feature vector as a VectorWritable.
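As a rough illustration of that key/value layout, here is a minimal sketch in plain Java that splits rows of the form "label, feature 1, ..., feature N" into a label and a feature array, reading one line at a time so the whole CSV never sits in memory. The class and method names (CsvRowSplitter, Row, parse, forEachRow) are hypothetical, and the sketch assumes plain comma separation, no quoted fields, and purely numeric features. It deliberately stops short of writing the SequenceFile: in a real job the label would presumably be wrapped in a Hadoop Text and the features in a Mahout VectorWritable (for example around a DenseVector) before being appended with a SequenceFile.Writer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

/** Hypothetical helper: splits "label,f1,...,fN" rows into a label and a feature array. */
final class CsvRowSplitter {

    /** One parsed row: the label maps to the key, the features to the value. */
    static final class Row {
        final String label;
        final double[] features;

        Row(String label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    /** Parses one row. Assumes plain commas, no quoting, and numeric features. */
    static Row parse(String line) {
        String[] cols = line.split(",", -1);
        if (cols.length < 2) {
            throw new IllegalArgumentException(
                "expected a label and at least one feature: " + line);
        }
        double[] features = new double[cols.length - 1];
        for (int i = 1; i < cols.length; i++) {
            // Throws NumberFormatException on an empty or non-numeric field.
            features[i - 1] = Double.parseDouble(cols[i].trim());
        }
        return new Row(cols[0].trim(), features);
    }

    /** Hands rows to the sink one at a time, skips blank lines, returns the row count. */
    static long forEachRow(Reader in, Consumer<Row> sink) {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                // In a Hadoop/Mahout job (assumption), the sink is where the label
                // would become a Text key and the features a VectorWritable value.
                sink.accept(parse(line));
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

Calling CsvRowSplitter.forEachRow(new java.io.FileReader("data.csv"), row -> ...) would visit every record in turn; the file name is only a placeholder.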


--sebastian


On 02/24/2014 11:51 AM, Kevin Moulart wrote:

Hi again,
I finally decided to go through Java to build a sequence file for Naive
Bayes, but I still can't find anything stating exactly what the sequence
file should contain for Mahout to process it with Naive Bayes.

I tried virtually every piece of code I found related to this subject, with
no luck.

My CSV file looks like this:
Label that I want to predict, feature 1, feature 2, ..., feature 1628

Could someone tell me exactly what the Naive Bayes training procedure expects?


2014-02-20 13:56 GMT+01:00 Jay Vyas <[email protected]>:

This relates to a previous question of mine: does Mahout have a concept of
adapters that let us read CSV-style data with filters to create the exact
format for its various inputs (e.g. the Recommender three-column format)?
If not, is it worth a JIRA?

On Feb 20, 2014, at 7:50 AM, Kevin Moulart <[email protected]> wrote:

Hi and thanks!

What about the command line? Is there a way to do that using the existing
command line?




2014-02-20 12:02 GMT+01:00 Suneel Marthi <[email protected]>:

To convert input CSV to vectors, you can either:

a) use CSVIterator
b) use InputDriver

Either of the above should generate vectors from the input CSV that can
then be fed into Mahout classifier/clustering jobs.





On Thursday, February 20, 2014 5:57 AM, Kevin Moulart <[email protected]> wrote:

Hi, I'm trying to apply a Naive Bayes classifier to a large CSV file from
the command line.

I know I have to feed the classifier a seq file, so I tried to put my CSV
into one using the seqdirectory command, but even when I try with a really
small CSV (less than 100 MB) I instantly get an OutOfMemoryError (Java heap
space):

mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments:
{--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647],
--fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter],
--input=[/user/cacf/Echant/testSeq], --keyPrefix=[],
--output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0],
--tempDir=[temp]}
14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
    at java.lang.StringBuilder.append(StringBuilder.java:132)
    at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
    at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
    at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)



Do you have an idea or a simple way to use Naive Bayes against my large CSV?

Thanks in advance!
--
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45




