NaiveBayes expects a SequenceFile as input. The key is the class label as a Text, and the value is the feature vector as a VectorWritable.
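
To make that concrete for a CSV whose first column is the label, here is a minimal sketch of the per-row conversion. CsvInstanceParser and Instance are hypothetical names, not Mahout classes, and the "/label/rowId" key shape is an assumption: some Mahout releases appear to extract the label from keys of that form, so check it against the version you run. The sketch uses only the JDK; the comment in main shows where the Hadoop and Mahout types would come in.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout: turns one CSV row into the
// (key, features) pair that would be written to the SequenceFile.
final class CsvInstanceParser {

    /** One training instance: the future Text key and the feature values. */
    static final class Instance {
        final String key;        // would become an org.apache.hadoop.io.Text
        final double[] features; // would be wrapped in a DenseVector and a VectorWritable

        Instance(String key, double[] features) {
            this.key = key;
            this.features = features;
        }
    }

    /**
     * Parses a row of the form "label,f1,f2,...,fn".
     * The key is built as "/label/rowId"; that shape is an assumption,
     * verify what your Mahout version expects.
     */
    static Instance parse(String csvLine, long rowId) {
        String[] cols = csvLine.split(",");
        if (cols.length < 2) {
            throw new IllegalArgumentException("Need a label and at least one feature: " + csvLine);
        }
        String label = cols[0].trim();
        double[] features = new double[cols.length - 1];
        for (int i = 1; i < cols.length; i++) {
            features[i - 1] = Double.parseDouble(cols[i].trim());
        }
        return new Instance("/" + label + "/" + rowId, features);
    }

    public static void main(String[] args) {
        Instance inst = parse("spam, 1.0, 0.0, 3.5", 42L);
        System.out.println(inst.key + " -> " + Arrays.toString(inst.features));
        // With Hadoop and Mahout on the classpath, each instance would then be
        // appended to a SequenceFile.Writer opened with Text keys and
        // VectorWritable values, roughly:
        //   writer.append(new Text(inst.key),
        //                 new VectorWritable(new DenseVector(inst.features)));
    }
}
```

Reading the CSV one line at a time and appending each row as it is parsed keeps memory use flat, which should matter for a file with 1628 features per row.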

--sebastian

On 02/24/2014 11:51 AM, Kevin Moulart wrote:
Hi again,

I finally decided to go through Java to build a sequence file for Naive Bayes, but I still can't find anywhere that states exactly what the sequence file should contain for Mahout to process it with Naive Bayes.

I tried virtually every piece of code I found related to this subject, with no luck.

My CSV file looks like this:
Label that I want to predict, feature 1, feature 2, ..., feature 1628

Could someone tell me exactly what the Naive Bayes training procedure expects?


2014-02-20 13:56 GMT+01:00 Jay Vyas:

This relates to a previous question of mine: does Mahout have a concept of adapters that let us read CSV-style data, with filters, to create the exact format required by its various inputs (e.g. the Recommender three-column format)? If not, is it worth a JIRA?


On Feb 20, 2014, at 7:50 AM, Kevin Moulart wrote:

Hi and thanks!

What about the command line? Is there a way to do that using the existing command-line tools?




2014-02-20 12:02 GMT+01:00 Suneel Marthi:

To convert input CSV to vectors, you can either:

a) Use CSVIterator
b) Use InputDriver

Either of the above should generate vectors from the input CSV that could then be fed into Mahout classifier/clustering jobs.





On Thursday, February 20, 2014 5:57 AM, Kevin Moulart wrote:

Hi, I'm trying to apply a Naive Bayes classifier to a large CSV file from the command line.

I know I have to feed the classifier with a sequence file, so I tried to convert my CSV into one using the seqdirectory command, but even with a really small CSV (less than 100 MB) I instantly get an OutOfMemoryError (Java heap space):

mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/user/cacf/Echant/testSeq], --keyPrefix=[], --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
    at java.lang.StringBuilder.append(StringBuilder.java:132)
    at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
    at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
    at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)


Do you have any ideas, or a simple way to run Naive Bayes against my large CSV?

Thanks in advance!
--
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45







