This relates to a previous question of mine: does Mahout have a concept of
adapters that let us read CSV-style data, apply filters, and produce the
exact format each of its inputs expects (e.g. the recommender's
three-column format)? If not, is it worth a JIRA?
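To make the adapter idea concrete, here is a minimal sketch in plain Java.
It is not an existing Mahout API: the class name CsvToPreferenceAdapter, the
column indices, and the filter are all hypothetical, and the comma split is
naive (no quoted fields). It only shows how a wider CSV row could be filtered
and reduced to user,item,preference lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical adapter (not part of Mahout): keeps rows accepted by a filter
// and emits only the user, item and preference columns as "user,item,pref".
class CsvToPreferenceAdapter {
    private final int userCol;
    private final int itemCol;
    private final int prefCol;
    private final Predicate<String[]> filter;

    CsvToPreferenceAdapter(int userCol, int itemCol, int prefCol,
                           Predicate<String[]> filter) {
        this.userCol = userCol;
        this.itemCol = itemCol;
        this.prefCol = prefCol;
        this.filter = filter;
    }

    List<String> adapt(List<String> csvLines) {
        int maxCol = Math.max(userCol, Math.max(itemCol, prefCol));
        List<String> out = new ArrayList<>();
        for (String line : csvLines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            // Naive split: assumes no quoted fields containing commas.
            String[] fields = line.split(",", -1);
            if (fields.length <= maxCol || !filter.test(fields)) {
                continue; // skip short rows and rows rejected by the filter
            }
            out.add(fields[userCol].trim() + "," + fields[itemCol].trim()
                    + "," + fields[prefCol].trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows: date,user,item,channel,rating
        List<String> input = Arrays.asList(
                "2014-02-20,u1,i9,web,4.5",
                "2014-02-20,u2,i3,mobile,2.0");
        CsvToPreferenceAdapter adapter =
                new CsvToPreferenceAdapter(1, 2, 4, f -> f[3].equals("web"));
        adapter.adapt(input).forEach(System.out::println);
    }
}
```

A real adapter would presumably stream rows instead of holding them in a
list, and would need proper CSV parsing for quoted fields.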
On Feb 20, 2014, at 7:50 AM, Kevin Moulart <[email protected]>
wrote:
Hi and thanks!
What about the command line: is there a way to do that using the existing
command-line tools?
2014-02-20 12:02 GMT+01:00 Suneel Marthi <[email protected]>:
To convert input CSV to vectors, you can either:
a) use CSVIterator
b) use InputDriver
Either of the above should generate vectors from the input CSV, which can
then be fed into Mahout classifier/clustering jobs.
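As a rough illustration of what "generating vectors from input CSV" means,
here is a sketch in plain Java. It does not use CSVIterator or InputDriver
and makes no claim about how they work; the class CsvRowVectorizer, the
LabeledRow type, and the choice of label column are assumptions made for the
example. Rows are read one at a time, so the whole file never has to sit in
memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.function.Consumer;

// Hypothetical helper (not a Mahout class): turns each CSV row into a label
// plus a dense array of numeric features.
class CsvRowVectorizer {

    static final class LabeledRow {
        final String label;
        final double[] features;

        LabeledRow(String label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    // Parses one row; every column except labelCol must be numeric.
    static LabeledRow parseLine(String line, int labelCol) {
        String[] fields = line.split(",", -1); // naive split, no quoted fields
        if (labelCol < 0 || labelCol >= fields.length) {
            throw new IllegalArgumentException("No label column in: " + line);
        }
        double[] features = new double[fields.length - 1];
        int k = 0;
        for (int i = 0; i < fields.length; i++) {
            if (i != labelCol) {
                features[k++] = Double.parseDouble(fields[i].trim());
            }
        }
        return new LabeledRow(fields[labelCol].trim(), features);
    }

    // Streams rows to the sink one by one and returns how many were parsed.
    static int forEachRow(Reader source, int labelCol, Consumer<LabeledRow> sink) {
        int count = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                sink.accept(parseLine(line, labelCol));
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample data: two numeric features, label in the last column.
        String csv = "1.0,2.5,yes\n0.0,3.0,no\n";
        forEachRow(new StringReader(csv), 2, row ->
                System.out.println(row.label + " " + Arrays.toString(row.features)));
    }
}
```

In an actual pipeline the sink would write each row to whatever vector
format the downstream job expects, rather than printing it.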
On Thursday, February 20, 2014 5:57 AM, Kevin Moulart <
[email protected]> wrote:
Hi, I'm trying to apply a Naive Bayes classifier to a large CSV file from
the command line.
I know I have to feed the classifier a sequence file, so I tried to convert
my CSV with the seqdirectory command, but even with a really small CSV
(less than 100 MB) I instantly get an OutOfMemoryError (Java heap space):
mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/user/cacf/Echant/testSeq], --keyPrefix=[], --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
    at java.lang.StringBuilder.append(StringBuilder.java:132)
    at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
    at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
    at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Do you have an idea, or know of a simple way, to run Naive Bayes against my
large CSV?
Thanks in advance!
--
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45