Hi, and thanks! What about the command line: is there a way to do that with the existing command-line tools?

2014-02-20 12:02 GMT+01:00 Suneel Marthi <[email protected]>:

> To convert input CSV to vectors, u can either:
>
> a) Use CSVIterator
> b) use InputDriver
>
> Either of the above should generate vectors from input CSV that could then
> be fed into Mahout classifier/clustering jobs.
>
>
> On Thursday, February 20, 2014 5:57 AM, Kevin Moulart <
> [email protected]> wrote:
>
> Hi I'm trying to apply a Naive Bayes Classifier to a large CSV file from
> the command line.
>
> I know I have to feed the classifier with a seq file, so I tried to put my
> csv into one using the command seqdirectory, but even when I try with a
> really small csv (less than 100Mo) I instantly get an outOfMemoryException
> from java heap space :
>
> > mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
> > MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> > Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
> > and HADOOP_CONF_DIR=/etc/hadoop/conf
> > MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
> > 14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments:
> > {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647],
> > --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter],
> > --input=[/user/cacf/Echant/testSeq], --keyPrefix=[],
> > --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0],
> > --tempDir=[temp]}
> > 14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
> > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> >     at java.util.Arrays.copyOf(Arrays.java:2367)
> >     at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
> >     at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
> >     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
> >     at java.lang.StringBuilder.append(StringBuilder.java:132)
> >     at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
> >     at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
> >     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
> >     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
> >     at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> >     at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >     at java.lang.reflect.Method.invoke(Method.java:606)
> >     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
> >     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
> >     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >     at java.lang.reflect.Method.invoke(Method.java:606)
> >     at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
>
> Do you have an idea or a simple way to use Naive Bayes against my large CSV ?
>
> Thanks in advance !
> --
> Kévin Moulart
> GSM France : +33 7 81 06 10 10
> GSM Belgique : +32 473 85 23 85
> Téléphone fixe : +32 2 771 88 45

--
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45
