To convert the input CSV to vectors, you can either:
a) use CSVIterator, or
b) use InputDriver.
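To make the conversion step concrete, here is a minimal sketch in plain Python. It is not Mahout code and does not reproduce what CSVIterator or InputDriver do internally. It only illustrates the idea of turning each CSV row into a label plus a numeric vector, one row at a time, rather than treating the whole file as a single text document (which is what the StringBuilder frames in the stack trace below suggest seqdirectory is doing). The function name, the header row, and the label sitting in the last column are all assumptions made for the example.

```python
import csv
import io


def csv_rows_to_vectors(lines, label_column=-1, has_header=True):
    """Yield (label, features) pairs, one per CSV row.

    Rows are processed lazily, so the whole file never has to sit in memory.
    Assumes every non-label column is numeric (hypothetical layout).
    """
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row if there is one
    for row in reader:
        if not row:
            continue  # ignore blank lines
        label_index = label_column % len(row)  # normalise negative indices
        label = row[label_index]
        features = [float(v) for i, v in enumerate(row) if i != label_index]
        yield label, features


# Tiny in-memory sample standing in for a real CSV file.
sample = io.StringIO("f1,f2,f3,label\n1.0,2.5,0,yes\n0.5,1.5,3,no\n")
vectors = list(csv_rows_to_vectors(sample))
```

With a real file you would pass an open file handle instead of the StringIO object and write each pair out as it is produced.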
Either approach should generate vectors from the input CSV that can then be fed into Mahout's classifier/clustering jobs.

On Thursday, February 20, 2014 5:57 AM, Kevin Moulart <kevinmoul...@gmail.com> wrote:

Hi,

I'm trying to apply a Naive Bayes Classifier to a large CSV file from the command line.

I know I have to feed the classifier with a seq file, so I tried to put my CSV into one using the seqdirectory command, but even when I try with a really small CSV (less than 100 MB) I instantly get an OutOfMemoryError for Java heap space:

mahout seqdirectory -i "/user/cacf/Echant/testSeq" -o "/user/cacf/resSeq" -ow
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
> and HADOOP_CONF_DIR=/etc/hadoop/conf
> MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
> 14/02/20 11:47:22 INFO common.AbstractJob: Command line arguments:
> {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647],
> --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter],
> --input=[/user/cacf/Echant/testSeq], --keyPrefix=[],
> --output=[/user/cacf/resSeq], --overwrite=null, --startPhase=[0],
> --tempDir=[temp]}
> 14/02/20 11:47:22 INFO common.HadoopUtil: Deleting /user/cacf/resSeq
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at java.util.Arrays.copyOf(Arrays.java:2367)
>     at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
>     at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
>     at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
>     at java.lang.StringBuilder.append(StringBuilder.java:132)
>     at org.apache.mahout.text.PrefixAdditionFilter.process(PrefixAdditionFilter.java:62)
>     at org.apache.mahout.text.SequenceFilesFromDirectoryFilter.accept(SequenceFilesFromDirectoryFilter.java:90)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1468)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1502)
>     at org.apache.mahout.text.SequenceFilesFromDirectory.run(SequenceFilesFromDirectory.java:98)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>     at org.apache.mahout.text.SequenceFilesFromDirectory.main(SequenceFilesFromDirectory.java:53)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:196)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

Do you have an idea or a simple way to use Naive Bayes against my large CSV?

Thanks in advance!

--
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45