Hi folks,
I'm trying to run the Wikipedia Bayes Example and got stuck at step 8, "Train the classifier":

$MAHOUT_HOME/bin/mahout trainclassifier -i wikipediainput -o wikipediamodel

[from https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example]

All the steps before that worked just fine. I downloaded the mahout-distribution-0.4.zip file and I'm running it with Hadoop on Ubuntu 10.04.

This is the exception I get:

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: hdfs://localhost:9000/user/david/wikipediamodel/trainer-termDocCount
Input path does not exist: hdfs://localhost:9000/user/david/wikipediamodel/trainer-wordFreq
Input path does not exist: hdfs://localhost:9000/user/david/wikipediamodel/trainer-featureCount

I tried to create those folders manually with mkdir from the Hadoop shell commands page, but when I re-ran the "Train the classifier" command, the wikipediamodel folder was deleted and recreated, again without the trainer-* folders.

I'm not using the full Wikipedia data, because 27 GB (plus the chunk files) is too much for my HDD; instead I'm using the chunk-000*.xml files provided in the sources from [http://www.ibm.com/developerworks/java/library/j-mahout/]. I do hope that's not the reason...

Thanks and regards,
David

PS: This is the full output:

da...@david-lenovotop:~$ $MAHOUT_HOME/bin/mahout trainclassifier -i wikipediainput -o wikipediamodel
Running on hadoop, using HADOOP_HOME=/home/david/Programme/hadoop
HADOOP_CONF_DIR=/home/david/Programme/hadoop/conf
10/11/16 16:37:36 INFO bayes.TrainClassifier: Training Bayes Classifier
10/11/16 16:37:36 INFO common.HadoopUtil: Deleting wikipediamodel
10/11/16 16:37:36 INFO bayes.BayesDriver: Reading features...
10/11/16 16:37:36 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
10/11/16 16:37:37 INFO mapred.FileInputFormat: Total input paths to process : 1
10/11/16 16:37:37 INFO mapred.JobClient: Running job: job_201011161633_0003
10/11/16 16:37:38 INFO mapred.JobClient:  map 0% reduce 0%
10/11/16 16:37:47 INFO mapred.JobClient:  map 100% reduce 0%
10/11/16 16:37:56 INFO mapred.JobClient:  map 100% reduce 100%
10/11/16 16:37:58 INFO mapred.JobClient: Job complete: job_201011161633_0003
10/11/16 16:37:58 INFO mapred.JobClient: Counters: 15
10/11/16 16:37:58 INFO mapred.JobClient:   Job Counters
10/11/16 16:37:58 INFO mapred.JobClient:     Launched reduce tasks=1
10/11/16 16:37:58 INFO mapred.JobClient:     Launched map tasks=1
10/11/16 16:37:58 INFO mapred.JobClient:   FileSystemCounters
10/11/16 16:37:58 INFO mapred.JobClient:     FILE_BYTES_READ=6
10/11/16 16:37:58 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=44
10/11/16 16:37:58 INFO mapred.JobClient:   Map-Reduce Framework
10/11/16 16:37:58 INFO mapred.JobClient:     Reduce input groups=0
10/11/16 16:37:58 INFO mapred.JobClient:     Combine output records=0
10/11/16 16:37:58 INFO mapred.JobClient:     Map input records=0
10/11/16 16:37:58 INFO mapred.JobClient:     Reduce shuffle bytes=6
10/11/16 16:37:58 INFO mapred.JobClient:     Reduce output records=0
10/11/16 16:37:58 INFO mapred.JobClient:     Spilled Records=0
10/11/16 16:37:58 INFO mapred.JobClient:     Map output bytes=0
10/11/16 16:37:58 INFO mapred.JobClient:     Map input bytes=0
10/11/16 16:37:58 INFO mapred.JobClient:     Combine input records=0
10/11/16 16:37:58 INFO mapred.JobClient:     Map output records=0
10/11/16 16:37:58 INFO mapred.JobClient:     Reduce input records=0
10/11/16 16:37:58 INFO bayes.BayesDriver: Calculating Tf-Idf...
10/11/16 16:37:58 INFO common.BayesTfIdfDriver: Counts of documents in Each Label
10/11/16 16:37:58 INFO common.BayesTfIdfDriver: {}
10/11/16 16:37:58 INFO common.BayesTfIdfDriver: {dataSource=hdfs, alpha_i=1.0, minDf=1, gramSize=1}
10/11/16 16:37:58 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: hdfs://localhost:9000/user/david/wikipediamodel/trainer-termDocCount
Input path does not exist: hdfs://localhost:9000/user/david/wikipediamodel/trainer-wordFreq
Input path does not exist: hdfs://localhost:9000/user/david/wikipediamodel/trainer-featureCount
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
	at org.apache.mahout.classifier.bayes.mapreduce.common.BayesTfIdfDriver.runJob(BayesTfIdfDriver.java:130)
	at org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesDriver.runJob(BayesDriver.java:49)
	at org.apache.mahout.classifier.bayes.TrainClassifier.trainNaiveBayes(TrainClassifier.java:54)
	at org.apache.mahout.classifier.bayes.TrainClassifier.main(TrainClassifier.java:162)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
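PPS: In case it matters, this is roughly what my manual folder-creation attempt looked like. It's a sketch, not exactly what I typed: the three trainer-* paths are taken from the error message above, and I'm assuming the `hadoop fs -mkdir` form from the shell commands page. It's written as a dry run (echo instead of execute) so the commands are just printed:

```shell
# Dry-run sketch of the manual mkdir attempt.
# The trainer-* subdirectory names come from the InvalidInputException;
# MODEL_DIR matches the HDFS path shown in the error message.
MODEL_DIR=/user/david/wikipediamodel
for sub in trainer-termDocCount trainer-wordFreq trainer-featureCount; do
  echo hadoop fs -mkdir "${MODEL_DIR}/${sub}"
done
```

As described above, creating the folders this way didn't help, because the trainer step deletes and recreates wikipediamodel before it runs.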
