Hi All, I'm having trouble getting the Twenty Newsgroups example (https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups and https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html) to run.
I've downloaded the data and tried to train the Naive Bayes classifier, but when I ran the 'trainclassifier' command I got this error:

hadoop@kdevlinux:/usr/local/mahout$ mahout trainclassifier -i examples/bin/work/20news-bydate/bayes-train-input -o examples/bin/work/20news-bydate/bayes-model -type bayes -ng 1 -source hdfs
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/src/conf
11/04/13 09:16:29 WARN driver.MahoutDriver: Unable to add class: org.apache.mahout.utils.eval.InMemoryFactorizationEvaluator
11/04/13 09:16:29 WARN driver.MahoutDriver: Unable to add class: org.apache.mahout.utils.eval.ParallelFactorizationEvaluator
11/04/13 09:16:29 WARN driver.MahoutDriver: Unable to add class: org.apache.mahout.utils.eval.DatasetSplitter
11/04/13 09:16:29 INFO bayes.TrainClassifier: Training Bayes Classifier
11/04/13 09:16:29 INFO bayes.BayesDriver: Reading features...
11/04/13 09:16:30 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/04/13 09:16:31 INFO mapred.FileInputFormat: Total input paths to process : 20
Exception in thread "main" java.lang.IllegalArgumentException: Illegal Capacity: -40
    at java.util.ArrayList.<init>(ArrayList.java:110)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:216)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.mahout.classifier.bayes.mapreduce.common.BayesFeatureDriver.runJob(BayesFeatureDriver.java:63)
    at org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesDriver.runJob(BayesDriver.java:47)
    at org.apache.mahout.classifier.bayes.TrainClassifier.trainNaiveBayes(TrainClassifier.java:54)
    at org.apache.mahout.classifier.bayes.TrainClassifier.main(TrainClassifier.java:162)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I thought that maybe I had entered the command incorrectly, but then I found the 'build-20news-bayes.sh' shell script, and when I run that I get the same exception.

I've been running Hadoop 0.20.2 on a 4-node cluster smoothly until now; all the nodes are Debian machines using the sun-java6-* packages, and I'm running Mahout trunk checked out of the svn repository (svn co http://svn.apache.org/repos/asf/mahout/trunk) today. All the <newsgroup>.txt files seem to have been created and uploaded to HDFS correctly ('hadoop dfs -lsr examples/bin/work').

I'm not sure what to try next. Any help would be very welcome.

Ken
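
P.S. In case it helps: my (possibly wrong) reading of the trace is that FileInputFormat.getSplits() is being handed a negative split-count hint of -40 and passes it straight to the ArrayList constructor, which is the only way ArrayList produces an "Illegal Capacity" message. Here is a minimal, self-contained Java sketch of just that ArrayList behaviour (my own test code, not anything from Hadoop or Mahout):

    import java.util.ArrayList;
    import java.util.List;

    public class IllegalCapacityDemo {
        public static void main(String[] args) {
            // Value taken from the exception message in my run.
            int numSplits = -40;
            // A negative initial capacity makes ArrayList throw
            // java.lang.IllegalArgumentException: Illegal Capacity: -40
            List<Object> splits = new ArrayList<Object>(numSplits);
            System.out.println(splits); // never reached
        }
    }

So I'm assuming (but have not confirmed) that something on the job-submission side is supplying -40 as the number of map tasks for this job; I haven't found where that value could be coming from in my configuration.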
