Hello all,

We have not been able to get the Reuters k-means clustering example to run without errors on our system for quite a while. We are running Hadoop 0.20.2 on a medium-sized cluster, and have installed Mahout 0.6.
The example shell scripts that were packaged with the release crashed and burned, so I have been following the step-by-step instructions for running k-means on a cluster that are scattered through Chapters 8, 9, and 11 of Mahout In Action. In particular, I've manually downloaded http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz, unpacked it to examples/reuters, and run

$ mvn -e -q exec:java -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" -Dexec.args="reuters/ reuters-extracted/"

to extract the raw text files to reuters-extracted. I then uploaded reuters-extracted/ to HDFS (/user/hadoop/mahout) and ran

$ bin/mahout seqdirectory -c UTF-8 -i mahout/reuters-extracted/ -o mahout/reuters-seqfiles

which seemed to run without error, and

$ bin/mahout seq2sparse -i mahout/reuters-seqfiles/ -o mahout/reuters-vectors -ow

which also seemed to run without error. There is nontrivial data in the reuters-vectors output directory:

$ hadoop fs -du mahout/reuters-vectors
Found 7 items
869751      hdfs://master:54310/user/hadoop/mahout/reuters-vectors/df-count
824086      hdfs://master:54310/user/hadoop/mahout/reuters-vectors/dictionary.file-0
844593      hdfs://master:54310/user/hadoop/mahout/reuters-vectors/frequency.file-0
17148933    hdfs://master:54310/user/hadoop/mahout/reuters-vectors/tf-vectors
16931936    hdfs://master:54310/user/hadoop/mahout/reuters-vectors/tfidf-vectors
15098540    hdfs://master:54310/user/hadoop/mahout/reuters-vectors/tokenized-documents
1018157     hdfs://master:54310/user/hadoop/mahout/reuters-vectors/wordcount

And then I run k-means with the following command line:

$ bin/mahout kmeans -i mahout/reuters-vectors/tfidf-vectors/ -c mahout/reuters-initial-clusters -o mahout/reuters-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0 -k 20 -x 20 -cl

as recommended in Mahout In Action. Here is the output.
The error appears to relate to a problem with the binary format headers of one of the input files, so my debugging skills are exhausted at this point. If anyone has solved a similar problem, I would very much appreciate a hint or two.

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/hadoop/hadoop-0.20.2
HADOOP_CONF_DIR=/home/hadoop/hadoop-0.20.2/conf
MAHOUT-JOB: /home/hadoop/mahout-distribution-0.6/examples/target/mahout-examples-0.6-job.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-0.20.2/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-0.20.2/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
12/05/09 16:42:54 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=mahout/reuters-initial-clusters, --convergenceDelta=1.0, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=mahout/reuters-vectors/tfidf-vectors/, --maxIter=20, --method=mapreduce, --numClusters=20, --output=mahout/reuters-kmeans-clusters, --startPhase=0, --tempDir=temp}
12/05/09 16:42:54 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/05/09 16:42:54 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/05/09 16:42:54 INFO compress.CodecPool: Got brand-new compressor
12/05/09 16:42:56 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to mahout/reuters-initial-clusters/part-randomSeed
12/05/09 16:42:56 INFO kmeans.KMeansDriver: Input: mahout/reuters-vectors/tfidf-vectors Clusters In: mahout/reuters-initial-clusters/part-randomSeed Out: mahout/reuters-kmeans-clusters Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
12/05/09 16:42:56 INFO kmeans.KMeansDriver: convergence: 1.0 max Iterations: 20 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
12/05/09 16:42:56 INFO kmeans.KMeansDriver: K-Means Iteration 1
12/05/09 16:42:58 INFO input.FileInputFormat: Total input paths to process : 1
12/05/09 16:42:58 INFO mapred.JobClient: Running job: job_201205031638_0165
12/05/09 16:42:59 INFO mapred.JobClient: map 0% reduce 0%
12/05/09 16:43:14 INFO mapred.JobClient: Task Id : attempt_201205031638_0165_m_000000_0, Status : FAILED
java.lang.IllegalArgumentException: Unknown flags set: %d [1000000]
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115)
    at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:86)
    at org.apache.mahout.math.VectorWritable.readVector(VectorWritable.java:190)
    at org.apache.mahout.clustering.AbstractCluster.readFields(AbstractCluster.java:98)
    at org.apache.mahout.clustering.DistanceMeasureCluster.readFields(DistanceMeasureCluster.java:53)
    at org.apache.mahout.clustering.kmeans.Cluster.readFields(Cluster.java:70)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:76)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:35)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
    at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
    at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
    at org.apache.mahout.clustering.kmeans.KMeansUtil.configureWithClusterInfo(KMeansUtil.java:42)
    at org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:57)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
attempt_201205031638_0165_m_000000_0: SLF4J: Class path contains multiple SLF4J bindings.
attempt_201205031638_0165_m_000000_0: SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-0.20.2/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201205031638_0165_m_000000_0: SLF4J: Found binding in [file:/mnt/secondary/hadoop/temp/taskTracker/jobcache/job_201205031638_0165/jars/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201205031638_0165_m_000000_0: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

Best,
Alex
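
For anyone trying to reproduce this: the stack trace shows the failure inside KMeansMapper.setup, while the cluster file is being read, so one possible sanity check is to look at the SequenceFile header of a local copy of mahout/reuters-initial-clusters/part-randomSeed and see which key and value classes it declares. Below is a minimal Python sketch of that check. It assumes the version 6 header layout written by Hadoop 0.20 (the magic bytes SEQ, a version byte, the key and value class names as vint-prefixed strings, two compression flags, and a codec class name when the file is compressed). The function names are made up for this sketch and are not part of Hadoop or Mahout.

```python
import io
import struct


def read_vint(stream):
    """Decode one Hadoop variable-length integer (WritableUtils vint layout)."""
    first = struct.unpack("b", stream.read(1))[0]  # first byte, signed
    if first >= -112:
        return first  # small values fit in the first byte itself
    negative = first < -120
    n_bytes = (-120 - first) if negative else (-112 - first)
    value = 0
    for b in stream.read(n_bytes):  # big-endian payload bytes
        value = (value << 8) | b
    return ~value if negative else value


def read_text(stream):
    """Read a string stored as a vint byte length followed by UTF-8 bytes."""
    length = read_vint(stream)
    return stream.read(length).decode("utf-8")


def read_sequencefile_header(stream):
    """Parse the leading fields of a SequenceFile header (assumed version 5+)."""
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile: missing SEQ magic bytes")
    version = stream.read(1)[0]
    if version < 5:
        raise ValueError("header layout for version %d not handled here" % version)
    key_class = read_text(stream)
    value_class = read_text(stream)
    compressed = stream.read(1) != b"\x00"
    block_compressed = stream.read(1) != b"\x00"
    # The codec class name is only present when compression is enabled.
    codec = read_text(stream) if compressed else None
    return {
        "version": version,
        "key_class": key_class,
        "value_class": value_class,
        "compressed": compressed,
        "block_compressed": block_compressed,
        "codec": codec,
    }


def describe(path):
    """Print the header fields of a SequenceFile stored on the local disk."""
    with open(path, "rb") as f:
        for name, value in read_sequencefile_header(f).items():
            print("%s: %s" % (name, value))
```

After copying the file locally with hadoop fs -get mahout/reuters-initial-clusters/part-randomSeed . one would call describe("part-randomSeed"). This only inspects the file header, not the records, so it can confirm what the file claims to contain but cannot by itself explain the flags error.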
