You can run these programs under Eclipse and use breakpoints.

On Wed, May 9, 2012 at 2:13 PM, Alex Hasha <[email protected]> wrote:
> Hello all,
>
> We have not been able to get the reuters k-means clustering example to run
> without errors on our system for quite a while. We are running hadoop
> 0.20.2 on a medium sized cluster, and have installed Mahout 0.6.
>
> The example shell scripts that were packaged with the release crashed and
> burned, so I have been following the step by step instructions for running
> k-means on a cluster that are scattered through Chapters 8, 9, and 11 of
> Mahout In Action.
>
> In particular, I've manually downloaded
> http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz,
> unpacked it to examples/reuters, and run
>
> $ mvn -e -q exec:java -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" -Dexec.args="reuters/ reuters-extracted/"
>
> to extract the raw text files to reuters-extracted. I then uploaded
> reuters-extracted/ to HDFS (/user/hadoop/mahout) and ran
>
> $ bin/mahout seqdirectory -c UTF-8 -i mahout/reuters-extracted/ -o mahout/reuters-seqfiles
>
> which seemed to run without error, and
>
> $ bin/mahout seq2sparse -i mahout/reuters-seqfiles/ -o mahout/reuters-vectors -ow
>
> which also seemed to run without error.
>
> There is nontrivial data in the reuters-vectors output directory:
>
> $ hadoop fs -du mahout/reuters-vectors
> Found 7 items
> 869751 hdfs://master:54310/user/hadoop/mahout/reuters-vectors/df-count
> 824086 hdfs://master:54310/user/hadoop/mahout/reuters-vectors/dictionary.file-0
> 844593 hdfs://master:54310/user/hadoop/mahout/reuters-vectors/frequency.file-0
> 17148933 hdfs://master:54310/user/hadoop/mahout/reuters-vectors/tf-vectors
> 16931936 hdfs://master:54310/user/hadoop/mahout/reuters-vectors/tfidf-vectors
> 15098540 hdfs://master:54310/user/hadoop/mahout/reuters-vectors/tokenized-documents
> 1018157 hdfs://master:54310/user/hadoop/mahout/reuters-vectors/wordcount
>
> And then I run k-means with the following command line:
>
> $ bin/mahout kmeans -i mahout/reuters-vectors/tfidf-vectors/ -c mahout/reuters-initial-clusters -o mahout/reuters-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0 -k 20 -x 20 -cl
>
> As recommended in Mahout In Action. Here is the output. The error appears
> to relate to a problem with the binary format headers of one of the input
> files, so my debugging skills are exhausted at this point. If anyone has
> solved a similar problem, I would be very appreciative for a hint or two.
>
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using HADOOP_HOME=/home/hadoop/hadoop-0.20.2
> HADOOP_CONF_DIR=/home/hadoop/hadoop-0.20.2/conf
> MAHOUT-JOB: /home/hadoop/mahout-distribution-0.6/examples/target/mahout-examples-0.6-job.jar
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-0.20.2/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-0.20.2/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> 12/05/09 16:42:54 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=mahout/reuters-initial-clusters, --convergenceDelta=1.0, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=mahout/reuters-vectors/tfidf-vectors/, --maxIter=20, --method=mapreduce, --numClusters=20, --output=mahout/reuters-kmeans-clusters, --startPhase=0, --tempDir=temp}
> 12/05/09 16:42:54 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 12/05/09 16:42:54 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 12/05/09 16:42:54 INFO compress.CodecPool: Got brand-new compressor
> 12/05/09 16:42:56 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to mahout/reuters-initial-clusters/part-randomSeed
> 12/05/09 16:42:56 INFO kmeans.KMeansDriver: Input: mahout/reuters-vectors/tfidf-vectors Clusters In: mahout/reuters-initial-clusters/part-randomSeed Out: mahout/reuters-kmeans-clusters Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
> 12/05/09 16:42:56 INFO kmeans.KMeansDriver: convergence: 1.0 max Iterations: 20 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
> 12/05/09 16:42:56 INFO kmeans.KMeansDriver: K-Means Iteration 1
> 12/05/09 16:42:58 INFO input.FileInputFormat: Total input paths to process : 1
> 12/05/09 16:42:58 INFO mapred.JobClient: Running job: job_201205031638_0165
> 12/05/09 16:42:59 INFO mapred.JobClient: map 0% reduce 0%
> 12/05/09 16:43:14 INFO mapred.JobClient: Task Id : attempt_201205031638_0165_m_000000_0, Status : FAILED
> java.lang.IllegalArgumentException: Unknown flags set: %d [1000000]
>     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115)
>     at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:86)
>     at org.apache.mahout.math.VectorWritable.readVector(VectorWritable.java:190)
>     at org.apache.mahout.clustering.AbstractCluster.readFields(AbstractCluster.java:98)
>     at org.apache.mahout.clustering.DistanceMeasureCluster.readFields(DistanceMeasureCluster.java:53)
>     at org.apache.mahout.clustering.kmeans.Cluster.readFields(Cluster.java:70)
>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
>     at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:76)
>     at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:35)
>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
>     at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
>     at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
>     at org.apache.mahout.clustering.kmeans.KMeansUtil.configureWithClusterInfo(KMeansUtil.java:42)
>     at org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:57)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> attempt_201205031638_0165_m_000000_0: SLF4J: Class path contains multiple SLF4J bindings.
> attempt_201205031638_0165_m_000000_0: SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-0.20.2/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> attempt_201205031638_0165_m_000000_0: SLF4J: Found binding in [file:/mnt/secondary/hadoop/temp/taskTracker/jobcache/job_201205031638_0165/jars/org/slf4j/impl/StaticLoggerBinder.class]
> attempt_201205031638_0165_m_000000_0: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>
> Best,
>
> Alex

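As a reading aid for the exception in the quoted log: "Unknown flags set: %d [1000000]" looks like a precondition message whose %d placeholder was never filled in, followed by the offending flag byte printed in binary. The sketch below decodes that byte in Python. The flag names, bit values, and the four-bit limit are assumptions reconstructed from memory of Mahout's VectorWritable, not taken from this thread, so check them against the source of the version you are running.

```python
# ASSUMPTION: these constants mirror Mahout's VectorWritable flag layout as
# recalled from its source; verify against the Mahout version in use.
FLAG_DENSE = 0x01
FLAG_SEQUENTIAL = 0x02
FLAG_NAMED = 0x04
FLAG_LAX_PRECISION = 0x08
NUM_FLAGS = 4  # only the low four bits are assumed to be meaningful

KNOWN_FLAGS = [
    ("DENSE", FLAG_DENSE),
    ("SEQUENTIAL", FLAG_SEQUENTIAL),
    ("NAMED", FLAG_NAMED),
    ("LAX_PRECISION", FLAG_LAX_PRECISION),
]


def describe_flags(flags):
    """Return (names of known flags that are set, bits left above NUM_FLAGS)."""
    names = [name for name, bit in KNOWN_FLAGS if flags & bit]
    unknown = flags >> NUM_FLAGS
    return names, unknown


def check_flags(flags):
    """Hypothetical stand-in for the check that fails in the stack trace."""
    if flags >> NUM_FLAGS != 0:
        raise ValueError("Unknown flags set: " + format(flags, "b"))


# The value printed in the log, read as a binary string.
flags = int("1000000", 2)
names, unknown = describe_flags(flags)
print(flags, names, unknown)
```

Under these assumptions the byte is 64: none of the four known flags is set and a higher bit is. A byte like that was probably not written as a vector header by the same version of the class that is reading it, so stepping through readFields with a breakpoint is a reasonable way to see what is actually in the cluster file.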
--
Lance Norskog
[email protected]
