Re: Having a devil of a time running k-means examples with Mahout 0.6 / Hadoop 0.20.2

Lance Norskog Wed, 09 May 2012 23:37:29 -0700

You can run these programs under Eclipse and use breakpoints.
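
A note on the exception in the quoted trace below, offered as a reading of the message and not a confirmed diagnosis. Guava's `Preconditions.checkArgument` only substitutes `%s`, so the `%d` is left literal and the argument is appended in brackets. The bracketed `1000000` therefore looks like the flag byte printed in binary, i.e. 64. The sketch below decodes it in Python. The flag names and bit positions are assumptions recalled from Mahout's `VectorWritable` and should be checked against the 0.6 source; `decode_flags` is a hypothetical helper, not part of Mahout.

```python
# Assumed flag layout, recalled from Mahout's VectorWritable (verify against
# the 0.6 source before relying on it).
KNOWN_FLAGS = {
    "DENSE": 0x01,
    "SEQUENTIAL": 0x02,
    "NAMED": 0x04,
    "LAX_PRECISION": 0x08,
}
NUM_FLAGS = 4  # assumed number of defined flag bits


def decode_flags(binary_text):
    """Decode a flag byte given as a binary string, e.g. "1000000".

    Returns (numeric value, names of known flags that are set,
    True if any bit above the known flags is set).
    """
    flags = int(binary_text, 2)
    names = [name for name, bit in KNOWN_FLAGS.items() if flags & bit]
    has_unknown = (flags >> NUM_FLAGS) != 0
    return flags, names, has_unknown


if __name__ == "__main__":
    # Value taken from "Unknown flags set: %d [1000000]" in the trace.
    print(decode_flags("1000000"))
```

Under these assumptions the byte has none of the known flags set and one bit above them, which would be consistent with the reader and the writer of the cluster file disagreeing about the record layout. Stepping through `VectorWritable.readFields` with a breakpoint would show what is actually being read.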

On Wed, May 9, 2012 at 2:13 PM, Alex Hasha <[email protected]> wrote:
> Hello all,
>
> We have not been able to get the reuters k-means clustering example to run
> without errors on our system for quite a while.  We are running hadoop
> 0.20.2 on a medium sized cluster, and have installed Mahout 0.6.
>
> The example shell scripts that were packaged with the release crashed and
> burned, so I have been following the step-by-step instructions for running
> k-means on a cluster that are scattered through Chapters 8, 9, and 11 of
> Mahout in Action.
>
> In particular, I've manually downloaded
> http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz,
> unpacked it to examples/reuters, and run
>
> $ mvn -e -q exec:java \
>     -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \
>     -Dexec.args="reuters/ reuters-extracted/"
>
> to extract the raw text files to reuters-extracted.  I then uploaded
> reuters-extracted/ to HDFS (/user/hadoop/mahout) and ran
>
> $ bin/mahout seqdirectory -c UTF-8 -i mahout/reuters-extracted/ \
>     -o mahout/reuters-seqfiles
>
> which seemed to run without error, and
>
> $ bin/mahout seq2sparse -i mahout/reuters-seqfiles/ \
>     -o mahout/reuters-vectors -ow
>
> which also seemed to run without error.
>
> There is nontrivial data in the reuters-vectors output directory:
>
> $ hadoop fs -du mahout/reuters-vectors
> Found 7 items
> 869751      hdfs://master:54310/user/hadoop/mahout/reuters-vectors/df-count
> 824086      hdfs://master:54310/user/hadoop/mahout/reuters-vectors/dictionary.file-0
> 844593      hdfs://master:54310/user/hadoop/mahout/reuters-vectors/frequency.file-0
> 17148933    hdfs://master:54310/user/hadoop/mahout/reuters-vectors/tf-vectors
> 16931936    hdfs://master:54310/user/hadoop/mahout/reuters-vectors/tfidf-vectors
> 15098540    hdfs://master:54310/user/hadoop/mahout/reuters-vectors/tokenized-documents
> 1018157     hdfs://master:54310/user/hadoop/mahout/reuters-vectors/wordcount
>
> And then I run k-means with the following command line:
>
> $ bin/mahout kmeans -i mahout/reuters-vectors/tfidf-vectors/ \
>     -c mahout/reuters-initial-clusters -o mahout/reuters-kmeans-clusters \
>     -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
>     -cd 1.0 -k 20 -x 20 -cl
>
> This is the invocation recommended in Mahout in Action.  Here is the
> output.  The error appears to relate to a problem with the binary format
> headers of one of the input files, so my debugging skills are exhausted at
> this point.  If anyone has solved a similar problem, I would very much
> appreciate a hint or two.
>
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using HADOOP_HOME=/home/hadoop/hadoop-0.20.2
> HADOOP_CONF_DIR=/home/hadoop/hadoop-0.20.2/conf
> MAHOUT-JOB:
> /home/hadoop/mahout-distribution-0.6/examples/target/mahout-examples-0.6-job.jar
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/home/hadoop/hadoop-0.20.2/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/home/hadoop/hadoop-0.20.2/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> 12/05/09 16:42:54 INFO common.AbstractJob: Command line arguments:
> {--clustering=null, --clusters=mahout/reuters-initial-clusters,
> --convergenceDelta=1.0,
> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> --endPhase=2147483647, --input=mahout/reuters-vectors/tfidf-vectors/,
> --maxIter=20, --method=mapreduce, --numClusters=20,
> --output=mahout/reuters-kmeans-clusters, --startPhase=0, --tempDir=temp}
> 12/05/09 16:42:54 INFO util.NativeCodeLoader: Loaded the native-hadoop
> library
> 12/05/09 16:42:54 INFO zlib.ZlibFactory: Successfully loaded & initialized
> native-zlib library
> 12/05/09 16:42:54 INFO compress.CodecPool: Got brand-new compressor
> 12/05/09 16:42:56 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to
> mahout/reuters-initial-clusters/part-randomSeed
> 12/05/09 16:42:56 INFO kmeans.KMeansDriver: Input:
> mahout/reuters-vectors/tfidf-vectors Clusters In:
> mahout/reuters-initial-clusters/part-randomSeed Out:
> mahout/reuters-kmeans-clusters Distance:
> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
> 12/05/09 16:42:56 INFO kmeans.KMeansDriver: convergence: 1.0 max
> Iterations: 20 num Reduce Tasks: org.apache.mahout.math.VectorWritable
> Input Vectors: {}
> 12/05/09 16:42:56 INFO kmeans.KMeansDriver: K-Means Iteration 1
> 12/05/09 16:42:58 INFO input.FileInputFormat: Total input paths to process
> : 1
> 12/05/09 16:42:58 INFO mapred.JobClient: Running job: job_201205031638_0165
> 12/05/09 16:42:59 INFO mapred.JobClient:  map 0% reduce 0%
> 12/05/09 16:43:14 INFO mapred.JobClient: Task Id :
> attempt_201205031638_0165_m_000000_0, Status : FAILED
> java.lang.IllegalArgumentException: Unknown flags set: %d [1000000]
>     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115)
>     at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:86)
>     at org.apache.mahout.math.VectorWritable.readVector(VectorWritable.java:190)
>     at org.apache.mahout.clustering.AbstractCluster.readFields(AbstractCluster.java:98)
>     at org.apache.mahout.clustering.DistanceMeasureCluster.readFields(DistanceMeasureCluster.java:53)
>     at org.apache.mahout.clustering.kmeans.Cluster.readFields(Cluster.java:70)
>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
>     at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:76)
>     at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:35)
>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
>     at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
>     at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
>     at org.apache.mahout.clustering.kmeans.KMeansUtil.configureWithClusterInfo(KMeansUtil.java:42)
>     at org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:57)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> attempt_201205031638_0165_m_000000_0: SLF4J: Class path contains multiple
> SLF4J bindings.
> attempt_201205031638_0165_m_000000_0: SLF4J: Found binding in
> [jar:file:/home/hadoop/hadoop-0.20.2/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> attempt_201205031638_0165_m_000000_0: SLF4J: Found binding in
> [file:/mnt/secondary/hadoop/temp/taskTracker/jobcache/job_201205031638_0165/jars/org/slf4j/impl/StaticLoggerBinder.class]
> attempt_201205031638_0165_m_000000_0: SLF4J: See
> http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>
> Best,
>
> Alex




-- 
Lance Norskog
[email protected]
