It looks like the probabilities vector returned by AbstractClusteringPolicy.classify() has no non-zero elements. In this case, AbstractClusteringPolicy.select()'s call to AbstractVector.maxValueIndex() is returning -1 and that is causing the exception.

How could this happen? I'm not exactly sure, but consider that the probabilities vector is calculated in AbstractClusteringPolicy.classify() by calling DistanceMeasureCluster.pdf() on each of the prior clusters in b3/kmeans-clusters/clusters-0. With a CosineDistanceMeasure I don't see how this could ever return zero. Some of your vectors will certainly match the prior cluster centers exactly (the centers were sampled from the input), and for those the pdf would be 1. Even if the cosine distance were 1, the pdf would be 0.5.
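To make the arithmetic above concrete, here is a minimal standalone sketch. It does not call Mahout. The pdf form 1/(1+distance) is an assumption chosen to match the values quoted above (distance 0 gives 1, distance 1 gives 0.5), and maxValueIndex() here is a simplified, hypothetical stand-in that only considers non-zero entries, so it models the "no non-zero elements" case rather than reproducing AbstractVector's actual code.

```java
public class PdfSketch {
    // Assumed pdf form: 1 / (1 + distance). This matches the values quoted
    // in the message (distance 0 -> 1.0, distance 1 -> 0.5).
    static double pdf(double distance) {
        return 1.0 / (1.0 + distance);
    }

    // Hypothetical, simplified stand-in for maxValueIndex(): scans only the
    // non-zero entries and returns -1 when none of them beats the start value.
    static int maxValueIndex(double[] v) {
        int result = -1;
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < v.length; i++) {
            if (v[i] != 0.0 && v[i] > max) {
                max = v[i];
                result = i;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pdf(0.0));                                      // 1.0
        System.out.println(pdf(1.0));                                      // 0.5
        System.out.println(maxValueIndex(new double[] {0.5, 1.0, 0.25}));  // 1
        System.out.println(maxValueIndex(new double[20]));                 // -1
    }
}
```

With 20 prior clusters, an index of -1 from the last call is exactly what would trip the "outside allowable range of [0,20)" check when it is passed to set().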

Some things to try:
- Have you verified the contents of your input vectors actually have data in them?
- Can you run the cluster dumper on the b3/kmeans-clusters/clusters-0 contents?
- Is it possible to run the sequential version (-xm sequential)? If it is you could run it in a debugger to gain more insight.
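One reason the first check is worth doing, sketched below under a loudly stated assumption: if the cosine distance were computed with the plain textbook formula and no guard for zero-length vectors (I have not verified what CosineDistanceMeasure does in that case), an empty input vector would produce NaN rather than a number, and NaN never wins a greater-than comparison. The class and method names are hypothetical and nothing here calls Mahout.

```java
public class EmptyVectorSketch {
    // Textbook cosine distance, 1 - (a . b) / (|a| * |b|), with NO zero-norm
    // guard. This is an assumption for illustration, not Mahout's code.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] emptyInput = new double[3];   // a vector with no data in it
        double[] center = {1.0, 2.0, 3.0};     // an arbitrary cluster center

        double distance = cosineDistance(emptyInput, center); // 0/0 -> NaN
        double pdf = 1.0 / (1.0 + distance);                  // still NaN

        System.out.println(Double.isNaN(distance)); // true
        System.out.println(pdf > 0.0);              // false: NaN fails every comparison
    }
}
```

If something like this is happening, dumping a few input vectors should show entries with no elements.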

Jeff

On 6/4/12 12:05 PM, Pat Ferrel wrote:
Using the kmeans CLI from several trunk versions, I get an error I don't understand. When the job died, b3/canopy-centroids/clusters-0-final contained the random-seeds file generated by the kmeans driver, and b3/kmeans-clusters/clusters-0 had several part files, but b3/kmeans-clusters/clusters-1 was empty. When I look through the code from the trace, it doesn't make much sense.

Command line:
mahout kmeans
  -i b3/vectors/tfidf-vectors/
  -k 20
  -c b3/canopy-centroids/clusters-0-final
  -cl
  -o b3/kmeans-clusters
  -ow
  -cd 0.01
  -x 30
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure

Error:
12/06/04 07:55:03 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=[b3/canopy-centroids/clusters-0-final], --convergenceDelta=[0.01], --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure], --endPhase=[2147483647], --input=[b3/vectors/tfidf-vectors/], --maxIter=[30], --method=[mapreduce], --numClusters=[20], --output=[b3/kmeans-clusters], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
2012-06-04 07:55:03.752 java[67308:1903] Unable to load realm info from SCDynamicStore
12/06/04 07:55:03 INFO common.HadoopUtil: Deleting b3/canopy-centroids/clusters-0-final
12/06/04 07:55:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new compressor
12/06/04 07:55:04 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to b3/canopy-centroids/clusters-0-final/part-randomSeed
12/06/04 07:55:04 INFO kmeans.KMeansDriver: Input: b3/vectors/tfidf-vectors Clusters In: b3/canopy-centroids/clusters-0-final/part-randomSeed Out: b3/kmeans-clusters Distance: org.apache.mahout.common.distance.CosineDistanceMeasure
12/06/04 07:55:04 INFO kmeans.KMeansDriver: convergence: 0.01 max Iterations: 30 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new decompressor
Cluster Iterator running iteration 1 over priorPath: b3/kmeans-clusters/clusters-0
12/06/04 07:55:05 INFO input.FileInputFormat: Total input paths to process : 1
12/06/04 07:55:05 INFO mapred.JobClient: Running job: job_local_0001
12/06/04 07:55:06 INFO mapred.MapTask: io.sort.mb = 100
12/06/04 07:55:08 INFO mapred.MapTask: data buffer = 79691776/99614720
12/06/04 07:55:08 INFO mapred.MapTask: record buffer = 262144/327680
12/06/04 07:55:08 INFO mapred.JobClient:  map 0% reduce 0%
12/06/04 07:55:09 WARN mapred.LocalJobRunner: job_local_0001
org.apache.mahout.math.IndexException: Index -1 is outside allowable range of [0,20)
    at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:439)
    at org.apache.mahout.clustering.iterator.AbstractClusteringPolicy.select(AbstractClusteringPolicy.java:44)
    at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:52)
    at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
12/06/04 07:55:09 INFO mapred.JobClient: Job complete: job_local_0001
12/06/04 07:55:09 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.lang.InterruptedException: Cluster Iteration 1 failed processing b3/kmeans-clusters/clusters-1
    at org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:186)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:229)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:149)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:108)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:49)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)





