It looks like the probabilities vector returned by
AbstractClusteringPolicy.classify() has no non-zero elements. In that
case, the call to AbstractVector.maxValueIndex() in
AbstractClusteringPolicy.select() returns -1, and passing that index to
AbstractVector.set() causes the exception.
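To make that failure mode concrete, here is a minimal, self-contained sketch.
It is not Mahout's source: maxValueIndex() and select() below are
hypothetical stand-ins on plain double[] arrays, written on the assumption
that the arg-max only inspects non-zero elements and starts from a -1
"nothing found" sentinel.

```java
class MaxValueIndexSketch {

    // Hypothetical stand-in for an arg-max that only looks at non-zero
    // elements. This is an assumption for illustration, not Mahout's code.
    public static int maxValueIndex(double[] values) {
        int result = -1; // sentinel: no non-zero element seen yet
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < values.length; i++) {
            if (values[i] == 0.0) {
                continue; // zeros are skipped, as in a sparse iteration
            }
            if (values[i] > max) {
                max = values[i];
                result = i;
            }
        }
        return result;
    }

    // Hypothetical stand-in for select(): put weight 1.0 on the winner.
    public static double[] select(double[] probabilities) {
        int winner = maxValueIndex(probabilities);
        double[] weights = new double[probabilities.length];
        // With winner == -1 this line throws ArrayIndexOutOfBoundsException,
        // the plain-array analogue of the IndexException in the trace.
        weights[winner] = 1.0;
        return weights;
    }

    public static void main(String[] args) {
        System.out.println(maxValueIndex(new double[] {0.2, 0.7, 0.1})); // prints 1
        System.out.println(maxValueIndex(new double[20]));               // prints -1
    }
}
```

Calling select(new double[20]) in this sketch fails on the write to index -1,
which is the same shape as "Index -1 is outside allowable range of [0,20)".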
How could this happen? I'm not exactly sure, but consider that the
probabilities vector is calculated in
AbstractClusteringPolicy.classify() by calling
DistanceMeasureCluster.pdf() on each of the prior clusters in
b3/kmeans-clusters/clusters-0. With a CosineDistanceMeasure I don't see
how this could ever return zero. Certainly, some of your vectors will
match the prior cluster centers exactly (they were sampled from the
input), and for those pdf() would return 1. Even if the cosine distance
were 1, the pdf would be 0.5.
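As a rough check of that arithmetic, the sketch below assumes the pdf has
the form 1 / (1 + distance), which is consistent with the two values above
(distance 0 gives 1, distance 1 gives 0.5). The class and method names are
made up for illustration, and cosineDistance() here is a plain textbook
version that assumes both vectors are non-zero; it is not Mahout's
CosineDistanceMeasure.

```java
class CosinePdfSketch {

    // Textbook cosine distance: 1 - (a . b) / (|a| * |b|).
    // Assumes neither vector is all zeros.
    public static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    // Assumed pdf form: 1 / (1 + distance). Strictly positive for any
    // finite, non-negative distance.
    public static double pdf(double[] point, double[] center) {
        return 1.0 / (1.0 + cosineDistance(point, center));
    }

    public static void main(String[] args) {
        double[] center = {3.0, 4.0};
        // Point identical to the center: distance 0, so pdf is 1.
        System.out.println(pdf(new double[] {3.0, 4.0}, center));
        // Orthogonal vectors: distance 1, so pdf is 0.5.
        System.out.println(pdf(new double[] {1.0, 0.0}, new double[] {0.0, 1.0}));
    }
}
```

Under these assumptions the pdf stays well above zero for any pair of
non-zero vectors, which is why an all-zero probabilities vector is
surprising.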
Some things to try:
- Have you verified that your input vectors actually have data in
them?
- Can you run the cluster dumper on the b3/kmeans-clusters/clusters-0
contents?
- Is it possible to run the sequential version (-xm sequential)? If it
is, you could run it in a debugger to gain more insight.
Jeff
On 6/4/12 12:05 PM, Pat Ferrel wrote:
Using the CLI to run kmeans from several trunk versions, I get an error
I don't understand. When the job died, the
b3/canopy-centroids/clusters-0-final directory contained the
random-seeds file generated by the kmeans driver, and
b3/kmeans-clusters/clusters-0 had several part files, but
b3/kmeans-clusters/clusters-1 was empty.
When I look through the code from the trace, it doesn't make much sense.
Command line:
mahout kmeans
-i b3/vectors/tfidf-vectors/
-k 20
-c b3/canopy-centroids/clusters-0-final
-cl
-o b3/kmeans-clusters
-ow
-cd 0.01
-x 30
-dm org.apache.mahout.common.distance.CosineDistanceMeasure
Error:
12/06/04 07:55:03 INFO common.AbstractJob: Command line arguments:
{--clustering=null, --clusters=[b3/canopy-centroids/clusters-0-final],
--convergenceDelta=[0.01],
--distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure],
--endPhase=[2147483647], --input=[b3/vectors/tfidf-vectors/],
--maxIter=[30], --method=[mapreduce], --numClusters=[20],
--output=[b3/kmeans-clusters], --overwrite=null, --startPhase=[0],
--tempDir=[temp]}
2012-06-04 07:55:03.752 java[67308:1903] Unable to load realm info
from SCDynamicStore
12/06/04 07:55:03 INFO common.HadoopUtil: Deleting
b3/canopy-centroids/clusters-0-final
12/06/04 07:55:04 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new compressor
12/06/04 07:55:04 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to
b3/canopy-centroids/clusters-0-final/part-randomSeed
12/06/04 07:55:04 INFO kmeans.KMeansDriver: Input:
b3/vectors/tfidf-vectors Clusters In:
b3/canopy-centroids/clusters-0-final/part-randomSeed Out:
b3/kmeans-clusters Distance:
org.apache.mahout.common.distance.CosineDistanceMeasure
12/06/04 07:55:04 INFO kmeans.KMeansDriver: convergence: 0.01 max
Iterations: 30 num Reduce Tasks: org.apache.mahout.math.VectorWritable
Input Vectors: {}
12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new decompressor
Cluster Iterator running iteration 1 over priorPath:
b3/kmeans-clusters/clusters-0
12/06/04 07:55:05 INFO input.FileInputFormat: Total input paths to
process : 1
12/06/04 07:55:05 INFO mapred.JobClient: Running job: job_local_0001
12/06/04 07:55:06 INFO mapred.MapTask: io.sort.mb = 100
12/06/04 07:55:08 INFO mapred.MapTask: data buffer = 79691776/99614720
12/06/04 07:55:08 INFO mapred.MapTask: record buffer = 262144/327680
12/06/04 07:55:08 INFO mapred.JobClient: map 0% reduce 0%
12/06/04 07:55:09 WARN mapred.LocalJobRunner: job_local_0001
org.apache.mahout.math.IndexException: Index -1 is outside allowable range of [0,20)
at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:439)
at org.apache.mahout.clustering.iterator.AbstractClusteringPolicy.select(AbstractClusteringPolicy.java:44)
at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:52)
at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
12/06/04 07:55:09 INFO mapred.JobClient: Job complete: job_local_0001
12/06/04 07:55:09 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.lang.InterruptedException: Cluster Iteration 1 failed processing b3/kmeans-clusters/clusters-1
at org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:186)
at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:229)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:149)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:108)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)