Naturally, it happens to me all the time. Here's a link to the k-Means
algorithm page in the wiki
(https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering)
where, down in the middle, just before Examples, it says:
After running the algorithm, the output directory will contain:
1. clusters-N: directories containing SequenceFiles(Text, Cluster)
produced by the algorithm for each iteration. The Text /key/ is a
cluster identifier string.
2. clusteredPoints: (if --clustering enabled) a directory containing
SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable
/key/ is the clusterId. The WeightedVectorWritable /value/ is a
bean containing a double /weight/ and a VectorWritable /vector/
where the weight indicates the probability that the vector is a
member of the cluster. For k-Means clustering, the weights are all
1.0 since the algorithm selects only a single, most likely cluster
for each point.
These output formats have changed since 0.3, though, as you observed. We did
this to improve usability and uniformity between the clustering algorithms.
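
To make the clusteredPoints layout described above concrete, here is a minimal
sketch in plain Java. It does not use the Mahout or Hadoop classes: Record and
groupByCluster are hypothetical stand-ins I made up for illustration, with
Record mirroring one (IntWritable key, WeightedVectorWritable value) pair. In
real code you would read the pairs from the file with Hadoop's
SequenceFile.Reader instead of building them by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClusteredPointsSketch {

    /** Hypothetical stand-in for one clusteredPoints record (not a Mahout class). */
    static final class Record {
        final int clusterId;    // plays the role of the IntWritable key
        final double weight;    // membership probability; always 1.0 for k-Means
        final double[] vector;  // the original input vector for this point

        Record(int clusterId, double weight, double[] vector) {
            this.clusterId = clusterId;
            this.weight = weight;
            this.vector = vector;
        }
    }

    /** Groups records by cluster id, keeping file order within each cluster. */
    static Map<Integer, List<Record>> groupByCluster(List<Record> records) {
        Map<Integer, List<Record>> byCluster = new TreeMap<>();
        for (Record r : records) {
            byCluster.computeIfAbsent(r.clusterId, k -> new ArrayList<>()).add(r);
        }
        return byCluster;
    }

    public static void main(String[] args) {
        // Made-up sample data: three points assigned to two clusters.
        List<Record> records = Arrays.asList(
            new Record(0, 1.0, new double[] {1.0, 2.0}),
            new Record(3, 1.0, new double[] {9.0, 8.0}),
            new Record(0, 1.0, new double[] {1.5, 2.5}));

        for (Map.Entry<Integer, List<Record>> e : groupByCluster(records).entrySet()) {
            System.out.println("cluster " + e.getKey() + ": "
                + e.getValue().size() + " point(s)");
        }
    }
}
```

The point is only that the key identifies the cluster and the value carries the
point's own vector, so grouping by key gives you the members of each cluster.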
On 10/12/10 11:28 PM, gabeweb wrote:
As per the First Law of Email, as soon as I sent the previous post I figured
it out -- I think. The index of the pair is the index of the point (I was
saying "user" below, but that's just my use case) being clustered, the key
is the output cluster index, and the value is the original vector associated
with that point (that should have been obvious). Is that right?