Re: Format of K-means clusters from Hadoop

Jeff Eastman Wed, 13 Oct 2010 06:45:02 -0700

Naturally, it happens to me all the time. Here's a link to the k-Means algorithm page in the wiki (https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering) where, down in the middle, before Examples, it says:


After running the algorithm, the output directory will contain:


  1. clusters-N: directories containing SequenceFiles(Text, Cluster)
     produced by the algorithm for each iteration. The Text /key/ is a
     cluster identifier string.
  2. clusteredPoints: (if --clustering enabled) a directory containing
     SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable
     /key/ is the clusterId. The WeightedVectorWritable /value/ is a
     bean containing a double /weight/ and a VectorWritable /vector/
     where the weight indicates the probability that the vector is a
     member of the cluster. For k-Means clustering, the weights are all
     1.0 since the algorithm selects only a single, most likely cluster
     for each point.

But these things have changed from 0.3 as you observed. We did this to improve usability and uniformity between the clustering algorithms.
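
To make the clusteredPoints layout described above concrete, here is a minimal Python sketch that groups points by cluster. It does not read SequenceFiles or call any Mahout API. The `records` list and the `group_by_cluster` helper are hypothetical stand-ins, with each tuple mirroring one (IntWritable key, WeightedVectorWritable value) pair as (cluster_id, (weight, vector)).

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for records read from clusteredPoints.
# Each entry mirrors one (IntWritable key, WeightedVectorWritable value) pair:
#   (cluster_id, (weight, vector))
records = [
    (0, (1.0, [1.0, 1.0])),
    (0, (1.0, [1.5, 2.0])),
    (1, (1.0, [8.0, 9.0])),
]


def group_by_cluster(records):
    """Collect the vectors assigned to each cluster id."""
    clusters = defaultdict(list)
    for cluster_id, (weight, vector) in records:
        # The weight is unpacked but not used: for k-Means it is always 1.0,
        # because each point is assigned to a single cluster.
        clusters[cluster_id].append(vector)
    return dict(clusters)


grouped = group_by_cluster(records)
for cluster_id in sorted(grouped):
    print(cluster_id, grouped[cluster_id])
```

With real output, the tuples would come from iterating the SequenceFile in the clusteredPoints directory rather than from a hard-coded list.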



On 10/12/10 11:28 PM, gabeweb wrote:

As per the First Law of Email, as soon as I sent the previous post I figured
it out -- I think.  The index of the pair is the index of the point (I was
saying "user" below, but that's just my use case) being clustered, the key
is the output cluster index, and the value is the original vector associated
with that point (that should have been obvious).  Is that right?
