The format of the output clusters from K-means clustering with KMeansDriver appears to have changed between 0.3 and 0.4. In 0.3, once Hadoop is done running, each call to SequenceFile.Reader.next returns a (userID, clusterID) pair mapping a user to the cluster that user belongs to.

In 0.4 the format is different. The Hadoop output directory is now named "clusteredPoints" (I assume that running KMeansDriver.run() with runClustering=true is the right thing to do here). Reading that directory with SequenceFile.Reader.next, the keys are almost certainly cluster IDs, but what are the values? I would expect some sort of cluster representation, but they are of type WeightedVectorWritable and contain sparse vectors with values like "1.142". Furthermore, there is more than one of them for each key.

So what do these (IntWritable, WeightedVectorWritable) output pairs actually mean?
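
To make the question concrete, here is a minimal sketch of one interpretation I am guessing at: each record is a single input point tagged with the ID of the cluster it was assigned to, which would explain why a key repeats. This is an assumption, not something I have confirmed. The Record class, the groupByCluster helper, and the sample values are hypothetical stand-ins for the (IntWritable, WeightedVectorWritable) pairs; they are not Mahout or Hadoop classes, and the code does not read a real SequenceFile.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClusteredPointsSketch {

    // Hypothetical stand-in for one (IntWritable, WeightedVectorWritable) record.
    static final class Record {
        final int clusterId;   // the key: assumed to be a cluster ID
        final double weight;   // assumed meaning: weight of the point in that cluster
        final double[] point;  // assumed meaning: the input vector itself, not a centroid

        Record(int clusterId, double weight, double[] point) {
            this.clusterId = clusterId;
            this.weight = weight;
            this.point = point;
        }
    }

    // Groups records by key, so a key that repeats N times yields a list of N points.
    static Map<Integer, List<Record>> groupByCluster(List<Record> records) {
        Map<Integer, List<Record>> byCluster = new TreeMap<>();
        for (Record r : records) {
            byCluster.computeIfAbsent(r.clusterId, k -> new ArrayList<>()).add(r);
        }
        return byCluster;
    }

    public static void main(String[] args) {
        // Made-up sample data in the shape described above.
        List<Record> records = new ArrayList<>();
        records.add(new Record(0, 1.0, new double[] {1.142, 0.0}));
        records.add(new Record(0, 1.0, new double[] {0.0, 2.5}));
        records.add(new Record(1, 1.0, new double[] {3.0, 0.0}));

        for (Map.Entry<Integer, List<Record>> e : groupByCluster(records).entrySet()) {
            System.out.println("cluster " + e.getKey() + ": " + e.getValue().size() + " point(s)");
        }
    }
}
```

If that reading is right, recovering the 0.3-style (userID, clusterID) mapping would additionally require each vector to carry its user ID, which this sketch does not model.
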
BTW, am I missing some documentation somewhere that explains this? I don't really mind sorting it out myself as it is educational, but if it exists then I will use it.

--
View this message in context: http://lucene.472066.n3.nabble.com/Format-of-K-means-clusters-from-Hadoop-tp1692414p1692414.html
Sent from the Mahout User List mailing list archive at Nabble.com.
