Format of K-means clusters from Hadoop

gabeweb Tue, 12 Oct 2010 23:10:04 -0700

The format of the output clusters produced by K-means clustering with
KMeansDriver appears to have changed between Mahout 0.3 and 0.4.  In 0.3,
once Hadoop finished running, each call to SequenceFile.Reader.next
returned a pair of (userID, clusterID) mapping a user to the cluster to
which that user belongs.  In 0.4 the format is different.  The Hadoop
output directory is now named "clusteredPoints" (I assume that running
KMeansDriver.run() with runClustering=true is the right thing to do here).
Within that directory, reading with SequenceFile.Reader.next, the "key"
values are almost certainly cluster IDs, but what are the values?  I would
expect some sort of cluster representation, but they are of type
WeightedVectorWritable and contain sparse vectors with values like "1.142".
Furthermore, there is more than one of these for each key.  So what is the
meaning of these (IntWritable, WeightedVectorWritable) output pairs?
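
To make the "more than one value per key" observation concrete, here is a
minimal sketch that groups records by key and counts them.  It uses plain
Java stand-ins only: the Entry class and the groupByKey method are
hypothetical names, not Mahout or Hadoop classes, and the sample numbers are
made up for illustration.  With real output, each Entry would presumably be
filled from one SequenceFile.Reader.next call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusterGrouping {

    /** Hypothetical stand-in for one (IntWritable, WeightedVectorWritable) record. */
    public static final class Entry {
        public final int key;                      // value of the IntWritable key
        public final double weight;                // weight carried by the value
        public final Map<Integer, Double> vector;  // sparse vector: index -> value

        public Entry(int key, double weight, Map<Integer, Double> vector) {
            this.key = key;
            this.weight = weight;
            this.vector = vector;
        }
    }

    /** Groups records by key, keeping keys in first-seen order. */
    public static Map<Integer, List<Entry>> groupByKey(List<Entry> entries) {
        Map<Integer, List<Entry>> groups = new LinkedHashMap<>();
        for (Entry e : entries) {
            groups.computeIfAbsent(e.key, k -> new ArrayList<>()).add(e);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up sample records; real ones would come from the sequence file.
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry(0, 1.0, Collections.singletonMap(3, 1.142)));
        entries.add(new Entry(1, 1.0, Collections.singletonMap(7, 0.5)));
        entries.add(new Entry(0, 1.0, Collections.singletonMap(4, 2.0)));

        // Print how many values were seen for each key.
        for (Map.Entry<Integer, List<Entry>> g : groupByKey(entries).entrySet()) {
            System.out.println("key " + g.getKey() + ": " + g.getValue().size() + " value(s)");
        }
    }
}
```

Counting values per key this way would at least show whether the number of
values per key lines up with the number of input vectors, which is one clue
to what the values represent.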


BTW, am I missing some documentation somewhere that explains this?  I don't
really mind sorting it out myself as it is educational, but if it exists
then I will use it.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Format-of-K-means-clusters-from-Hadoop-tp1692414p1692414.html
Sent from the Mahout User List mailing list archive at Nabble.com.
