I'm not exactly sure what you're trying to achieve here. Do you want an ID for each Vector, or an ID for the cluster to which a Vector belongs?

If you want the cluster ID, the key of each record in the KMeans output file is the cluster ID. If you want a Vector ID, that's very tricky with a plain VectorWritable. It's much easier to wrap a NamedVector inside the VectorWritable. Then in your mapper(s) and reducer(s) you can cast the vector contained in the WeightedVectorWritable to a NamedVector and extract the name that you originally assigned to it. In the application I'm working on I'm clustering key phrases, so the phrase itself serves as an excellent name for each vector.
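Here's a rough sketch of the pattern, in case it helps. To keep it runnable without Mahout on the classpath, the nested classes below are simplified stand-ins that only mirror the shape of Mahout's Vector, DenseVector, NamedVector and WeightedVectorWritable; they are not the real classes. I'm assuming accessors named getName() and getVector(), so check them against the Mahout version you're on. The helpers clusterPoint and nameOf are hypothetical names of my own.

```java
class NamedVectorSketch {

    // Stand-in for Mahout's Vector type (illustration only).
    interface Vector {
        double[] values();
    }

    // Stand-in for a plain, unnamed vector.
    static class DenseVector implements Vector {
        private final double[] values;
        DenseVector(double[] values) { this.values = values; }
        public double[] values() { return values; }
    }

    // Stand-in for NamedVector: wraps another vector and carries a name.
    static class NamedVector implements Vector {
        private final Vector delegate;
        private final String name;
        NamedVector(Vector delegate, String name) {
            this.delegate = delegate;
            this.name = name;
        }
        String getName() { return name; }
        public double[] values() { return delegate.values(); }
    }

    // Stand-in for the WeightedVectorWritable values in the clustered-points output.
    static class WeightedVectorWritable {
        private final double weight;
        private final Vector vector;
        WeightedVectorWritable(double weight, Vector vector) {
            this.weight = weight;
            this.vector = vector;
        }
        double getWeight() { return weight; }
        Vector getVector() { return vector; }
    }

    // Hypothetical helper: build a clustered point whose vector carries a name.
    static WeightedVectorWritable clusterPoint(String name, double[] values) {
        return new WeightedVectorWritable(1.0, new NamedVector(new DenseVector(values), name));
    }

    // Hypothetical helper: recover the name assigned before clustering,
    // or null if the vector was never wrapped in a NamedVector.
    static String nameOf(WeightedVectorWritable point) {
        Vector v = point.getVector();
        if (v instanceof NamedVector) {
            return ((NamedVector) v).getName();
        }
        return null;
    }

    public static void main(String[] args) {
        WeightedVectorWritable point = clusterPoint("machine learning", new double[] {1.0, 0.0, 2.5});
        System.out.println(nameOf(point));
    }
}
```

The instanceof check is there because a plain vector carries no name, so the cast would fail for any input that wasn't wrapped before clustering.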
Hope this was helpful,
Blake Lemoine

On Fri, Aug 12, 2011 at 1:43 PM, Eshwaran Vijaya Kumar <[email protected]> wrote:
> I am using KMeans as part of a long pipeline. Suppose I give Kmeans a
> SequenceFile containing Key as IntWritable and value as VectorWritable where
> the Keys are IDs for the Vectors, is there a utility or an option to get
> KMeans to spit out the IDs that belong to a cluster rather than the
> WeightedVectorWritable bean?
>
> Thanks
> Esh
