Re: Mahout KMeans Output

Blake Lemoine Fri, 12 Aug 2011 12:14:41 -0700

I'm not exactly sure what you're trying to achieve here.  Are you interested
in extracting an ID for the Vector, or an ID for the cluster to which a
Vector belongs?  If you want the cluster ID, the key of each record in the
KMeans output file is the cluster ID.  If you want a Vector ID, it's tricky
with a plain VectorWritable.  It's much easier to wrap a NamedVector inside
of a VectorWritable.  Then in your mapper(s) and reducer(s) you can take the
vector contained in the WeightedVectorWritable, cast it to a NamedVector, and
extract the name that you originally assigned to it.  In the application I'm
working on, I'm clustering key phrases, so the phrase itself serves as an
excellent name for each vector.
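
To make the wrap-and-cast idea concrete, here is a rough sketch.  Note that
the nested classes are simplified stand-ins I wrote so the example runs on
its own; they are NOT the real Mahout classes, and their shapes (including
WeightedVectorWritable holding a VectorWritable) are assumptions that may not
match your Mahout version.  ClusteredPoint, namesByCluster and sampleRecords
are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NamedVectorSketch {

    // --- Simplified stand-ins for the Mahout types (assumptions, not the real API) ---
    interface Vector {
        double get(int index);
        int size();
    }

    static class DenseVector implements Vector {
        private final double[] values;
        DenseVector(double... values) { this.values = values.clone(); }
        public double get(int index) { return values[index]; }
        public int size() { return values.length; }
    }

    // Decorates another vector with the name assigned when the input was built.
    static class NamedVector implements Vector {
        private final Vector delegate;
        private final String name;
        NamedVector(Vector delegate, String name) {
            this.delegate = delegate;
            this.name = name;
        }
        String getName() { return name; }
        public double get(int index) { return delegate.get(index); }
        public int size() { return delegate.size(); }
    }

    static class VectorWritable {
        private final Vector vector;
        VectorWritable(Vector vector) { this.vector = vector; }
        Vector get() { return vector; }
    }

    static class WeightedVectorWritable {
        private final double weight;
        private final VectorWritable vector;
        WeightedVectorWritable(double weight, VectorWritable vector) {
            this.weight = weight;
            this.vector = vector;
        }
        double getWeight() { return weight; }
        VectorWritable getVector() { return vector; }
    }

    // Hypothetical holder for one output record: key = cluster ID, value = the point.
    static class ClusteredPoint {
        final int clusterId;
        final WeightedVectorWritable value;
        ClusteredPoint(int clusterId, WeightedVectorWritable value) {
            this.clusterId = clusterId;
            this.value = value;
        }
    }

    // Groups vector names by cluster ID; vectors that carry no name are skipped.
    static Map<Integer, List<String>> namesByCluster(List<ClusteredPoint> records) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        for (ClusteredPoint record : records) {
            Vector vector = record.value.getVector().get();
            if (vector instanceof NamedVector) {
                String name = ((NamedVector) vector).getName();
                result.computeIfAbsent(record.clusterId, k -> new ArrayList<>()).add(name);
            }
        }
        return result;
    }

    // Made-up sample data: three named key-phrase vectors and one unnamed vector.
    static List<ClusteredPoint> sampleRecords() {
        return Arrays.asList(
            point(0, new NamedVector(new DenseVector(1.0, 0.0), "machine learning")),
            point(1, new NamedVector(new DenseVector(0.0, 1.0), "stock market")),
            point(0, new NamedVector(new DenseVector(0.9, 0.1), "data mining")),
            point(1, new DenseVector(0.1, 0.8)));
    }

    private static ClusteredPoint point(int clusterId, Vector vector) {
        return new ClusteredPoint(clusterId,
            new WeightedVectorWritable(1.0, new VectorWritable(vector)));
    }

    public static void main(String[] args) {
        System.out.println(namesByCluster(sampleRecords()));
    }
}
```

The instanceof check is there because a plain vector has no name to recover,
which is why the naming has to happen when the input SequenceFile is built.
In a real job the records would come from the KMeans output rather than from
an in-memory list.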


Hope this was helpful,
Blake Lemoine

On Fri, Aug 12, 2011 at 1:43 PM, Eshwaran Vijaya Kumar <
[email protected]> wrote:

> I am using KMeans as part of a long pipeline. Suppose I give KMeans a
> SequenceFile containing Key as IntWritable and value as VectorWritable where
> the Keys are IDs for the Vectors, is there a utility or an option to get
> KMeans to spit out the IDs that belong to a cluster rather than the
> WeightedVectorWritable bean?
>
> Thanks
> Esh
