+1. To add a little more detail: I suggest you create NamedVectors in a sequence file of [Text, VectorWritable] format. Build your term vectors from your term indices and associated weights, and wrap each one in a NamedVector carrying the document name. The Text key is not used, so it does not matter what you provide. KMeans will process this set of vectors and give you k clusters of documents. Be sure to include the -cl argument so that you also get the classified documents produced in the clusterData processing step. The "clusteredPoints" output directory will contain a sequence file of [IntWritable, WeightedVectorWritable], where the key is the clusterId and the value is a weighted VectorWritable containing your original NamedVector. Since KMeans makes hard, all-or-nothing assignments, the weights will all be 1. Try FuzzyKMeans and Dirichlet too and you will get fractional weights; they both accept the same inputs and produce the same outputs. The ClusterDumper will give you human-readable results. See the TestClusterDumper unit test for the incantations.
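In case it helps, here is a minimal sketch of turning a per-document <term, weight> map into an indexed vector of the shape KMeans consumes. It runs with plain Java; the Mahout/Hadoop wrapping (RandomAccessSparseVector, NamedVector, VectorWritable, SequenceFile.Writer) is shown only in comments since it needs those jars on the classpath, and the class names there are from memory of the 0.5-era API, so check them against your version. The document name and weights are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WeightedVectors {
    public static void main(String[] args) {
        // One document's <term, weight> map (your own weights, not tf/tf-idf).
        Map<String, Double> doc = new LinkedHashMap<>();
        doc.put("mahout", 0.8);
        doc.put("kmeans", 0.5);

        // Pass 1: assign each term a stable index (in practice, built once
        // over the whole corpus so all documents share the same dictionary).
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String term : doc.keySet()) {
            dictionary.putIfAbsent(term, dictionary.size());
        }

        // Pass 2: place each weight at its term's index.
        double[] vector = new double[dictionary.size()];
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            vector[dictionary.get(e.getKey())] = e.getValue();
        }

        // With Mahout on the classpath you would do roughly:
        //   Vector v = new RandomAccessSparseVector(dictionary.size());
        //   // for each (term, w): v.set(dictionary.get(term), w);
        //   NamedVector named = new NamedVector(v, "doc-1");
        //   writer.append(new Text("doc-1"), new VectorWritable(named));
        // where writer is a SequenceFile.Writer keyed [Text, VectorWritable].

        System.out.println(vector[dictionary.get("mahout")]);
    }
}
```

The point is just that the dictionary gives every term a fixed position, so your custom weights slot in exactly where term counts would have gone.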
Smooth sailing,
Jeff

-----Original Message-----
From: Hector Yee [mailto:[email protected]]
Sent: Monday, June 13, 2011 11:27 PM
To: [email protected]
Cc: [email protected]
Subject: Re: Another beginners question

It's just a sequence file, so you can create a sequence file writer and output your own weights.

Sent from my iPad

On Jun 13, 2011, at 9:52 PM, sharath jagannath <[email protected]> wrote:

> Hey All,
>
> I intend to build a KMeans clustering for my documents but do not want to
> use the tf/tf-idf based vectors.
> I have a map of <term, weight> associated with each document.
> I want to use these weights in place of the term counts for my computation.
> I was having a look at the DocumentProcessor class; it is primarily driven by the
> term count.
> I am wondering whether I need to do something on my own or whether there is
> built-in support for this.
>
> --
> Thanks,
> Sharath
