+1 To add a little more detail, I suggest you create NamedVectors in a sequence 
file of [Text, VectorWritable] format. Build your term vectors using your term 
indices and associated weights, and wrap each one in a NamedVector with the 
document name. The Text key is not used downstream, so it does not matter what 
you provide. KMeans will process this set of vectors and give you k clusters of 
documents. Be sure to include the -cl argument so that you also get the 
classified documents produced in the clusterData processing step. The 
"clusteredPoints" output directory will contain a sequence file of [IntWritable, 
WeightedVectorWritable], where the key is the clusterId and the value is a 
WeightedVectorWritable containing your original NamedVector. Since KMeans is a 
maximum-likelihood algorithm, the weights will all be 1. Try FuzzyK and 
Dirichlet too and you will get fractional weights; they both accept the same 
inputs and produce the same outputs. The ClusterDumper will give you 
human-readable results. See the TestClusterDumper unit test for the incantations.
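If it helps, reading the clusteredPoints output back looks roughly like this. 
This is only a sketch against the 0.5-era Mahout/Hadoop APIs; the output path 
and part-file name are placeholders for whatever your run actually produces:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.WeightedVectorWritable;
import org.apache.mahout.math.NamedVector;

public class ReadClusteredPoints {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder path; use your actual clusterData output directory.
    Path path = new Path("output/clusteredPoints/part-m-00000");
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, conf, path);
    try {
      IntWritable clusterId = new IntWritable();
      WeightedVectorWritable value = new WeightedVectorWritable();
      while (reader.next(clusterId, value)) {
        // With KMeans the weight is always 1.0; FuzzyK and Dirichlet
        // produce fractional weights here instead.
        NamedVector vector = (NamedVector) value.getVector();
        System.out.println(clusterId.get() + "\t" + vector.getName()
            + "\t" + value.getWeight());
      }
    } finally {
      reader.close();
    }
  }
}
```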

Smooth sailing,
Jeff

-----Original Message-----
From: Hector Yee [mailto:[email protected]] 
Sent: Monday, June 13, 2011 11:27 PM
To: [email protected]
Cc: [email protected]
Subject: Re: Another beginners question

It's just a sequence file, so you can create a SequenceFile writer yourself and 
output your own weights.
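A minimal sketch of that approach, assuming Mahout 0.5-era APIs with Hadoop on 
the classpath; the method name, the term-index map shape, and the vector 
cardinality are my own illustration, not anything from DocumentProcessor:

```java
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class CustomWeightVectors {
  // Write one NamedVector per document, using your own term weights
  // in place of tf/tf-idf counts. docs maps docName -> (termIndex -> weight).
  public static void writeVectors(Map<String, Map<Integer, Double>> docs,
                                  int cardinality, Path out) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer =
        new SequenceFile.Writer(fs, conf, out, Text.class, VectorWritable.class);
    try {
      for (Map.Entry<String, Map<Integer, Double>> doc : docs.entrySet()) {
        Vector v = new RandomAccessSparseVector(cardinality);
        for (Map.Entry<Integer, Double> term : doc.getValue().entrySet()) {
          v.setQuick(term.getKey(), term.getValue()); // your weight, not a count
        }
        // Wrap in a NamedVector so the document name survives clustering;
        // the Text key itself is ignored by the clustering jobs.
        NamedVector named = new NamedVector(v, doc.getKey());
        writer.append(new Text(doc.getKey()), new VectorWritable(named));
      }
    } finally {
      writer.close();
    }
  }
}
```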

Sent from my iPad

On Jun 13, 2011, at 9:52 PM, sharath jagannath <[email protected]> 
wrote:

> Hey All,
> 
> I intend to build KMeans clustering for my documents, but I do not want to
> use the tf/tf-idf based vectors.
> I have a map of <term, weight> associated with each document,
> and I want to use these weights in place of the term counts for my computation.
> I had a look at the DocumentProcessor class; it is primarily driven by the
> term count.
> I am wondering whether I need to do something on my own, or whether there is
> built-in support for this.
> 
> -- 
> Thanks,
> Sharath