Re: MeanShiftCanopyDriver Output

Jeff Eastman Tue, 06 Dec 2011 09:11:34 -0800

You will need to wrap your input vectors in a NamedVector, using your document ids as the names. These will pass through the clustering process and you will be able to map each clustered vector back to your input that way.
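
To make the idea concrete, here is a rough sketch rather than working Mahout code. In Mahout the real class is org.apache.mahout.math.NamedVector, which wraps a Vector together with a name. The NamedPoint and ClusteredPoint classes below are hypothetical plain-Java stand-ins for it and for one (clusterId, vector) record from clusteredPoints, so the example runs without Mahout on the classpath. The document ids are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NamedVectorSketch {

    /** Hypothetical stand-in for a NamedVector: the values plus the id of the source document. */
    static final class NamedPoint {
        final String name;
        final double[] values;

        NamedPoint(String name, double[] values) {
            this.name = name;
            this.values = values;
        }
    }

    /** Hypothetical stand-in for one (clusterId, vector) record read from clusteredPoints. */
    static final class ClusteredPoint {
        final int clusterId;
        final NamedPoint point;

        ClusteredPoint(int clusterId, NamedPoint point) {
            this.clusterId = clusterId;
            this.point = point;
        }
    }

    /** Groups document ids by cluster id, using the name carried by each vector. */
    static Map<Integer, List<String>> documentsByCluster(List<ClusteredPoint> records) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        for (ClusteredPoint record : records) {
            result.computeIfAbsent(record.clusterId, id -> new ArrayList<>())
                  .add(record.point.name);
        }
        return result;
    }

    public static void main(String[] args) {
        // Input side: each vector is named after the document it was built from.
        NamedPoint a = new NamedPoint("doc-a", new double[] {1.0, 2.0});
        NamedPoint b = new NamedPoint("doc-b", new double[] {9.0, 8.0});
        NamedPoint c = new NamedPoint("doc-c", new double[] {1.1, 2.1});

        // Output side: the name is still attached to each clustered vector.
        List<ClusteredPoint> clustered = Arrays.asList(
                new ClusteredPoint(7, a),
                new ClusteredPoint(3, b),
                new ClusteredPoint(7, c));

        System.out.println(documentsByCluster(clustered));
    }
}
```

With Mahout itself, the equivalent steps would be to construct each input as a NamedVector before writing it to the input SequenceFile, then read the name back from the vector inside each WeightedVectorWritable in clusteredPoints.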


On 12/5/11 5:02 PM, Neil Chaudhuri wrote:

I am attempting to programmatically run MeanShiftCanopyDriver. I found this 
note about the output:



After running the algorithm, the output directory will contain:

  1.  clusters-N: directories containing SequenceFiles(Text, MeanShiftCanopy) 
produced by the algorithm for each iteration. The Text key is a cluster 
identifier string.
  2.  clusteredPoints: (if runClustering enabled) a directory containing 
SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable key is the 
clusterId. The WeightedVectorWritable value is a bean containing a double 
weight and a VectorWritable vector where the weight indicates the probability 
that the vector is a member of the cluster. As Mean Shift only produces a 
single clustering for each point, the weights are all == 1.

It seems like I can only expect to find a sequence file of clusterIds mapped to 
Vectors. I am lost as to where I can find a reference (perhaps an id) to the 
original documents being clustered. In other words, how can I map the output 
back to the input?

Thanks.
