You will need to wrap each input vector in a NamedVector, using your document ids as the names. The names pass through the clustering process with the vectors, so you can use them to map each clustered vector back to the input document it came from.
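
To illustrate the idea, here is a minimal sketch in plain Java. It does not use Mahout: NamedVec and ClusteredPoint are hypothetical stand-ins for a NamedVector and for one (clusterId, vector) record from clusteredPoints, and the document ids and cluster assignments are made up. With Mahout itself you would read the name off each clustered vector in the same way.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NamedVectorSketch {

    // Hypothetical stand-in for a named vector: the values plus the document id.
    static final class NamedVec {
        final String name;
        final double[] values;

        NamedVec(String name, double[] values) {
            this.name = name;
            this.values = values;
        }
    }

    // Hypothetical stand-in for one clusteredPoints record: (clusterId, vector).
    static final class ClusteredPoint {
        final int clusterId;
        final NamedVec vector;

        ClusteredPoint(int clusterId, NamedVec vector) {
            this.clusterId = clusterId;
            this.vector = vector;
        }
    }

    // Groups document ids by cluster id, reading the id from each vector's name.
    static Map<Integer, List<String>> idsByCluster(List<ClusteredPoint> points) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        for (ClusteredPoint p : points) {
            result.computeIfAbsent(p.clusterId, k -> new ArrayList<>()).add(p.vector.name);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up clustering output; the names were attached before clustering.
        List<ClusteredPoint> points = new ArrayList<>();
        points.add(new ClusteredPoint(0, new NamedVec("doc-17", new double[] {1.0, 2.0})));
        points.add(new ClusteredPoint(1, new NamedVec("doc-99", new double[] {9.0, 8.0})));
        points.add(new ClusteredPoint(0, new NamedVec("doc-42", new double[] {1.1, 2.1})));

        // Prints each cluster id with the ids of the documents assigned to it.
        System.out.println(idsByCluster(points));
    }
}
```

The point of the sketch is only that the id travels with the vector, so no separate lookup table between input order and output records is needed.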

On 12/5/11 5:02 PM, Neil Chaudhuri wrote:
I am attempting to programmatically run MeanShiftCanopyDriver. I found this 
note about the output:


After running the algorithm, the output directory will contain:

  1.  clusters-N: directories containing SequenceFiles(Text, MeanShiftCanopy) 
produced by the algorithm for each iteration. The Text key is a cluster 
identifier string.
  2.  clusteredPoints: (if runClustering enabled) a directory containing 
SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable key is the 
clusterId. The WeightedVectorWritable value is a bean containing a double 
weight and a VectorWritable vector where the weight indicates the probability 
that the vector is a member of the cluster. As Mean Shift only produces a 
single clustering for each point, the weights are all == 1.

It seems like I can only expect to find a sequence file of clusterIds mapped to 
Vectors. I am lost as to where I can find a reference (perhaps an id) to the 
original documents being clustered. In other words, how can I map the output 
back to the input?

Thanks.

