I am attempting to programmatically run MeanShiftCanopyDriver. I found this 
note about the output:


After running the algorithm, the output directory will contain:

 1.  clusters-N: directories containing SequenceFiles(Text, MeanShiftCanopy)
     produced by the algorithm for each iteration. The Text key is a cluster
     identifier string.
 2.  clusteredPoints: (if runClustering enabled) a directory containing
     SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable key is
     the clusterId. The WeightedVectorWritable value is a bean containing a
     double weight and a VectorWritable vector where the weight indicates the
     probability that the vector is a member of the cluster. As Mean Shift only
     produces a single clustering for each point, the weights are all == 1.

It seems like I can only expect to find a sequence file of clusterIds mapped to 
Vectors. I am lost as to where I can find a reference (perhaps an id) to the 
original documents being clustered. In other words, how can I map the output 
back to the input?
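
To make the problem concrete, here is a rough sketch of how I picture the
clusteredPoints records. It uses plain Java stand-ins rather than the real
Hadoop/Mahout classes, so ClusteredPoint and groupByCluster are hypothetical
names of my own and not part of any library. As far as I can tell from the
note above, each record carries only a clusterId, a weight, and a vector:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ClusteredPointsSketch {

    // Hypothetical stand-in for one (IntWritable, WeightedVectorWritable) pair.
    // Note that there is no field here that refers back to the input document.
    static final class ClusteredPoint {
        final int clusterId;
        final double weight;
        final double[] vector;

        ClusteredPoint(int clusterId, double weight, double[] vector) {
            this.clusterId = clusterId;
            this.weight = weight;
            this.vector = vector;
        }
    }

    // Groups the vectors by clusterId, which seems to be all the output allows.
    static Map<Integer, List<double[]>> groupByCluster(List<ClusteredPoint> points) {
        Map<Integer, List<double[]>> byCluster = new TreeMap<>();
        for (ClusteredPoint p : points) {
            byCluster.computeIfAbsent(p.clusterId, k -> new ArrayList<>()).add(p.vector);
        }
        return byCluster;
    }

    public static void main(String[] args) {
        // Made-up sample records; the weights are all 1 as the note describes.
        List<ClusteredPoint> points = List.of(
                new ClusteredPoint(0, 1.0, new double[] {1.0, 2.0}),
                new ClusteredPoint(0, 1.0, new double[] {1.1, 2.1}),
                new ClusteredPoint(1, 1.0, new double[] {9.0, 9.5}));

        for (Map.Entry<Integer, List<double[]>> e : groupByCluster(points).entrySet()) {
            System.out.println("cluster " + e.getKey() + ": " + e.getValue().size() + " vectors");
        }
    }
}
```

With records shaped like this I can tell which vectors ended up together, but
not which input document each vector came from.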

Thanks.
