You will need to wrap each input vector in a NamedVector, using the
document id as its name. The names pass through the clustering
process, so you can map each clustered vector back to its input
document by name.
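
For illustration, here is a minimal, dependency-free sketch of that round trip. It deliberately does not use the Mahout API: NamedPoint and ClusteredPoint are hypothetical stand-ins for NamedVector and a clusteredPoints record, and toyCluster is a placeholder, not mean shift. In Mahout itself you would presumably wrap each vector in a NamedVector before writing the input, then read the name back from the vector inside each WeightedVectorWritable (assuming the vector is still a NamedVector there).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a named vector: the document id travels with the values.
final class NamedPoint {
    final String name;
    final double[] values;

    NamedPoint(String name, double[] values) {
        this.name = name;
        this.values = values;
    }
}

// Hypothetical stand-in for one clusteredPoints record: (clusterId, vector).
final class ClusteredPoint {
    final int clusterId;
    final NamedPoint point;

    ClusteredPoint(int clusterId, NamedPoint point) {
        this.clusterId = clusterId;
        this.point = point;
    }
}

final class NameRoundTrip {
    // Step 1: wrap every input vector, using the document id as the name.
    static List<NamedPoint> wrap(Map<String, double[]> docs) {
        List<NamedPoint> points = new ArrayList<>();
        for (Map.Entry<String, double[]> e : docs.entrySet()) {
            points.add(new NamedPoint(e.getKey(), e.getValue()));
        }
        return points;
    }

    // Step 2: placeholder clusterer (NOT mean shift): splits on the first coordinate.
    static List<ClusteredPoint> toyCluster(List<NamedPoint> points, double threshold) {
        List<ClusteredPoint> out = new ArrayList<>();
        for (NamedPoint p : points) {
            int clusterId = p.values[0] < threshold ? 0 : 1;
            out.add(new ClusteredPoint(clusterId, p));
        }
        return out;
    }

    // Step 3: read the names back and group document ids by cluster id.
    static Map<Integer, List<String>> docIdsByCluster(List<ClusteredPoint> clustered) {
        Map<Integer, List<String>> byCluster = new TreeMap<>();
        for (ClusteredPoint cp : clustered) {
            byCluster.computeIfAbsent(cp.clusterId, k -> new ArrayList<>()).add(cp.point.name);
        }
        return byCluster;
    }
}
```

The point of the sketch is only that the clusterer never needs to know about document ids: the name rides along with the vector and is still attached when the output is read.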
On 12/5/11 5:02 PM, Neil Chaudhuri wrote:
I am attempting to programmatically run MeanShiftCanopyDriver. I found this
note about the output:
After running the algorithm, the output directory will contain:
1. clusters-N: directories containing SequenceFiles(Text, MeanShiftCanopy)
produced by the algorithm for each iteration. The Text key is a cluster
identifier string.
2. clusteredPoints: (if runClustering enabled) a directory containing
SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable key is the
clusterId. The WeightedVectorWritable value is a bean containing a double
weight and a VectorWritable vector where the weight indicates the probability
that the vector is a member of the cluster. As Mean Shift only produces a
single clustering for each point, the weights are all == 1.
It seems like I can only expect to find a sequence file of clusterIds mapped to
Vectors. I am lost as to where I can find a reference (perhaps an id) to the
original documents being clustered. In other words, how can I map the output
back to the input?
Thanks.