I am attempting to programmatically run MeanShiftCanopyDriver. I found this note about the output:
After running the algorithm, the output directory will contain:

1. clusters-N: directories containing SequenceFiles(Text, MeanShiftCanopy) produced by the algorithm for each iteration. The Text key is a cluster identifier string.
2. clusteredPoints: (if runClustering enabled) a directory containing SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable key is the clusterId. The WeightedVectorWritable value is a bean containing a double weight and a VectorWritable vector where the weight indicates the probability that the vector is a member of the cluster. As Mean Shift only produces a single clustering for each point, the weights are all == 1.

It seems like I can only expect to find a sequence file of clusterIds mapped to Vectors. I am lost as to where I can find a reference (perhaps an id) to the original documents being clustered. In other words, how can I map the output back to the input? Thanks.
