Hi Neil

I had a similar problem and managed to solve it. See
http://comments.gmane.org/gmane.comp.apache.mahout.user/10228


R


________________________________
 From: Neil Chaudhuri <[email protected]>
To: "[email protected]" <[email protected]> 
Sent: Tuesday, 6 December 2011, 0:02
Subject: MeanShiftCanopyDriver Output
 
I am attempting to programmatically run MeanShiftCanopyDriver. I found this 
note about the output:


After running the algorithm, the output directory will contain:

1.  clusters-N: directories containing SequenceFiles(Text, MeanShiftCanopy) 
produced by the algorithm for each iteration. The Text key is a cluster 
identifier string.
2.  clusteredPoints: (if runClustering enabled) a directory containing 
SequenceFile(IntWritable, WeightedVectorWritable). The IntWritable key is the 
clusterId. The WeightedVectorWritable value is a bean containing a double 
weight and a VectorWritable vector where the weight indicates the probability 
that the vector is a member of the cluster. As Mean Shift only produces a 
single clustering for each point, the weights are all == 1.
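
To make the shape of clusteredPoints concrete, here is a rough sketch in plain Java. It does not use the Mahout or Hadoop classes: WeightedVector and ClusteredPoint are hypothetical stand-ins for WeightedVectorWritable and for one (IntWritable, WeightedVectorWritable) record, and groupByCluster is an illustrative helper. It only shows that each record carries a clusterId, a weight and a vector, with no identifier for the original document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusteredPointsSketch {

    // Hypothetical stand-in for WeightedVectorWritable: a weight plus a vector.
    static final class WeightedVector {
        final double weight;
        final double[] vector;

        WeightedVector(double weight, double[] vector) {
            this.weight = weight;
            this.vector = vector;
        }
    }

    // Hypothetical stand-in for one record of clusteredPoints:
    // the key is the clusterId, the value is the weighted vector.
    static final class ClusteredPoint {
        final int clusterId;
        final WeightedVector point;

        ClusteredPoint(int clusterId, WeightedVector point) {
            this.clusterId = clusterId;
            this.point = point;
        }
    }

    // Groups the records by clusterId, preserving first-seen cluster order.
    static Map<Integer, List<WeightedVector>> groupByCluster(List<ClusteredPoint> points) {
        Map<Integer, List<WeightedVector>> byCluster = new LinkedHashMap<>();
        for (ClusteredPoint p : points) {
            byCluster.computeIfAbsent(p.clusterId, k -> new ArrayList<>()).add(p.point);
        }
        return byCluster;
    }

    public static void main(String[] args) {
        // Made-up sample data; every weight is 1.0, as described for Mean Shift.
        List<ClusteredPoint> points = new ArrayList<>();
        points.add(new ClusteredPoint(0, new WeightedVector(1.0, new double[] {1.0, 2.0})));
        points.add(new ClusteredPoint(1, new WeightedVector(1.0, new double[] {9.0, 8.0})));
        points.add(new ClusteredPoint(0, new WeightedVector(1.0, new double[] {1.5, 2.5})));

        for (Map.Entry<Integer, List<WeightedVector>> c : groupByCluster(points).entrySet()) {
            System.out.println("cluster " + c.getKey() + ": " + c.getValue().size() + " point(s)");
        }
    }
}
```

Reading the real output would mean iterating the SequenceFile with the Hadoop and Mahout classes instead of these stand-ins.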

It seems like I can only expect to find a sequence file of clusterIds mapped to 
Vectors. I am lost as to where I can find a reference (perhaps an id) to the 
original documents being clustered. In other words, how can I map the output 
back to the input?

Thanks.
