Hi Bob,

Cosine distance will return distances in the range 0.0 to 1.0, as you suggest, as long as your vectors have no negative components (for vectors with mixed signs it can reach 2.0). While there is no foolproof technique for priming the canopy T1 and T2 values, I recommend you begin by setting T1==T2 and doing a binary search from some initial distance, perhaps 0.1. If you get too few clusters, halve T1==T2 and try again. If you get too many, double it, and so on.
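
Here is a minimal sketch of that search, written in plain Python rather than against Mahout's Java API. The canopy pass is a simplified single-pass version, and the names cosine_distance, count_canopies and tune_threshold are made up for illustration. It assumes non-zero vectors with non-negative components.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; stays in [0, 1] for non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def count_canopies(points, t):
    """Count canopies for T1 == T2 == t (simplified canopy pass)."""
    remaining = list(points)
    count = 0
    while remaining:
        center = remaining.pop(0)
        count += 1
        # Points closer than t to the new center cannot start their own canopy.
        remaining = [p for p in remaining if cosine_distance(center, p) >= t]
    return count

def tune_threshold(points, target, t=0.1, max_steps=20):
    """Halve t on too few canopies, double it (capped at 1.0) on too many."""
    n = count_canopies(points, t)
    for _ in range(max_steps):
        if n == target:
            break
        t = t / 2 if n < target else min(t * 2, 1.0)
        n = count_canopies(points, t)
    return t, n

# Toy data (hypothetical): three tight pairs pointing in three directions.
points = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0),
          (0.0, 1.0, 0.0), (0.1, 0.9, 0.0),
          (0.0, 0.0, 1.0), (0.0, 0.1, 0.9)]
print(tune_threshold(points, target=3, t=0.001))
```

Pure halving and doubling can bounce between two thresholds when no value gives exactly the target count, which is why the loop is bounded by max_steps. In that case you would probably bisect between the last two values tried, or accept the closer count.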

If you want to be more analytical, use the RandomSeedGenerator to sample from your input vectors and compute a starting point from the distances between the sampled seed clusters. You can also skip Canopy and run k-means with -k specified, which samples from your input data and produces k clusters. That works pretty well with text and Cosine distance.
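
To illustrate the sampling idea, here is a rough Python sketch. It does not use Mahout's RandomSeedGenerator; random.sample stands in for it, and the function name sample_distance_summary is made up. Taking the median pairwise distance as a starting threshold is my own assumption, not a rule. It needs at least two non-zero vectors in the sample.

```python
import math
import random
from itertools import combinations
from statistics import median

def cosine_distance(a, b):
    """1 - cosine similarity; stays in [0, 1] for non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def sample_distance_summary(vectors, sample_size, seed=42):
    """Summarize pairwise cosine distances over a random sample of the input."""
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same sample
    sample = rng.sample(vectors, min(sample_size, len(vectors)))
    dists = sorted(cosine_distance(a, b) for a, b in combinations(sample, 2))
    return {"min": dists[0], "median": median(dists), "max": dists[-1]}

# Toy data (hypothetical): three tight pairs pointing in three directions.
vectors = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0),
           (0.0, 1.0, 0.0), (0.1, 0.9, 0.0),
           (0.0, 0.0, 1.0), (0.0, 0.1, 0.9)]
print(sample_distance_summary(vectors, sample_size=6))
```

The min and max tell you what range the measure actually produces on your data, which is a quick check on whether the distances really fall between 0.0 and 1.0.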

Once you arrive at a "reasonable" number of clusters, you can adjust T1 to include more points in the centroid calculations, but that will not change the number of clusters.
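
A small Python sketch of why that holds, assuming the classic canopy pass in which T2 decides which points are removed as candidate centers and T1 only decides membership. This is a simplified stand-in for illustration, not Mahout's implementation, and the function names are made up.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; stays in [0, 1] for non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def canopies(points, t1, t2):
    """Return a list of (center, members) pairs; expects t1 >= t2 > 0."""
    candidates = list(points)
    result = []
    while candidates:
        center = candidates.pop(0)
        # T1 (loose): every point this close contributes to the canopy.
        members = [p for p in points if cosine_distance(center, p) < t1]
        result.append((center, members))
        # T2 (tight): only these points are barred from becoming centers.
        candidates = [p for p in candidates if cosine_distance(center, p) >= t2]
    return result

# Toy data (hypothetical): three tight pairs pointing in three directions.
points = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0),
          (0.0, 1.0, 0.0), (0.1, 0.9, 0.0),
          (0.0, 0.0, 1.0), (0.0, 0.1, 0.9)]
tight = canopies(points, t1=0.1, t2=0.1)
loose = canopies(points, t1=0.9, t2=0.1)
print(len(tight), sum(len(m) for _, m in tight))
print(len(loose), sum(len(m) for _, m in loose))
```

Both runs use the same T2, so they produce the same centers. Only the membership lists grow when T1 is raised.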


On 5/15/12 10:45 AM, Robert Stewart wrote:
I am trying to run canopy clustering on vectors extracted from lucene index.  I want to 
use CosineDistanceMeasure.  How do I know what appropriate values to use for t1 and t2 
distance threshold?  I would assume that Cosine distance measure would return 
"distances" as a range from 0.0 to 1.0 but that seems not the case, so how do I 
know what the potential distance ranges are to pick t1 and t2 (other than many trial and 
errors)?

Thanks
Bob

