Hi Bob,
Cosine distance will return distances in the range 0.0 to 1.0, as you
suggest, provided all vector components are non-negative (as term
weights from a text index typically are). While there is no foolproof
technique for choosing canopy T1 & T2 values, I recommend you begin by
setting T1==T2 and doing a binary search from some initial distance,
perhaps 0.1. If you get too few clusters, halve T1==T2 and try again.
If you get too many, double it, and so on.
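
A minimal sketch of the halve/double search described above, written in
plain Python rather than against Mahout's Java API. The function names
(cosine_distance, count_canopies, tune_threshold) and the target range
arguments are hypothetical, and the canopy pass is a simplified
single-pass version that assumes T1 == T2:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; stays in [0, 1] for non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def count_canopies(vectors, t):
    """Single pass with T1 == T2 == t: a point founds a new canopy
    unless it lies within t of an existing canopy's founding point."""
    centers = []
    for v in vectors:
        if not any(cosine_distance(c, v) < t for c in centers):
            centers.append(v)
    return len(centers)

def tune_threshold(vectors, lo_count, hi_count, start=0.1, max_steps=20):
    """Halve t on too few canopies, double it on too many; once both
    sides are bracketed, bisect between the two bounds."""
    t, too_small, too_large = start, None, None
    for _ in range(max_steps):
        n = count_canopies(vectors, t)
        if lo_count <= n <= hi_count:
            return t, n
        if n < lo_count:        # too few clusters: t is too large
            too_large = t
        else:                   # too many clusters: t is too small
            too_small = t
        if too_small is not None and too_large is not None:
            t = (too_small + too_large) / 2
        elif too_large is not None:
            t = t / 2
        else:
            t = t * 2
    return t, count_canopies(vectors, t)
```

Each step costs one full canopy pass, so on a large data set it may be
worth running the search on a sample first.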
If you want to be more analytical, use the RandomSeedGenerator to sample
from your input vectors and compute a starting point using their
inter-cluster distances. You can also skip Canopy and use k-means with
-k specified to sample from your input data and produce k clusters. That
works pretty well with text and Cosine distance.
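
To illustrate the sampling idea, here is a rough Python sketch that
draws a random sample of the input vectors and summarizes their pairwise
cosine distances. Python's random.sample only stands in for Mahout's
RandomSeedGenerator here; the function name sample_distance_summary, the
sample size, and the seed are assumptions for illustration:

```python
import math
import random
import statistics
from itertools import combinations

def cosine_distance(a, b):
    """1 - cosine similarity; stays in [0, 1] for non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def sample_distance_summary(vectors, sample_size=100, seed=42):
    """Sample up to sample_size vectors (at least two are needed) and
    report the min, median, and max of their pairwise distances."""
    rng = random.Random(seed)
    sample = rng.sample(vectors, min(sample_size, len(vectors)))
    dists = sorted(cosine_distance(a, b) for a, b in combinations(sample, 2))
    return {
        "min": dists[0],
        "median": statistics.median(dists),
        "max": dists[-1],
    }
```

A threshold somewhere between the minimum and the median of the sampled
distances is one plausible starting point for the T1==T2 search; that
choice is a heuristic, not a rule.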
Once you arrive at a "reasonable" number of clusters, you can mess with
T1 to include more points in the centroid calculations but that will not
change the number of clusters.
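
The effect of T1 can be seen in a small single-pass canopy sketch in
Python. This is an illustrative simplification, not Mahout's code: the
function name canopies is hypothetical, T1 >= T2 is assumed, and a point
can only join canopies that already exist when it is processed:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; stays in [0, 1] for non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def canopies(vectors, t1, t2):
    """Return a list of (founder, members) pairs. A point joins every
    existing canopy whose founder is within t1, and founds a new canopy
    only if no founder is within t2."""
    result = []
    for v in vectors:
        strongly_bound = False
        for founder, members in result:
            d = cosine_distance(founder, v)
            if d < t1:
                members.append(v)      # contributes to this centroid
            if d < t2:
                strongly_bound = True  # too close to found a new canopy
        if not strongly_bound:
            result.append((v, [v]))
    return result
```

Only the T2 test decides whether a new canopy is created, so raising T1
with T2 fixed grows the member lists while the number of canopies stays
the same.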
On 5/15/12 10:45 AM, Robert Stewart wrote:
I am trying to run canopy clustering on vectors extracted from a Lucene index. I want to
use CosineDistanceMeasure. How do I know what appropriate values to use for the t1 and t2
distance thresholds? I would assume that the Cosine distance measure would return
"distances" in a range from 0.0 to 1.0, but that seems not to be the case, so how do I
know what the potential distance ranges are to pick t1 and t2 (other than by a lot of trial
and error)?
Thanks
Bob