Thanks, Jeff. I now see that cosine distance does return values in the range 0.0-1.0, as expected. Something else must have been wrong in my initial run.
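For anyone checking the range themselves, here is a minimal sketch in plain Python, not Mahout code. It assumes the usual definition of cosine distance, 1 - cos(a, b), and uses made-up term-weight vectors. With non-negative weights (term frequencies, TF-IDF) the result stays within 0.0-1.0. Only vectors with negative components can push it above 1.0, up to 2.0.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical term-weight vectors, non-negative like TF-IDF weights.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]   # same direction as doc_a
doc_c = [0.0, 0.0, 3.0]   # shares no terms with doc_a

same_direction = cosine_distance(doc_a, doc_b)   # close to 0.0
no_overlap = cosine_distance(doc_a, doc_c)       # 1.0, the maximum for non-negative vectors

# Only vectors with negative components can exceed 1.0.
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

If a run reports distances well outside 0.0-1.0 on non-negative document vectors, the likely cause is a different distance measure or something wrong in the input vectors.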
A different question about k-means: I can cluster successfully with k-means, but some of the resulting clusters contain very unrelated documents. It seems there needs to be some distance threshold when clustering documents with k-means, so that very dissimilar items are not put into any cluster. Is that possible with Mahout? I don't see any kind of threshold parameter for k-means.

On May 15, 2012, at 11:16 AM, Jeff Eastman wrote:

> Hi Bob,
>
> Cosine distance will return distances on 0.0...1.0 as you suggest. While
> there is no absolutely foolproof technique for priming canopy T1 & T2 values
> I recommend you begin by setting T1==T2 and doing a binary search from some
> initial distance, perhaps 0.1. If you get too few clusters, decrease T1==T2
> by half and try again. If too many, double etc.
>
> If you want to be more analytical, use the RandomSeedGenerator to sample from
> your input vectors and compute a starting point using their inter-cluster
> distances. You can also skip Canopy and use k-means with -k specified to
> sample from your input data and produce k clusters. That works pretty well
> with text and Cosine distance.
>
> Once you arrive at a "reasonable" number of clusters, you can mess with T1 to
> include more points in the centroid calculations but that will not change the
> number of clusters.
>
>
> On 5/15/12 10:45 AM, Robert Stewart wrote:
>> I am trying to run canopy clustering on vectors extracted from lucene index.
>> I want to use CosineDistanceMeasure. How do I know what appropriate values
>> to use for t1 and t2 distance threshold? I would assume that Cosine
>> distance measure would return "distances" as a range from 0.0 to 1.0 but
>> that seems not the case, so how do I know what the potential distance ranges
>> are to pick t1 and t2 (other than many trial and errors)?
>>
>> Thanks
>> Bob
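The halve-or-double search for T1==T2 described in the quoted reply can be sketched roughly as follows. This is plain Python, not Mahout's implementation: `count_canopies` is a simplified greedy canopy pass, `search_threshold` and its parameters are hypothetical names, and the sample vectors are made up. As one reading of "binary search", the sketch bisects once it has seen both a too-small and a too-large threshold, so it does not bounce between the same two values.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def count_canopies(points, t):
    """Greedy canopy pass with T1 == T2 == t; returns the number of canopies."""
    remaining = list(points)
    canopies = 0
    while remaining:
        center = remaining.pop(0)
        canopies += 1
        # Points within t of the center are absorbed and cannot start a canopy.
        remaining = [p for p in remaining if cosine_distance(center, p) > t]
    return canopies

def search_threshold(points, min_k, max_k, t=0.1, max_steps=20):
    """Halve or double t until the canopy count falls in [min_k, max_k].

    Returns (t, count). If max_steps runs out, the count may still be
    outside the requested range.
    """
    lo, hi = 0.0, None   # lo gave too many clusters, hi gave too few
    for _ in range(max_steps):
        k = count_canopies(points, t)
        if k < min_k:        # too few clusters: threshold is too large
            hi = t
            t = (lo + t) / 2.0
        elif k > max_k:      # too many clusters: threshold is too small
            lo = t
            t = t * 2.0 if hi is None else (t + hi) / 2.0
        else:
            return t, k
    return t, count_canopies(points, t)

# Made-up vectors forming three tight groups.
points = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.1, 0.9],
]

t_from_small, k_from_small = search_threshold(points, 3, 3, t=0.001)
t_from_large, k_from_large = search_threshold(points, 3, 3, t=1.0)
```

On real data the canopy count is not guaranteed to change smoothly with the threshold, so accepting a range of counts (`min_k` to `max_k`) instead of one exact value is probably the safer way to use a search like this.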
