Thanks, Jeff.  I do see now that cosine distance returns values in 0.0-1.0 as 
expected.  Something else must have been wrong in my initial run.  
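
For the archive, here is a minimal sketch of why I would expect that range 
for my data.  It assumes the measure is 1 minus the cosine similarity and 
that all term weights are non-negative; with negative components the same 
formula can reach 2.0.  This is plain Python for illustration, not Mahout 
code, and cosine_distance is a name I made up.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat an all-zero vector as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# Parallel vectors (same direction) are at distance ~0.
same = cosine_distance([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
# Vectors with no shared non-zero terms are at distance 1.
disjoint = cosine_distance([1.0, 0.0, 0.0], [0.0, 3.0, 0.0])
print(same, disjoint)
```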

A different question about k-means:  I can cluster successfully with k-means, 
but some of the resulting clusters contain documents that are largely 
unrelated to each other.  It seems there needs to be some distance threshold 
when assigning documents, so that documents too dissimilar from every 
centroid simply don't get put into any cluster.  Is that possible with 
Mahout?  I don't see any threshold parameters for k-means.
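
To make the question concrete, here is a minimal sketch of the behavior I 
have in mind, applied as a post-processing step once the centroids are 
known.  It is plain Python, not Mahout's API; assign_with_threshold and 
max_distance are hypothetical names, and the vectors are toy data.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat an all-zero vector as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def assign_with_threshold(points, centroids, max_distance):
    """Assign each point to its nearest centroid, or None if it is too far."""
    assignments = []
    for p in points:
        distances = [cosine_distance(p, c) for c in centroids]
        nearest = min(range(len(centroids)), key=lambda i: distances[i])
        # Points farther than max_distance from every centroid stay unclustered.
        if distances[nearest] <= max_distance:
            assignments.append(nearest)
        else:
            assignments.append(None)
    return assignments

centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
points = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]
print(assign_with_threshold(points, centroids, 0.3))  # -> [0, 1, None]
```

The third document shares no terms with either centroid, so it is left 
unassigned instead of being forced into the nearest cluster.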


On May 15, 2012, at 11:16 AM, Jeff Eastman wrote:

> Hi Bob,
> 
> Cosine distance will return distances on 0.0...1.0 as you suggest. While 
> there is no absolutely foolproof technique for priming canopy T1 & T2 
> values, I recommend you begin by setting T1==T2 and doing a binary search 
> from some initial distance, perhaps 0.1. If you get too few clusters, 
> decrease T1==T2 by half and try again. If you get too many, double it, and 
> so on.
> 
> If you want to be more analytical, use the RandomSeedGenerator to sample from 
> your input vectors and compute a starting point using their inter-cluster 
> distances. You can also skip Canopy and use k-means with -k specified to 
> sample from your input data and produce k clusters. That works pretty well 
> with text and Cosine distance.
> 
> Once you arrive at a "reasonable" number of clusters, you can mess with T1 to 
> include more points in the centroid calculations but that will not change the 
> number of clusters.
> 
> 
> On 5/15/12 10:45 AM, Robert Stewart wrote:
>> I am trying to run canopy clustering on vectors extracted from a Lucene 
>> index.  I want to use CosineDistanceMeasure.  How do I know what 
>> appropriate values to use for the t1 and t2 distance thresholds?  I would 
>> assume that the Cosine distance measure would return "distances" in a range 
>> from 0.0 to 1.0, but that seems not to be the case, so how do I know what 
>> the potential distance ranges are in order to pick t1 and t2 (other than by 
>> a lot of trial and error)?
>> 
>> Thanks
>> Bob
>> 
> 
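
For reference, the halve-or-double procedure Jeff describes in the quoted 
message can be sketched roughly as below.  This is plain Python, not Mahout 
code: count_canopies is a simplified single-pass canopy with T1 == T2, and 
tune_threshold, min_clusters, max_clusters and max_rounds are hypothetical 
names of mine.  A stricter binary search would also track lower and upper 
bounds; this only follows the halve/double rule as stated.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat an all-zero vector as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def count_canopies(points, t):
    """Count canopies from one greedy pass with T1 == T2 == t."""
    remaining = list(points)
    count = 0
    while remaining:
        center = remaining.pop(0)
        count += 1
        # With T1 == T2, every point closer than t to the center is removed.
        remaining = [p for p in remaining if cosine_distance(center, p) >= t]
    return count

def tune_threshold(points, min_clusters, max_clusters, t=0.1, max_rounds=20):
    """Halve t when there are too few clusters, double it when too many."""
    for _ in range(max_rounds):
        n = count_canopies(points, t)
        if n < min_clusters:
            t /= 2.0
        elif n > max_clusters:
            t *= 2.0
        else:
            return t, n
    # No acceptable value found within max_rounds; report the last state.
    return t, count_canopies(points, t)

# Toy data: three tight groups of two documents each.
docs = [[1.0, 0.0, 0.0], [0.95, 0.05, 0.0],
        [0.0, 1.0, 0.0], [0.05, 0.95, 0.0],
        [0.0, 0.0, 1.0], [0.0, 0.05, 0.95]]
print(tune_threshold(docs, 3, 3))
```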
