A threshold parameter for k-means was added in Mahout 0.7-SNAPSHOT; it is not available in Mahout 0.6. See the clusterClassificationThreshold parameter of KMeansDriver.run:

  public static void run(Configuration conf, Path input, Path clustersIn, Path output,
      DistanceMeasure measure, double convergenceDelta, int maxIterations,
      boolean runClustering, double clusterClassificationThreshold,
      boolean runSequential)
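
To illustrate the general idea of a classification threshold (this is only a sketch, not Mahout's implementation), the plain-Java example below assigns a point to its nearest centroid only when the cosine distance is within a cutoff, and otherwise leaves it unclustered. The class name, the method names and the maxDistance parameter are all made up for illustration; how KMeansDriver actually scores points against clusterClassificationThreshold is not shown here, so check the Mahout source before relying on any particular value.

```java
final class ThresholdAssignSketch {

    // Cosine distance: 1 - cos(angle). For non-negative vectors this lies in [0, 1].
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the index of the nearest centroid, or -1 if even the nearest
    // centroid is farther away than maxDistance (hypothetical cutoff).
    static int assign(double[] point, double[][] centroids, double maxDistance) {
        int best = -1;
        double bestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = cosineDistance(point, centroids[c]);
            if (d < bestDistance) {
                bestDistance = d;
                best = c;
            }
        }
        return bestDistance <= maxDistance ? best : -1;
    }

    public static void main(String[] args) {
        double[][] centroids = { {1.0, 0.1}, {0.0, 1.0} };
        double[] close = {1.0, 0.0};   // nearly parallel to the first centroid
        double[] between = {1.0, 1.0}; // roughly halfway between the two
        System.out.println(assign(close, centroids, 0.1));
        System.out.println(assign(between, centroids, 0.1));
    }
}
```

With a tight cutoff the in-between point is left out of every cluster, which is the behaviour asked about below.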


On 15-05-2012 21:06, Robert Stewart wrote:
Thanks Jeff.  I do see that cosine distance does return 0.0-1.0 now as 
expected.  Something else was wrong in my initial run I guess.

A different question about k-means:  I can successfully cluster using k-means, 
but some of the resulting clusters contain very unrelated documents, so it seems 
like there needs to be some distance threshold when clustering documents with 
k-means (so that very dissimilar items just don't get put into any cluster).  Is 
that possible with Mahout?  I don't see any type of threshold parameter for 
k-means.


On May 15, 2012, at 11:16 AM, Jeff Eastman wrote:

Hi Bob,

Cosine distance will return distances in the range 0.0...1.0, as you suggest. While there is 
no absolutely foolproof technique for priming canopy T1 & T2 values, I recommend 
you begin by setting T1==T2 and doing a binary search from some initial distance, 
perhaps 0.1. If you get too few clusters, decrease T1==T2 by half and try again. If 
too many, double it, and so on.
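
The search described above can be sketched in plain Java as follows. This is a toy, in-memory stand-in rather than Mahout code: countCanopies is a simplified single-pass canopy count with T1==T2, and searchThreshold, its parameters and the sample points are hypothetical names and data chosen for illustration. In practice each step would be a separate canopy run over the real vectors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CanopyThresholdSearch {

    // Cosine distance: 1 - cos(angle). For non-negative vectors this lies in [0, 1].
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Simplified canopy pass with T1 == T2 == t: take the first remaining point
    // as a canopy center, drop every point within t of it, and repeat.
    static int countCanopies(List<double[]> points, double t) {
        List<double[]> remaining = new ArrayList<>(points);
        int canopies = 0;
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            canopies++;
            remaining.removeIf(p -> cosineDistance(center, p) < t);
        }
        return canopies;
    }

    // Too few canopies: halve t. Too many: double t. Once both a too-small and
    // a too-large threshold have been seen, bisect between them.
    static double searchThreshold(List<double[]> points, int target, double start, int maxSteps) {
        double t = start;
        double tooSmall = Double.NaN; // last t that produced too many canopies
        double tooLarge = Double.NaN; // last t that produced too few canopies
        for (int i = 0; i < maxSteps; i++) {
            int n = countCanopies(points, t);
            if (n == target) {
                return t;
            }
            if (n > target) {
                tooSmall = t;
            } else {
                tooLarge = t;
            }
            if (Double.isNaN(tooSmall)) {
                t = t / 2.0;
            } else if (Double.isNaN(tooLarge)) {
                t = t * 2.0;
            } else {
                t = (tooSmall + tooLarge) / 2.0;
            }
        }
        return t; // best effort if the target count was never hit
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
            new double[] {1.0, 0.0}, new double[] {0.99, 0.05},
            new double[] {0.0, 1.0}, new double[] {0.05, 0.99},
            new double[] {1.0, 1.0}, new double[] {1.0, 0.95});
        double t = searchThreshold(points, 2, 0.1, 20);
        System.out.println("T1==T2=" + t + " gives " + countCanopies(points, t) + " canopies");
    }
}
```

A given target count may not be reachable for every data set, which is why the loop is capped at maxSteps and returns the last threshold tried.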

If you want to be more analytical, use the RandomSeedGenerator to sample from 
your input vectors and compute a starting point using their inter-cluster 
distances. You can also skip Canopy and use k-means with -k specified to sample 
from your input data and produce k clusters. That works pretty well with text 
and Cosine distance.

Once you arrive at a "reasonable" number of clusters, you can mess with T1 to 
include more points in the centroid calculations but that will not change the number of 
clusters.


On 5/15/12 10:45 AM, Robert Stewart wrote:
I am trying to run canopy clustering on vectors extracted from a Lucene index.  I want to 
use CosineDistanceMeasure.  How do I know what appropriate values to use for the t1 and t2 
distance thresholds?  I would assume that the Cosine distance measure would return 
"distances" in a range from 0.0 to 1.0, but that seems not to be the case, so how do I 
know what the potential distance ranges are to pick t1 and t2 (other than by much trial and 
error)?

Thanks
Bob

