Hi Ted,

Yes, this is great! I hope to start working with this algorithm in the next couple of weeks.

I have a question about the 0.7 implementation of kmeans and the clusterClassificationThreshold. I have this value set to zero, but the output still shows that about 1/3 of my data is not assigned to a cluster. Am I using this value incorrectly? I ran KMeansDriver.run with both the 0.5 and 0.7 APIs, and the data was pruned even with clusterClassificationThreshold = 0.

Thanks,
Mattie

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Wednesday, August 15, 2012 5:20 PM
To: [email protected]
Subject: Re: Mahout-279/kmeans++

Mattie,

Would this help?

https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java

and

https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf

On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <[email protected]> wrote:

> Hi!
>
> I have been using RandomSeedGenerator, and was hoping it had a patch like
> that described in Mahout-279 since I want only 10 vectors out of a set of
> more than 100,000,000. I have been using canopy clustering for better
> results, but still need to do a few passes of kmeans to determine my T, and
> the random seed does take a long time.
>
> The comments say that you are working on a kmeans++, I searched around but
> couldn't confirm any more information about it. Is a scalable kmeans++ in
> the works? (I know research on the subject is quite new)
>
> Thanks!
>
> Mattie Whitmore
> Mathematician/IR&D Software Engineer
> HARRIS Corporation - Advanced Information Solutions
> 301.837.5278
> [email protected]<mailto:[email protected]>
