Hi Ted,

Yes, this is great! I hope to start working with this algorithm in the next
couple of weeks.

I have a question about the 0.7 implementation of k-means and the
clusterClassificationThreshold. I have this value set to zero, but about 1/3
of my data is still not assigned to a cluster in my output. Am I using this
value incorrectly? I ran KMeansDriver.run with both the 0.5 and 0.7 APIs, and
the data was pruned despite clusterClassificationThreshold = 0.


Thanks,

Mattie


-----Original Message-----
From: Ted Dunning [mailto:[email protected]] 
Sent: Wednesday, August 15, 2012 5:20 PM
To: [email protected]
Subject: Re: Mahout-279/kmeans++

Mattie,

Would this help?

https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java

and

https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf

On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <[email protected]> wrote:

> Hi!
>
> I have been using RandomSeedGenerator, and was hoping it had a patch like
> that described in Mahout-279 since I want only 10 vectors out of a set of
> more than 100,000,000.  I have been using canopy clustering for better
> results, but still need to do a few passes of kmeans to determine my T, and
> the random seed does take a long time.
>
> The comments say that you are working on a kmeans++. I searched around but
> couldn't find any more information about it. Is a scalable kmeans++ in
> the works? (I know research on the subject is quite new.)
>
> Thanks!
>
>
>
> Mattie Whitmore
> Mathematician/IR&D Software Engineer
> HARRIS  Corporation - Advanced Information Solutions
> 301.837.5278
> [email protected]
>
>
>
>
