clusterClassificationThreshold is for outlier removal, and this is the way it
should be used.
Can you provide some more information about your job and the way you are
calling it?
Looking at the code, a vector should still be assigned to a cluster when its
pdf is 0, as long as the threshold is also 0. This is the method that decides
whether a vector should be assigned to a cluster:
  /**
   * Decides whether the vector should be classified or not based on the max pdf
   * value of the clusters and threshold value.
   *
   * @return whether the vector should be classified or not.
   */
  private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
    return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
  }
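To make the comparison concrete, here is a rough standalone sketch of the same
check. It is not Mahout code: a plain double[] stands in for Vector, and the
class name ThresholdSketch and the helper maxValue are made up for
illustration. It only shows that a >= test against a threshold of 0 accepts
any non-negative pdf values, including all zeros.

```java
public class ThresholdSketch {

    // Hypothetical stand-in for pdfPerCluster.maxValue(): largest array entry.
    public static double maxValue(double[] pdfPerCluster) {
        double max = Double.NEGATIVE_INFINITY;
        for (double p : pdfPerCluster) {
            if (p > max) {
                max = p;
            }
        }
        return max;
    }

    // Same comparison as the quoted method, on plain doubles.
    public static boolean shouldClassify(double[] pdfPerCluster, double threshold) {
        return maxValue(pdfPerCluster) >= threshold;
    }

    public static void main(String[] args) {
        double[] allZero = {0.0, 0.0, 0.0};
        double[] weak = {0.05, 0.20, 0.10};

        System.out.println(shouldClassify(allZero, 0.0)); // true:  0.0 >= 0.0
        System.out.println(shouldClassify(weak, 0.0));    // true:  0.2 >= 0.0
        System.out.println(shouldClassify(weak, 0.5));    // false: 0.2 <  0.5
    }
}
```

If this check is the only thing that filters vectors, a threshold of 0 should
not drop any vector with non-negative pdf values, so the missing third of the
data is probably being lost somewhere else in the job.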
On 17-08-2012 20:06, Whitmore, Mattie wrote:
Hi Ted,
Yes this is great! I hope to start working with this algorithm in the next
couple weeks.
I have a question about the 0.7 implementation of kmeans and the
clusterClassificationThreshold. I have this value set to zero, but my output
still shows that about 1/3 of my data is not assigned to a cluster. Am I using
this value incorrectly? I ran kmeansdriver.run with both the 0.5 and the 0.7
API, and the data was pruned in both cases despite
clusterClassificationThreshold = 0.
Thanks,
Mattie
-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Wednesday, August 15, 2012 5:20 PM
To: [email protected]
Subject: Re: Mahout-279/kmeans++
Mattie,
Would this help?
https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
and
https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <[email protected]> wrote:
Hi!
I have been using RandomSeedGenerator, and was hoping it had a patch like
that described in Mahout-279 since I want only 10 vectors out of a set of
more than 100,000,000. I have been using canopy clustering for better
results, but still need to do a few passes of kmeans to determine my T, and
the random seed does take a long time.
The comments say that you are working on a kmeans++. I searched around but
couldn't find any more information about it. Is a scalable kmeans++ in
the works? (I know research on the subject is quite new.)
Thanks!
Mattie Whitmore
Mathematician/IR&D Software Engineer
HARRIS Corporation - Advanced Information Solutions
301.837.5278
[email protected]