RE: Mahout-279/kmeans++

Whitmore, Mattie Fri, 17 Aug 2012 08:37:54 -0700

Sure. I have a dataset which I wish to cluster using Kmeans.  Previously (v0.5), 
when I ran a clusterdump, the total number of vectors within the resultant 
clusters was the same as the total number fed to the algorithm.  I wish this to 
be the case when clustering with v0.7.  The only change in the algorithm is 
clusterClassificationThreshold; I set this value to 0 so that it will in 
fact cluster all vectors in the dataset.

My logic here was that no vector should have a probability of being in some 
cluster that is less than 0, and therefore all vectors should be clustered.

However, after running a clusterdump I find that roughly 1/3 of the vectors 
have been pruned.

Is this a bug, or am I just not understanding the new capabilities?

I should also mention that I have vectors which are exactly the same (even 
their names). Perhaps they are the ones being pruned; is that possible?

Another question, if I may: I will eventually want to use the pruning 
capabilities. Does the ClusterOutputPostProcessorDriver method (or a similar 
method) have the capability of outputting the pruned vectors into a folder?

Thanks! Please let me know if I'm still not being clear enough.

Mattie

-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]] 
Sent: Friday, August 17, 2012 11:20 AM
To: [email protected]
Subject: Re: Mahout-279/kmeans++

clusterClassificationThreshold is for outlier removal, and this is the way it 
should be used.

Can you provide some more information about your job and the way you are 
calling it?

Looking at the code, the vector should be clustered even if the pdf is 0. 
This is the method which decides whether the vector should be assigned to a 
particular cluster or not:

   /**
    * Decides whether the vector should be classified or not based on the max pdf
    * value of the clusters and threshold value.
    *
    * @return whether the vector should be classified or not.
    */
   private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
     return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
   }
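
To illustrate the comparison above without pulling in Mahout, here is a minimal 
sketch. It is not the Mahout code: the class name ThresholdSketch is 
hypothetical, and a plain double[] stands in for the Vector of per-cluster pdf 
values. It only shows that, as long as the largest pdf value is 0 or more, a 
threshold of 0 lets the vector through.

```java
import java.util.Arrays;

// Hypothetical stand-in for illustration only; not part of Mahout.
public class ThresholdSketch {

    // Same comparison as the quoted shouldClassify, with double[] assumed
    // in place of Mahout's Vector.
    static boolean shouldClassify(double[] pdfPerCluster, double threshold) {
        double max = Arrays.stream(pdfPerCluster)
                           .max()
                           .orElse(Double.NEGATIVE_INFINITY);
        return max >= threshold;
    }

    public static void main(String[] args) {
        // All pdf values are 0 and the threshold is 0: 0 >= 0, so it is kept.
        System.out.println(shouldClassify(new double[] {0.0, 0.0, 0.0}, 0.0)); // true
        // Typical case with threshold 0: kept.
        System.out.println(shouldClassify(new double[] {0.1, 0.7, 0.2}, 0.0)); // true
        // A higher threshold prunes a vector whose best pdf is below it.
        System.out.println(shouldClassify(new double[] {0.1, 0.3, 0.2}, 0.5)); // false
    }
}
```

If this reading is right, a threshold of 0 should not prune vectors through 
this check alone, so the missing vectors would have to be dropped somewhere else.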

On 17-08-2012 20:06, Whitmore, Mattie wrote:

> Hi Ted,
>
> Yes this is great!  I hope to start working with this algorithm in the next 
> couple weeks.
>
> I have a question about the 0.7 implementation of kmeans and the 
> clusterClassificationThreshold.  I have this value set at zero, but the 
> output is still showing that about 1/3 of my data is not assigned to a 
> cluster.  Am I using this value incorrectly?  I did a 
> kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned despite 
> the clusterClassificationThreshold = 0.
>
>
> Thanks,
>
> Mattie
>
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Wednesday, August 15, 2012 5:20 PM
> To: [email protected]
> Subject: Re: Mahout-279/kmeans++
>
> Mattie,
>
> Would this help?
>
> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
>
> and
>
> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
>
> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <[email protected]> wrote:
>
>> Hi!
>>
>> I have been using RandomSeedGenerator, and was hoping it had a patch like
>> that described in Mahout-279 since I want only 10 vectors out of a set of
>> more than 100,000,000.  I have been using canopy clustering for better
>> results, but still need to do a few passes of kmeans to determine my T, and
>> the random seed does take a long time.
>>
>> The comments say that you are working on a kmeans++, I searched around but
>> couldn't confirm any more information about it.  Is a scalable kmeans++ in
>> the works? (I know research on the subject is quite new)
>>
>> Thanks!
>>
>>
>>
>> Mattie Whitmore
>> Mathematician/IR&D Software Engineer
>> HARRIS  Corporation - Advanced Information Solutions
>> 301.837.5278
>> [email protected]<mailto:[email protected]>
>>
>>
>>
>>
