RE: Mahout-279/kmeans++

Whitmore, Mattie Wed, 29 Aug 2012 07:38:09 -0700

I re-ran the canopy-kmeans analytic, this time with unique names, and lost more 
points in the resulting clusters (total points in the clusters = 745490, vs. 
previously 1599154 for v0.7 and 45901885 for v0.5).  The total number of data 
points fed into the algorithm is 53365862 -- so even v0.5 is missing 14% of the 
data.
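
The counts above can be turned into missing fractions with a few lines of Java. This is only an arithmetic sketch: the class and method names are made up for illustration, and the numbers are the ones quoted in this message. It gives roughly 14% missing for v0.5, 97% for v0.7, and 98.6% for the v0.7 run with unique names.

```java
final class ClusterCoverage {
    /** Percentage of input points that do not show up in any output cluster. */
    static double missingPercent(long clustered, long total) {
        return 100.0 * (total - clustered) / total;
    }

    public static void main(String[] args) {
        long total = 53365862L;         // points fed into the algorithm
        long v05 = 45901885L;           // points found in clusters, v0.5
        long v07 = 1599154L;            // points found in clusters, v0.7
        long v07UniqueNames = 745490L;  // points found in clusters, v0.7 with unique names

        System.out.printf("v0.5:              %.2f%% missing%n", missingPercent(v05, total));
        System.out.printf("v0.7:              %.2f%% missing%n", missingPercent(v07, total));
        System.out.printf("v0.7 unique names: %.2f%% missing%n", missingPercent(v07UniqueNames, total));
    }
}
```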


I'm thinking that weighting each of these dense vectors by the number of 
identical vectors in the set could work -- Ball Kmeans seems to do this.  Is 
this a correct interpretation of how to use weights in Ball Kmeans, and is 
Ball Kmeans ready enough to be used/tested?
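
To make the weighting idea concrete, here is a minimal sketch in plain Java with no Mahout classes. It collapses identical vectors into one entry whose weight is the number of copies, and shows that the weighted centroid of the distinct vectors equals the ordinary centroid of the original points. All names here are hypothetical, and whether Ball Kmeans interprets weights this way is exactly the open question above; this only illustrates the bookkeeping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WeightedDuplicates {
    /** Collapses identical vectors into one key whose value is the number of copies. */
    static Map<List<Double>, Integer> collapse(List<double[]> points) {
        Map<List<Double>, Integer> weights = new LinkedHashMap<>();
        for (double[] p : points) {
            List<Double> key = new ArrayList<>();
            for (double v : p) {
                key.add(v);
            }
            weights.merge(key, 1, Integer::sum);
        }
        return weights;
    }

    /** Weighted mean of the distinct vectors; matches the plain mean of the original points. */
    static double[] weightedCentroid(Map<List<Double>, Integer> weights, int dim) {
        double[] sum = new double[dim];
        long totalWeight = 0;
        for (Map.Entry<List<Double>, Integer> e : weights.entrySet()) {
            int w = e.getValue();
            for (int i = 0; i < dim; i++) {
                sum[i] += w * e.getKey().get(i);
            }
            totalWeight += w;
        }
        for (int i = 0; i < dim; i++) {
            sum[i] /= totalWeight;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[] {1.0, 2.0}, new double[] {1.0, 2.0},
                new double[] {1.0, 2.0}, new double[] {5.0, 6.0});
        Map<List<Double>, Integer> weights = collapse(points);
        System.out.println(weights.size() + " distinct vectors");          // 2 distinct vectors
        System.out.println(Arrays.toString(weightedCentroid(weights, 2))); // [2.0, 3.0]
    }
}
```

With this representation the sum of the weights, not the number of stored vectors, is what should be compared against the input count.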

Thanks

-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]] 
Sent: Thursday, August 23, 2012 12:34 PM
To: [email protected]
Subject: Re: Mahout-279/kmeans++

clusterDump works in memory, and there are no plans yet to make it distributed 
(or not in memory). See https://issues.apache.org/jira/browse/MAHOUT-940

clusterpp has an option for distributed processing, so you can process any 
amount of data with it.

On 23-08-2012 19:55, Whitmore, Mattie wrote:
> Yes, unique names will be my next plan -- I just can't kick off that job 
> until after the weekend.  If this makes no difference I will also try the 
> noise idea, and I'll follow up about both.
>
> My next question is regarding clusterDump.  Is there a way to run this in 
> parallel? I have found some code to execute in java (the preferable method 
> for me) but I would like the method to be faster and not in memory.  Is this 
> a possibility? Or in the works?
>
> Thanks!
>
> -----Original Message-----
> From: Paritosh Ranjan [mailto:[email protected]]
> Sent: Wednesday, August 22, 2012 9:09 PM
> To: [email protected]
> Subject: Re: Mahout-279/kmeans++
>
> Can you also try to provide distinct names to vectors and then cluster?
> It should not have any effect, but it would be good to know the behavior.
>
> On 22-08-2012 23:10, Whitmore, Mattie wrote:
>> Yes, I have data which is exactly the same.  If I give every vector a name 
>> which is distinct (albeit the data point is the same as other points in the 
>> set) will this keep the algorithm from dropping non-distinct vectors/data 
>> points (which is what I THINK but have yet to verify is what is going on)?
>>
>> Thanks,
>>
>> Mattie
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:[email protected]]
>> Sent: Wednesday, August 22, 2012 1:18 PM
>> To: [email protected]
>> Subject: Re: Mahout-279/kmeans++
>>
>> Just an off thought, do you have duplicate input points?
>>
>> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie 
>> <[email protected]> wrote:
>>
>>> ... I have also verified by running canopy multiple times with 0.5 and 0.7
>>> that there is a continual discrepancy between the two clustering versions.
>>> The max/min vectors in a cluster using 0.5 is: 19192158/215  and 0.7 is:
>>> 921998/5.  They should not necessarily be the same, since I am using canopy
>>> clustering to find initial centroids, however I would think they would have
>>> the same sum, which they do not (45901885 vs 1599154).
>>>
>>> Here is the method I am running:
>>>
>>> public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
>>>         throws IOException, InterruptedException, ClassNotFoundException,
>>>                InstantiationException, IllegalAccessException {
>>>
>>>     Configuration conf = new Configuration();
>>>
>>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>
>>>     Path vectorsFolder = new Path(outputDir, "vectors");
>>>     Path clusterCenters = new Path(outputDir + "-canopy/centriods");
>>>     Path clusterOutput = new Path(outputDir + "-canopy/clusters");
>>>
>>>     // create canopies instead of initial vectors
>>>     CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
>>>             Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
>>>
>>>     // kmeans cluster operation
>>>     KMeansDriver.run(conf, vectorsFolder,
>>>             new Path(clusterCenters, "clusters-0-final/part-r-00000"),
>>>             clusterOutput, measure, 0.01,
>>>             Integer.parseInt(itMax), true, 0.0, false);
>>>
>>>     // post process by putting completed clusters into their own files.
>>>     ClusterOutputPostProcessorDriver.run(clusterOutput,
>>>             new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
>>> }
>>>
>>> What do you think?
>>>
>>> On another but related note: Is there a plan to have a method -- say
>>> ClusterOutputPostProcessorDriver -- which when run outputs the vectors
>>> within clusters as well as a separate folder containing pruned outliers?
>>>
>>> Thanks!
>>>
>>> Mattie
>>>
>>> -----Original Message-----
>>> From: Paritosh Ranjan [mailto:[email protected]]
>>> Sent: Friday, August 17, 2012 12:16 PM
>>> To: [email protected]
>>> Subject: Re: Mahout-279/kmeans++
>>>
>>> The clustering algorithm has also changed internally. So, expect the
>>> results to be different ( and better ).
>>>
>>> I can think of one reason for this behavior. Maybe lots of clusters have
>>> only one vector in them, and, AFAIK, clusterdumper will not output any
>>> cluster with a single vector.
>>> So, I think, it's clusterdumper which is doing the invisible "pruning"
>>> (by not outputting clusters with single vectors).
>>>
>>> Can you cross check the output once with ClusterOutputPostProcessorDriver?
>>>
>>> No, no tool can output the pruned vectors. The only way to see all
>>> vectors assigned to any cluster is to set clusterClassificationThreshold
>>> to 0.
>>>
>>> If you still face the problem, then please provide the parameters with
>>> which you are calling kmeans.
>>>
>>> Regarding "I should also mention I have vectors which are exactly the
>>> same (even their names), perhaps they are the ones being pruned, is that
>>> possible? "
>>>
>>> The name of the vector has nothing to do with clustering, I am not sure
>>> whether it will have any effect when clusterdumper is in action. So,
>>> crosschecking with ClusterOutputPostProcessorDriver will answer this.
>>>
>>> Good luck.
>>> Paritosh
>>>
>>> On 17-08-2012 21:07, Whitmore, Mattie wrote:
>>>> Sure, I have a dataset which I wish to cluster using Kmeans.  Previously
>>>> (v0.5) when I did a clusterdump the total amount of vectors within the
>>>> resultant clusters was the same as the total amount fed to the algorithm.
>>>> I wish this to be the case when clustering with v0.7.  The only change in
>>>> the algorithm is clusterClassificationThreshold,  I set this value to be 0
>>>> so that it will in fact cluster all vectors in the dataset.
>>>>
>>>> My logic here was no vector should have a probability of being in some
>>>> cluster less than 0 and therefore all vectors should cluster.
>>>>
>>>> However after running a clusterdump I find that vectors (1/3 roughly)
>>>> have been pruned.
>>>>
>>>> Is this a bug, or me just not understanding the new capabilities?
>>>>
>>>> I should also mention I have vectors which are exactly the same (even
>>>> their names), perhaps they are the ones being pruned, is that possible?
>>>>
>>>> Another question if I may: I will eventually want to use the pruning
>>>> capabilities, does the ClusterOutputPostProcessorDriver method (or a
>>>> similar method) have the capability of outputting the pruned vectors into
>>>> a folder?
>>>>
>>>> Thanks! Please let me know if I'm still not being clear enough.
>>>>
>>>> Mattie
>>>>
>>>> -----Original Message-----
>>>> From: Paritosh Ranjan [mailto:[email protected]]
>>>> Sent: Friday, August 17, 2012 11:20 AM
>>>> To: [email protected]
>>>> Subject: Re: Mahout-279/kmeans++
>>>>
>>>> clusterClassificationThreshold is for outlier removal, and this is the
>>>> way it should be used.
>>>>
>>>> Can you provide some more information about your job and the way you are
>>>> calling it?
>>>>
>>>> And if I look at the code, the vector should be clustered even if the
>>>> pdf is 0. The method which decides whether the vector should be assigned
>>>> to a particular cluster or not -
>>>>
>>>> /**
>>>>  * Decides whether the vector should be classified or not based on the
>>>>  * max pdf value of the clusters and threshold value.
>>>>  *
>>>>  * @return whether the vector should be classified or not.
>>>>  */
>>>> private static boolean shouldClassify(Vector pdfPerCluster,
>>>>     Double clusterClassificationThreshold) {
>>>>   return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
>>>> }
>>>>
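
The comparison in the quoted shouldClassify can be tried out in isolation. The sketch below restates it on a plain double[] instead of a Mahout Vector, so the class is hypothetical and it assumes the pdf values are finite and non-negative. Under that assumption a threshold of 0 accepts every point, including one whose pdf is 0 for every cluster, which suggests the missing vectors are being lost somewhere other than this check.

```java
final class ThresholdCheck {
    /** Same comparison as the quoted method: keep the point if its best pdf reaches the threshold. */
    static boolean shouldClassify(double[] pdfPerCluster, double threshold) {
        double max = Double.NEGATIVE_INFINITY;
        for (double pdf : pdfPerCluster) {
            max = Math.max(max, pdf);
        }
        return max >= threshold;
    }

    public static void main(String[] args) {
        double[] pdfs = {0.0, 0.0, 0.0};  // worst case: zero pdf for every cluster
        System.out.println(shouldClassify(pdfs, 0.0));  // true, since 0.0 >= 0.0
        System.out.println(shouldClassify(pdfs, 0.5));  // false, the point would be pruned
    }
}
```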
>>>> On 17-08-2012 20:06, Whitmore, Mattie wrote:
>>>>
>>>>> Hi Ted,
>>>>>
>>>>> Yes this is great!  I hope to start working with this algorithm in the
>>>>> next couple weeks.
>>>>>
>>>>> I have a question about the 0.7 implementation of kmeans and the
>>>>> clusterClassificationThreshold,  I have this value set at zero, but the
>>>>> output is still showing that about 1/3 of my data is not assigned to a
>>>>> cluster in my output.  Am I using this value incorrectly?  I did a
>>>>> kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned despite
>>>>> the clusterClassificationThreshold = 0.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mattie
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Ted Dunning [mailto:[email protected]]
>>>>> Sent: Wednesday, August 15, 2012 5:20 PM
>>>>> To: [email protected]
>>>>> Subject: Re: Mahout-279/kmeans++
>>>>>
>>>>> Mattie,
>>>>>
>>>>> Would this help?
>>>>>
>>>>>
>>>>> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
>>>>>
>>>>> and
>>>>>
>>>>> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
>>>>>
>>>>> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <[email protected]>
>>>>> wrote:
>>>>>> Hi!
>>>>>>
>>>>>> I have been using RandomSeedGenerator, and was hoping it had a patch
>>>>>> like that described in Mahout-279 since I want only 10 vectors out of a
>>>>>> set of more than 100,000,000.  I have been using canopy clustering for
>>>>>> better results, but still need to do a few passes of kmeans to determine
>>>>>> my T, and the random seed does take a long time.
>>>>>>
>>>>>> The comments say that you are working on a kmeans++, I searched around
>>>>>> but couldn't confirm any more information about it.  Is a scalable
>>>>>> kmeans++ in the works? (I know research on the subject is quite new)
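
For readers unfamiliar with the kmeans++ seeding being asked about, here is a minimal in-memory sketch of the textbook scheme: the first seed is drawn uniformly, and each later seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far. This is not the Mahout-279 patch, not BallKmeans, and not a scalable variant; the class and method names are made up for illustration. It makes one pass over the data per seed, which is the cost that matters at 100,000,000 points.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

final class KMeansPlusPlusSeed {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Picks up to k seeds; returns fewer if the data has fewer than k distinct points. */
    static List<double[]> chooseSeeds(List<double[]> points, int k, Random rng) {
        List<double[]> seeds = new ArrayList<>();
        seeds.add(points.get(rng.nextInt(points.size())));
        double[] minDist = new double[points.size()];
        Arrays.fill(minDist, Double.POSITIVE_INFINITY);
        while (seeds.size() < k) {
            // Update each point's squared distance to its nearest seed.
            double[] newest = seeds.get(seeds.size() - 1);
            double total = 0;
            for (int i = 0; i < points.size(); i++) {
                minDist[i] = Math.min(minDist[i], squaredDistance(points.get(i), newest));
                total += minDist[i];
            }
            if (total == 0) {
                break;  // every remaining point coincides with an existing seed
            }
            // Sample an index with probability minDist[i] / total.
            double r = rng.nextDouble() * total;
            int chosen = -1;
            for (int i = 0; i < points.size(); i++) {
                if (minDist[i] > 0) {
                    chosen = i;  // also serves as a fallback against rounding error
                    r -= minDist[i];
                    if (r < 0) {
                        break;
                    }
                }
            }
            seeds.add(points.get(chosen));
        }
        return seeds;
    }
}
```

A point that duplicates an existing seed has squared distance 0 and is never drawn again, so exact duplicates in the input cannot produce duplicate seeds under this scheme.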
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>> Mattie Whitmore
>>>>>> Mathematician/IR&D Software Engineer
>>>>>> HARRIS  Corporation - Advanced Information Solutions
>>>>>> 301.837.5278
>>>>>> [email protected]<mailto:[email protected]>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>
