I re-ran the canopy-kmeans analytic, this time with unique vector names, and lost even more points in the resulting clusters (total points in the clusters = 745490, vs. previously 1599154 for v0.7 and 45901885 for v0.5). The total number of data points fed into the algorithm is 53365862 -- so even v0.5 is missing 14% of the data.
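As a quick sanity check on those totals, here is a minimal Java sketch that computes what fraction of the 53365862 input points is absent from each run's clusters. The class and method names (CoverageCheck, missingFraction) are made up for illustration and are not part of Mahout; the counts are only the ones quoted above.

```java
// Illustrative helper only: CoverageCheck and missingFraction are hypothetical
// names, not Mahout APIs. The counts are the ones quoted in this message.
final class CoverageCheck {

    /** Fraction of input points that do not appear in any output cluster. */
    static double missingFraction(long clusteredPoints, long totalPoints) {
        if (totalPoints <= 0 || clusteredPoints < 0 || clusteredPoints > totalPoints) {
            throw new IllegalArgumentException(
                    "counts must satisfy 0 <= clustered <= total and total > 0");
        }
        return (double) (totalPoints - clusteredPoints) / totalPoints;
    }

    public static void main(String[] args) {
        long total = 53365862L;
        String[] labels = {"v0.5", "v0.7", "unique-names run"};
        long[] clustered = {45901885L, 1599154L, 745490L};

        // Print the share of input points missing from each run's clusters.
        for (int i = 0; i < labels.length; i++) {
            double missing = missingFraction(clustered[i], total);
            System.out.printf("%-18s missing %.1f%%%n", labels[i], 100.0 * missing);
        }
    }
}
```

For the v0.5 run this comes to roughly 14%, which matches the figure above.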
I'm thinking that if I weight these dense vectors with a weight equal to the number of identical vectors in the set, that could work -- Ball Kmeans seems to do this. Is this a correct interpretation of how to use weights in Ball Kmeans, and is Ball Kmeans ready enough to be used/tested?

Thanks

-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Thursday, August 23, 2012 12:34 PM
To: [email protected]
Subject: Re: Mahout-279/kmeans++

clusterDump works in memory, and there are no plans yet to make it distributed (or not in memory). See this:
https://issues.apache.org/jira/browse/MAHOUT-940

clusterpp has an option for distributed processing, so you can process any amount of data with it.

On 23-08-2012 19:55, Whitmore, Mattie wrote:
> Yes, unique names will be my next plan -- I just can't kick off that job until after the weekend. If this makes no difference I will also try the noise idea, and I'll follow up about both.
>
> My next question is regarding clusterDump. Is there a way to run this in parallel? I have found some code to execute in java (the preferable method for me) but I would like the method to be faster and not in memory. Is this a possibility? Or in the works?
>
> Thanks!
>
> -----Original Message-----
> From: Paritosh Ranjan [mailto:[email protected]]
> Sent: Wednesday, August 22, 2012 9:09 PM
> To: [email protected]
> Subject: Re: Mahout-279/kmeans++
>
> Can you also try to provide distinct names to vectors and then cluster? It should not have any effect, but would be good to know the behavior.
>
> On 22-08-2012 23:10, Whitmore, Mattie wrote:
>> Yes, I have data which is exactly the same. If I give every vector a name which is distinct (albeit the data point is the same as other points in the set) will this keep the algorithm from dropping non-distinct vectors/data points (which is what I THINK but have yet to verify is what is going on)?
>>
>> Thanks,
>>
>> Mattie
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:[email protected]]
>> Sent: Wednesday, August 22, 2012 1:18 PM
>> To: [email protected]
>> Subject: Re: Mahout-279/kmeans++
>>
>> Just an off thought, do you have duplicate input points?
>>
>> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <[email protected]> wrote:
>>
>>> ... I have also verified by running canopy multiple times with 0.5 and 0.7 that there is a continual discrepancy between the two clustering versions. The max/min vectors in a cluster using 0.5 is: 19192158/215 and 0.7 is: 921998/5. They should not necessarily be the same, since I am using canopy clustering to find initial centroids, however I would think they would have the same sum, which they do not (45901885 vs 1599154).
>>>
>>> Here is the method I am running:
>>>
>>> public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
>>>         throws IOException, InterruptedException, ClassNotFoundException,
>>>         InstantiationException, IllegalAccessException {
>>>
>>>     Configuration conf = new Configuration();
>>>
>>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>>>
>>>     Path vectorsFolder = new Path(outputDir, "vectors");
>>>     Path clusterCenters = new Path(outputDir + "-canopy/centriods");
>>>     Path clusterOutput = new Path(outputDir + "-canopy/clusters");
>>>
>>>     // create canopies instead of initial vectors
>>>     CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
>>>             Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
>>>
>>>     // kmeans cluster operation
>>>     KMeansDriver.run(conf, vectorsFolder,
>>>             new Path(clusterCenters, "clusters-0-final/part-r-00000"),
>>>             clusterOutput, measure, 0.01, Integer.parseInt(itMax), true, 0.0, false);
>>>
>>>     // post process by putting completed clusters into their own files.
>>>     ClusterOutputPostProcessorDriver.run(clusterOutput,
>>>             new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
>>>
>>> }
>>>
>>> What do you think?
>>>
>>> On another but related note: Is there a plan to have a method -- say ClusterOutputPostProcessorDriver -- which when run outputs the vectors within clusters as well as a separate folder containing pruned outliers?
>>>
>>> Thanks!
>>>
>>> Mattie
>>>
>>> -----Original Message-----
>>> From: Paritosh Ranjan [mailto:[email protected]]
>>> Sent: Friday, August 17, 2012 12:16 PM
>>> To: [email protected]
>>> Subject: Re: Mahout-279/kmeans++
>>>
>>> The clustering algorithm has also changed internally. So, expect the results to be different (and better).
>>>
>>> I can think of one reason for this behavior. Maybe lots of clusters are having only one vector inside it, and, AFAIK, clusterdumper will not output any cluster with single vector.
>>> So, I think, it's clusterdumper which is doing the invisible "pruning" (by not outputting clusters with single vectors).
>>>
>>> Can you cross check the output once with ClusterOutputPostProcessorDriver?
>>>
>>> No, no tool can output the pruned vectors. The only way to see all vectors assigned to any cluster is to set clusterClassificationThreshold to 0.
>>>
>>> If you still face the problem, then please provide the parameters with which you are calling kmeans.
>>>
>>> Regarding "I should also mention I have vectors which are exactly the same (even their names), perhaps they are the ones being pruned, is that possible?"
>>>
>>> The name of the vector has nothing to do with clustering, I am not sure whether it will have any effect when clusterdumper is in action. So, crosschecking with ClusterOutputPostProcessorDriver will answer this.
>>>
>>> Good luck.
>>> Paritosh
>>>
>>> On 17-08-2012 21:07, Whitmore, Mattie wrote:
>>>> Sure, I have a dataset which I wish to cluster using Kmeans.
>>>> Previously (v0.5) when I did a clusterdump the total amount of vectors within the resultant clusters was the same as the total amount fed to the algorithm. I wish this to be the case when clustering with v0.7. The only change in the algorithm is clusterClassificationThreshold, I set this value to be 0 so that it will in fact cluster all vectors in the dataset.
>>>> My logic here was no vector should have a probability of being in some cluster less than 0 and therefore all vectors should cluster.
>>>> However after running a clusterdump I find that vectors (1/3 roughly) have been pruned.
>>>> Is this a bug, or me just not understanding the new capabilities?
>>>>
>>>> I should also mention I have vectors which are exactly the same (even their names), perhaps they are the ones being pruned, is that possible?
>>>> Another question if I may: I will eventually want to use the pruning capabilities, does the ClusterOutputPostProcessorDriver method (or a similar method) have the capability of outputting the pruned vectors into a folder?
>>>> Thanks! Please let me know if I'm still not being clear enough.
>>>>
>>>> Mattie
>>>>
>>>> -----Original Message-----
>>>> From: Paritosh Ranjan [mailto:[email protected]]
>>>> Sent: Friday, August 17, 2012 11:20 AM
>>>> To: [email protected]
>>>> Subject: Re: Mahout-279/kmeans++
>>>>
>>>> clusterClassificationThreshold is for outlier removal, and this is the way it should be used.
>>>> Can you provide some more information about your job and the way you are calling it?
>>>> And if I look at the code, the vector should be clustered even if the pdf is 0. The method which decides whether the vector should be assigned to a particular cluster or not -
>>>> /**
>>>>  * Decides whether the vector should be classified or not based on the max pdf
>>>>  * value of the clusters and threshold value.
>>>>  *
>>>>  * @return whether the vector should be classified or not.
>>>>  */
>>>> private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
>>>>     return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
>>>> }
>>>>
>>>> On 17-08-2012 20:06, Whitmore, Mattie wrote:
>>>>
>>>>> Hi Ted,
>>>>>
>>>>> Yes this is great! I hope to start working with this algorithm in the next couple weeks.
>>>>> I have a question about the 0.7 implementation of kmeans and the clusterClassificationThreshold, I have this value set at zero, but the output is still showing that about 1/3 of my data is not assigned to a cluster in my output. Am I using this value incorrectly? I did a kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned despite the clusterClassificationThreshold = 0.
>>>>> Thanks,
>>>>>
>>>>> Mattie
>>>>>
>>>>> -----Original Message-----
>>>>> From: Ted Dunning [mailto:[email protected]]
>>>>> Sent: Wednesday, August 15, 2012 5:20 PM
>>>>> To: [email protected]
>>>>> Subject: Re: Mahout-279/kmeans++
>>>>>
>>>>> Mattie,
>>>>>
>>>>> Would this help?
>>>>>
>>>>> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
>>>>>
>>>>> and
>>>>>
>>>>> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
>>>>>
>>>>> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <[email protected]> wrote:
>>>>>> Hi!
>>>>>>
>>>>>> I have been using RandomSeedGenerator, and was hoping it had a patch like that described in Mahout-279 since I want only 10 vectors out of a set of more than 100,000,000. I have been using canopy clustering for better results, but still need to do a few passes of kmeans to determine my T, and the random seed does take a long time.
>>>>>>
>>>>>> The comments say that you are working on a kmeans++, I searched around but couldn't confirm any more information about it. Is a scalable kmeans++ in the works?
>>>>>> (I know research on the subject is quite new)
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Mattie Whitmore
>>>>>> Mathematician/IR&D Software Engineer
>>>>>> HARRIS Corporation - Advanced Information Solutions
>>>>>> 301.837.5278
>>>>>> [email protected]
