Yes, unique names will be my next plan -- I just can't kick off that job until after the weekend. If this makes no difference I will also try the noise idea, and I'll follow up about both.
My next question is regarding clusterDump. Is there a way to run this in parallel? I have found some code to run it from Java (the preferable method for me), but I would like the method to be faster and not in memory. Is this a possibility? Or in the works?

Thanks!

-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Wednesday, August 22, 2012 9:09 PM
To: [email protected]
Subject: Re: Mahout-279/kmeans++

Can you also try to provide distinct names to vectors and then cluster? It should not have any effect, but would be good to know the behavior.

On 22-08-2012 23:10, Whitmore, Mattie wrote:
> Yes, I have data which is exactly the same. If I give every vector a name
> which is distinct (albeit the data point is the same as other points in the
> set) will this keep the algorithm from dropping non-distinct vectors/data
> points (which is what I THINK but have yet to verify is what is going on)?
>
> Thanks,
>
> Mattie
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Wednesday, August 22, 2012 1:18 PM
> To: [email protected]
> Subject: Re: Mahout-279/kmeans++
>
> Just an off thought, do you have duplicate input points?
>
> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <[email protected]> wrote:
>
>> ... I have also verified by running canopy multiple times with 0.5 and 0.7
>> that there is a continual discrepancy between the two clustering versions.
>> The max/min vectors in a cluster using 0.5 is: 19192158/215 and 0.7 is:
>> 921998/5. They should not necessarily be the same, since I am using canopy
>> clustering to find initial centroids, however I would think they would have
>> the same sum, which they do not (45901885 vs 1599154).
>>
>> Here is the method I am running:
>>
>> public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
>>         throws IOException, InterruptedException, ClassNotFoundException,
>>         InstantiationException, IllegalAccessException {
>>
>>     Configuration conf = new Configuration();
>>
>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>>
>>     Path vectorsFolder = new Path(outputDir, "vectors");
>>     Path clusterCenters = new Path(outputDir + "-canopy/centriods");
>>     Path clusterOutput = new Path(outputDir + "-canopy/clusters");
>>
>>     // create canopies instead of initial vectors
>>     CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
>>             Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
>>
>>     // kmeans cluster operation
>>     KMeansDriver.run(conf, vectorsFolder,
>>             new Path(clusterCenters, "clusters-0-final/part-r-00000"),
>>             clusterOutput, measure, 0.01, Integer.parseInt(itMax), true, 0.0, false);
>>
>>     // post process by putting completed clusters into their own files.
>>     ClusterOutputPostProcessorDriver.run(clusterOutput,
>>             new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
>>
>> }
>>
>> What do you think?
>>
>> On another but related note: Is there a plan to have a method -- say
>> ClusterOutputPostProcessorDriver -- which when run outputs the vectors
>> within clusters as well as a separate folder containing pruned outliers?
>>
>> Thanks!
>>
>> Mattie
>>
>> -----Original Message-----
>> From: Paritosh Ranjan [mailto:[email protected]]
>> Sent: Friday, August 17, 2012 12:16 PM
>> To: [email protected]
>> Subject: Re: Mahout-279/kmeans++
>>
>> The clustering algorithm has also changed internally. So, expect the
>> results to be different (and better).
>>
>> I can think of one reason for this behavior. Maybe lots of clusters are
>> having only one vector inside it, and, AFAIK, clusterdumper will not
>> output any cluster with single vector.
>> So, I think, it's clusterdumper which is doing the invisible "pruning"
>> (by not outputting clusters with single vectors).
>>
>> Can you cross check the output once with ClusterOutputPostProcessorDriver?
>>
>> No, no tool can output the pruned vectors. The only way to see all
>> vectors assigned to any cluster is to set clusterClassificationThreshold
>> to 0.
>>
>> If you still face the problem, then please provide the parameters with
>> which you are calling kmeans.
>>
>> Regarding "I should also mention I have vectors which are exactly the
>> same (even their names), perhaps they are the ones being pruned, is that
>> possible?"
>>
>> The name of the vector has nothing to do with clustering, I am not sure
>> whether it will have any effect when clusterdumper is in action. So,
>> crosschecking with ClusterOutputPostProcessorDriver will answer this.
>>
>> Good luck.
>> Paritosh
>>
>> On 17-08-2012 21:07, Whitmore, Mattie wrote:
>>> Sure, I have a dataset which I wish to cluster using Kmeans. Previously
>>> (v0.5) when I did a clusterdump the total amount of vectors within the
>>> resultant clusters was the same as the total amount fed to the algorithm.
>>> I wish this to be the case when clustering with v0.7. The only change in
>>> the algorithm is clusterClassificationThreshold, I set this value to be 0
>>> so that it will in fact cluster all vectors in the dataset.
>>> My logic here was no vector should have a probability of being in some
>>> cluster less than 0 and therefore all vectors should cluster.
>>> However after running a clusterdump I find that vectors (1/3 roughly)
>>> have been pruned.
>>> Is this a bug, or me just not understanding the new capabilities?
>>>
>>> I should also mention I have vectors which are exactly the same (even
>>> their names), perhaps they are the ones being pruned, is that possible?
>>> Another question if I may: I will eventually want to use the pruning
>>> capabilities, does the ClusterOutputPostProcessorDriver method (or a
>>> similar method) have the capability of outputting the pruned vectors into a
>>> folder?
>>> Thanks! Please let me know if I'm still not being clear enough.
>>>
>>> Mattie
>>>
>>> -----Original Message-----
>>> From: Paritosh Ranjan [mailto:[email protected]]
>>> Sent: Friday, August 17, 2012 11:20 AM
>>> To: [email protected]
>>> Subject: Re: Mahout-279/kmeans++
>>>
>>> clusterClassificationThreshold is for outlier removal, and this is the
>>> way it should be used.
>>> Can you provide some more information about your job and the way you are
>>> calling it?
>>> And if I look at the code, the vector should be clustered even if the
>>> pdf is 0. The method which decides whether the vector should be assigned to
>>> a particular cluster or not -
>>> /**
>>>  * Decides whether the vector should be classified or not based on the max pdf
>>>  * value of the clusters and threshold value.
>>>  *
>>>  * @return whether the vector should be classified or not.
>>>  */
>>> private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
>>>   return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
>>> }
>>>
>>> On 17-08-2012 20:06, Whitmore, Mattie wrote:
>>>
>>>> Hi Ted,
>>>>
>>>> Yes this is great! I hope to start working with this algorithm in the
>>>> next couple weeks.
>>>> I have a question about the 0.7 implementation of kmeans and the
>>>> clusterClassificationThreshold, I have this value set at zero, but the
>>>> output is still showing that about 1/3 of my data is not assigned to a
>>>> cluster in my output. Am I using this value incorrectly? I did a
>>>> kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned despite
>>>> the clusterClassificationThreshold = 0.
>>>>
>>>> Thanks,
>>>>
>>>> Mattie
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Ted Dunning [mailto:[email protected]]
>>>> Sent: Wednesday, August 15, 2012 5:20 PM
>>>> To: [email protected]
>>>> Subject: Re: Mahout-279/kmeans++
>>>>
>>>> Mattie,
>>>>
>>>> Would this help?
>>>>
>>>> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
>>>>
>>>> and
>>>>
>>>> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
>>>>
>>>> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <[email protected]> wrote:
>>>>> Hi!
>>>>>
>>>>> I have been using RandomSeedGenerator, and was hoping it had a patch like
>>>>> that described in Mahout-279 since I want only 10 vectors out of a set of
>>>>> more than 100,000,000. I have been using canopy clustering for better
>>>>> results, but still need to do a few passes of kmeans to determine my T, and
>>>>> the random seed does take a long time.
>>>>>
>>>>> The comments say that you are working on a kmeans++, I searched around but
>>>>> couldn't confirm any more information about it. Is a scalable kmeans++ in
>>>>> the works? (I know research on the subject is quite new)
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>> Mattie Whitmore
>>>>> Mathematician/IR&D Software Engineer
>>>>> HARRIS Corporation - Advanced Information Solutions
>>>>> 301.837.5278
>>>>> [email protected]<mailto:[email protected]>
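
For reference, the shouldClassify check quoted earlier in the thread can be exercised on its own. The following is only a minimal sketch under assumptions: it uses a plain double[] in place of Mahout's Vector, and the class name ThresholdCheckSketch, the helper countClassified, and the sample pdf values are all made up for illustration. It is not the Mahout implementation. It shows that with a threshold of 0 the comparison itself passes even for a vector whose pdf is 0 for every cluster, which would be consistent with the suggestion that the missing vectors are dropped somewhere other than this check.

```java
import java.util.Arrays;

// Hypothetical stand-in for the quoted Mahout check; not Mahout code.
final class ThresholdCheckSketch {

    // Same comparison as the quoted shouldClassify: a vector is kept when
    // its largest per-cluster pdf is at least the threshold.
    static boolean shouldClassify(double[] pdfPerCluster, double threshold) {
        double max = Arrays.stream(pdfPerCluster).max().orElse(Double.NEGATIVE_INFINITY);
        return max >= threshold;
    }

    // Counts how many vectors (one row of pdf values each) pass the check.
    static long countClassified(double[][] pdfs, double threshold) {
        return Arrays.stream(pdfs).filter(p -> shouldClassify(p, threshold)).count();
    }

    public static void main(String[] args) {
        // Made-up pdf values for three vectors against two clusters.
        double[][] pdfs = {
            {0.9, 0.1},
            {0.0, 0.0},   // pdf of 0 everywhere still satisfies 0 >= 0
            {0.3, 0.2},
        };
        System.out.println(countClassified(pdfs, 0.0)); // prints 3
        System.out.println(countClassified(pdfs, 0.5)); // prints 1
    }
}
```

In this sketch, the only input that fails at threshold 0 is one whose maximum is NaN, since any comparison with NaN is false in Java. Whether that can happen in the real job is not something the thread establishes.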
