Yes, I do have data points that are exactly the same. If I give every vector a distinct name (even though the data point itself is identical to other points in the set), will that keep the algorithm from dropping the non-distinct vectors/data points? (That dropping is what I THINK is going on, but I have yet to verify it.)
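For reference, a distinct name can be attached by wrapping each vector in a NamedVector before the input SequenceFile is written. The sketch below is only illustrative: it assumes the usual Text/VectorWritable input format for k-means, the DistinctNameWriter class and "v-" keys are made up for the example, and whether distinct names actually stop the dropping is exactly what still needs verifying:

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class DistinctNameWriter {

      // Writes each vector under a unique name ("v-0", "v-1", ...) so that
      // duplicate data points still carry distinct identities in the input file.
      public static void writeWithDistinctNames(Configuration conf, Path output,
                                                List<Vector> vectors) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, output,
            Text.class, VectorWritable.class);
        try {
          int i = 0;
          for (Vector v : vectors) {
            String name = "v-" + i++;                      // distinct name per point
            NamedVector named = new NamedVector(v, name);  // wrap; data is unchanged
            writer.append(new Text(name), new VectorWritable(named));
          }
        } finally {
          writer.close();
        }
      }
    }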
Thanks,

Mattie

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Wednesday, August 22, 2012 1:18 PM
To: [email protected]
Subject: Re: Mahout-279/kmeans++

Just an off thought: do you have duplicate input points?

On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <[email protected]> wrote:

> ... I have also verified, by running canopy multiple times with 0.5 and
> 0.7, that there is a continual discrepancy between the two clustering
> versions. The max/min vectors in a cluster using 0.5 are 19192158/215,
> and using 0.7 they are 921998/5. They should not necessarily be the
> same, since I am using canopy clustering to find the initial centroids;
> however, I would expect them to have the same sum, which they do not
> (45901885 vs. 1599154).
>
> Here is the method I am running:
>
>     public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
>             throws IOException, InterruptedException, ClassNotFoundException,
>             InstantiationException, IllegalAccessException {
>
>         Configuration conf = new Configuration();
>
>         DistanceMeasure measure = new EuclideanDistanceMeasure();
>
>         Path vectorsFolder = new Path(outputDir, "vectors");
>         Path clusterCenters = new Path(outputDir + "-canopy/centriods");
>         Path clusterOutput = new Path(outputDir + "-canopy/clusters");
>
>         // create canopies instead of initial vectors
>         CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
>                 Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
>
>         // kmeans cluster operation
>         KMeansDriver.run(conf, vectorsFolder,
>                 new Path(clusterCenters, "clusters-0-final/part-r-00000"),
>                 clusterOutput, measure, 0.01, Integer.parseInt(itMax), true, 0.0, false);
>
>         // post process by putting completed clusters into their own files
>         ClusterOutputPostProcessorDriver.run(clusterOutput,
>                 new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
>     }
>
> What do you think?
>
> On another but related note: is there a plan to have a method -- say,
> ClusterOutputPostProcessorDriver -- which, when run, outputs the vectors
> within clusters as well as a separate folder containing the pruned
> outliers?
>
> Thanks!
>
> Mattie
>
> -----Original Message-----
> From: Paritosh Ranjan [mailto:[email protected]]
> Sent: Friday, August 17, 2012 12:16 PM
> To: [email protected]
> Subject: Re: Mahout-279/kmeans++
>
> The clustering algorithm has also changed internally, so expect the
> results to be different (and better).
>
> I can think of one reason for this behavior: maybe lots of clusters have
> only one vector inside them, and, AFAIK, clusterdumper will not output
> any cluster with a single vector. So I think it's clusterdumper which is
> doing the invisible "pruning" (by not outputting clusters with single
> vectors).
>
> Can you cross-check the output once with ClusterOutputPostProcessorDriver?
>
> No, no tool can output the pruned vectors. The only way to see all
> vectors assigned to any cluster is to set clusterClassificationThreshold
> to 0.
>
> If you still face the problem, then please provide the parameters with
> which you are calling kmeans.
>
> Regarding "I should also mention I have vectors which are exactly the
> same (even their names), perhaps they are the ones being pruned, is that
> possible?":
>
> The name of the vector has nothing to do with clustering. I am not sure
> whether it has any effect when clusterdumper is in action, so
> cross-checking with ClusterOutputPostProcessorDriver will answer this.
>
> Good luck.
>
> Paritosh
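A rough sketch of the cross-check suggested above -- counting how many vectors actually land in each cluster so the totals can be compared against the input -- assuming the clustered points are written as Hadoop SequenceFiles under the k-means output (for example clusterOutput/clusteredPoints, or the per-cluster folders produced by ClusterOutputPostProcessorDriver). The key and value classes are instantiated reflectively so the same count works regardless of the exact record types, and the ClusterSizeCheck class name is made up for the example:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class ClusterSizeCheck {

      // Tallies how many records each cluster key received, plus the grand total,
      // so the sum can be compared with the number of input vectors.
      public static void countVectorsPerCluster(Configuration conf, Path clusteredPoints)
          throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Map<String, Long> counts = new HashMap<String, Long>();
        long total = 0;
        for (FileStatus status : fs.listStatus(clusteredPoints)) {
          if (!status.getPath().getName().startsWith("part-")) {
            continue; // skip _SUCCESS, _logs and similar entries
          }
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
          try {
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
              String clusterId = key.toString();
              Long c = counts.get(clusterId);
              counts.put(clusterId, c == null ? 1L : c + 1L);
              total++;
            }
          } finally {
            reader.close();
          }
        }
        System.out.println("clusters: " + counts.size() + ", total vectors: " + total);
      }
    }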
>
> On 17-08-2012 21:07, Whitmore, Mattie wrote:
> > Sure, I have a dataset which I wish to cluster using k-means. Previously
> > (v0.5), when I did a clusterdump, the total number of vectors within the
> > resultant clusters was the same as the total number fed to the algorithm.
> > I want this to be the case when clustering with v0.7. The only change in
> > the algorithm is clusterClassificationThreshold; I set this value to 0 so
> > that it will in fact cluster all vectors in the dataset.
> >
> > My logic here was that no vector should have a probability of being in
> > some cluster less than 0, and therefore all vectors should cluster.
> >
> > However, after running a clusterdump I find that roughly 1/3 of the
> > vectors have been pruned.
> >
> > Is this a bug, or am I just not understanding the new capabilities?
> >
> > I should also mention I have vectors which are exactly the same (even
> > their names); perhaps they are the ones being pruned. Is that possible?
> >
> > Another question, if I may: I will eventually want to use the pruning
> > capabilities. Does the ClusterOutputPostProcessorDriver method (or a
> > similar method) have the capability of outputting the pruned vectors
> > into a folder?
> >
> > Thanks! Please let me know if I'm still not being clear enough.
> >
> > Mattie
> >
> > -----Original Message-----
> > From: Paritosh Ranjan [mailto:[email protected]]
> > Sent: Friday, August 17, 2012 11:20 AM
> > To: [email protected]
> > Subject: Re: Mahout-279/kmeans++
> >
> > clusterClassificationThreshold is for outlier removal, and this is the
> > way it should be used.
> >
> > Can you provide some more information about your job and the way you are
> > calling it?
> >
> > And if I look at the code, the vector should be clustered even if the
> > pdf is 0. The method which decides whether the vector should be assigned
> > to a particular cluster or not is:
> >
> >     /**
> >      * Decides whether the vector should be classified or not based on the max pdf
> >      * value of the clusters and the threshold value.
> >      *
> >      * @return whether the vector should be classified or not.
> >      */
> >     private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
> >       return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
> >     }
> >
> > On 17-08-2012 20:06, Whitmore, Mattie wrote:
> >
> >> Hi Ted,
> >>
> >> Yes, this is great! I hope to start working with this algorithm in the
> >> next couple of weeks.
> >>
> >> I have a question about the 0.7 implementation of k-means and the
> >> clusterClassificationThreshold. I have this value set at zero, but the
> >> output still shows that about 1/3 of my data is not assigned to a
> >> cluster. Am I using this value incorrectly? I ran KMeansDriver.run with
> >> both the 0.5 and 0.7 APIs, and the data was pruned despite
> >> clusterClassificationThreshold = 0.
> >>
> >> Thanks,
> >>
> >> Mattie
> >>
> >> -----Original Message-----
> >> From: Ted Dunning [mailto:[email protected]]
> >> Sent: Wednesday, August 15, 2012 5:20 PM
> >> To: [email protected]
> >> Subject: Re: Mahout-279/kmeans++
> >>
> >> Mattie,
> >>
> >> Would this help?
> >>
> >> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
> >>
> >> and
> >>
> >> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
> >>
> >> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <[email protected]> wrote:
> >>
> >>> Hi!
> >>>
> >>> I have been using RandomSeedGenerator, and was hoping it had a patch
> >>> like that described in Mahout-279, since I want only 10 vectors out of
> >>> a set of more than 100,000,000. I have been using canopy clustering for
> >>> better results, but I still need to do a few passes of k-means to
> >>> determine my T, and the random seed does take a long time.
> >>>
> >>> The comments say that you are working on a kmeans++; I searched around
> >>> but couldn't confirm any more information about it. Is a scalable
> >>> kmeans++ in the works? (I know research on the subject is quite new.)
> >>>
> >>> Thanks!
> >>>
> >>> Mattie Whitmore
> >>> Mathematician/IR&D Software Engineer
> >>> HARRIS Corporation - Advanced Information Solutions
> >>> 301.837.5278
> >>> [email protected]
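For reference on the RandomSeedGenerator path mentioned above, here is a minimal sketch of seeding k-means with k = 10 randomly sampled vectors. It assumes the 0.7-era signature RandomSeedGenerator.buildRandom(conf, input, output, k, measure) returning the path of the written seed clusters; the class and directory names are illustrative only, and clusterClassificationThreshold is kept at 0.0 so no vector is pruned:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;
    import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
    import org.apache.mahout.common.distance.DistanceMeasure;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

    public class RandomSeedKMeans {

      // Samples k input vectors as initial centroids, then runs k-means on them.
      public static void run(String outputDir, int k, int itMax) throws Exception {
        Configuration conf = new Configuration();
        DistanceMeasure measure = new EuclideanDistanceMeasure();

        Path vectorsFolder = new Path(outputDir, "vectors");
        Path seeds = new Path(outputDir + "-random/seeds");
        Path clusterOutput = new Path(outputDir + "-random/clusters");

        // pick k input vectors at random as the initial cluster centers
        Path seedClusters = RandomSeedGenerator.buildRandom(conf, vectorsFolder, seeds, k, measure);

        // k-means with clusterClassificationThreshold = 0.0 so nothing is pruned
        KMeansDriver.run(conf, vectorsFolder, seedClusters, clusterOutput, measure,
            0.01, itMax, true, 0.0, false);
      }
    }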
