The names are outside the vector or matrix data. Vectors and matrices store numbers, not strings.
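For illustration, here is a minimal sketch of that separation in plain Java. NamedRow and NamedRowsDemo are hypothetical names made up for this example and are not Mahout classes; in Mahout itself, the thread below points to NamedVector for the same purpose. The sketch only shows that the label sits next to the numeric data rather than in it, so two rows can carry identical numbers and still be told apart.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of Mahout:
// pairs a label with a purely numeric payload.
final class NamedRow {
    private final String name;      // label, stored outside the numbers
    private final double[] values;  // numeric data only

    NamedRow(String name, double[] values) {
        this.name = name;
        this.values = values.clone(); // defensive copy of the caller's array
    }

    String name() {
        return name;
    }

    double[] values() {
        return values.clone(); // callers cannot mutate the stored data
    }
}

// Hypothetical demo holder, also not part of Mahout.
final class NamedRowsDemo {
    // Builds a small collection of labeled rows. A List is an Iterable,
    // so it can be walked row by row without building a matrix.
    static List<NamedRow> sampleRows() {
        List<NamedRow> rows = new ArrayList<>();
        rows.add(new NamedRow("row-a", new double[] {1.0, 2.0}));
        // Same numbers as row-a, but a different label.
        rows.add(new NamedRow("row-b", new double[] {1.0, 2.0}));
        return rows;
    }

    public static void main(String[] args) {
        for (NamedRow row : sampleRows()) {
            System.out.println(row.name() + " has " + row.values().length + " values");
        }
    }
}
```

Nothing here touches the clustering itself; it is only meant to show why a "name column" inside the matrix is not needed.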
On Thu, Aug 30, 2012 at 3:25 PM, Whitmore, Mattie <[email protected]> wrote:

> I was thinking that one column would be the name for each row -- like a
> "name column" for each vector in a matrix. I probably mistyped somewhere
> in there :). Would the algorithm implement better as if given a matrix?
> I'm thinking of work done on extending matrix multiplication to tensor
> multiplication I suppose. That is neither here nor there for this current
> project.
>
> Thanks for the guidance!
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Thursday, August 30, 2012 2:52 PM
> To: [email protected]
> Subject: Re: Mahout-279/kmeans++
>
> But columns aren't what I would expect you to want labeled. I think that
> row labels might be nicer. Happily, each named vector has a name for the
> entire vector as well.
>
> On Thu, Aug 30, 2012 at 2:48 PM, Ted Dunning <[email protected]> wrote:
>
> > The input to the BallKmeans is actually not a matrix. It is an
> > Iterable<MatrixSlice>. This can be a matrix since a matrix implements
> > this.
> >
> > So one way to deal with this is to build your own Iterable and put
> > NamedVectors into it. NamedVectors retain labels as you want.
> >
> > On Thu, Aug 30, 2012 at 12:53 PM, Whitmore, Mattie <[email protected]> wrote:
> >
> >> I need to be using the matrices for BallKmeans. Can matrices be named?
> >> By this I mean can I assign a column of my matrix to be the "name" of
> >> each row?
> >>
> >> Thanks!
> >>
> >> -----Original Message-----
> >> From: Ted Dunning [mailto:[email protected]]
> >> Sent: Wednesday, August 29, 2012 12:17 PM
> >> To: [email protected]
> >> Subject: Re: Mahout-279/kmeans++
> >>
> >> Yes. The ball k-means implementation does use weights to indicate
> >> multiple vectors.
> >>
> >> The implementation is definitely ready to test.
> >> I would be slightly surprised if it has absolutely zero issues, but your
> >> feedback on such issues would help them get fixed much sooner than others.
> >>
> >> On Wed, Aug 29, 2012 at 10:37 AM, Whitmore, Mattie <[email protected]> wrote:
> >>
> >> > I re-ran the canopy-kmeans analytic, this time with unique names, I lost
> >> > more points in the resulting clusters ( total points in the clusters =
> >> > 745490, vs previously: 1599154 for v0.7 and 45901885 for v0.5). The
> >> > total number of data points fed into the algorithm is 53365862 -- so
> >> > even v0.5 is missing 14% of the data.
> >> >
> >> > I'm thinking if I weight these dense vectors with a weight equal to the
> >> > number of identical vectors in the set that could work -- Ball Kmeans
> >> > seems to do this. Is this a correct interpretation of how to use weights
> >> > in Ball Kmeans, and is Ball Kmeans ready enough to be used/tested?
> >> >
> >> > Thanks
> >> >
> >> > -----Original Message-----
> >> > From: Paritosh Ranjan [mailto:[email protected]]
> >> > Sent: Thursday, August 23, 2012 12:34 PM
> >> > To: [email protected]
> >> > Subject: Re: Mahout-279/kmeans++
> >> >
> >> > clusterDump works in memory, and there are no plans yet to make it
> >> > distributed ( or not in memory ). See this
> >> > https://issues.apache.org/jira/browse/MAHOUT-940
> >> >
> >> > clusterpp has an option for distributed processing, so you can process
> >> > any amount of data with it.
> >> >
> >> > On 23-08-2012 19:55, Whitmore, Mattie wrote:
> >> > > Yes, unique names will be my next plan -- I just can't kick off that
> >> > > job until after the weekend. If this makes no difference I will also
> >> > > try the noise idea, and I'll follow up about both.
> >> > >
> >> > > My next question is regarding clusterDump. Is there a way to run this
> >> > > in parallel?
> >> > > I have found some code to execute in Java (the preferable method for
> >> > > me) but I would like the method to be faster and not in memory. Is
> >> > > this a possibility? Or in the works?
> >> > >
> >> > > Thanks!
> >> > >
> >> > > -----Original Message-----
> >> > > From: Paritosh Ranjan [mailto:[email protected]]
> >> > > Sent: Wednesday, August 22, 2012 9:09 PM
> >> > > To: [email protected]
> >> > > Subject: Re: Mahout-279/kmeans++
> >> > >
> >> > > Can you also try to provide distinct names to vectors and then cluster?
> >> > > It should not have any effect, but would be good to know the behavior.
> >> > >
> >> > > On 22-08-2012 23:10, Whitmore, Mattie wrote:
> >> > >> Yes, I have data which is exactly the same. If I give every vector a
> >> > >> name which is distinct (albeit the data point is the same as other
> >> > >> points in the set) will this keep the algorithm from dropping
> >> > >> non-distinct vectors/data points (which is what I THINK but have yet
> >> > >> to verify is what is going on)?
> >> > >>
> >> > >> Thanks,
> >> > >>
> >> > >> Mattie
> >> > >>
> >> > >> -----Original Message-----
> >> > >> From: Ted Dunning [mailto:[email protected]]
> >> > >> Sent: Wednesday, August 22, 2012 1:18 PM
> >> > >> To: [email protected]
> >> > >> Subject: Re: Mahout-279/kmeans++
> >> > >>
> >> > >> Just an off thought, do you have duplicate input points?
> >> > >>
> >> > >> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <[email protected]> wrote:
> >> > >>
> >> > >>> ... I have also verified by running canopy multiple times with 0.5
> >> > >>> and 0.7 that there is a continual discrepancy between the two
> >> > >>> clustering versions. The max/min vectors in a cluster using 0.5 is:
> >> > >>> 19192158/215 and 0.7 is: 921998/5.
> >> > >>> They should not necessarily be the same, since I am using canopy
> >> > >>> clustering to find initial centroids, however I would think they
> >> > >>> would have the same sum, which they do not (45901885 vs 1599154).
> >> > >>>
> >> > >>> Here is the method I am running:
> >> > >>>
> >> > >>> public static void KmeansClusteringCanopy(String outputDir, String T,
> >> > >>>     String itMax)
> >> > >>>     throws IOException, InterruptedException, ClassNotFoundException,
> >> > >>>            InstantiationException, IllegalAccessException {
> >> > >>>
> >> > >>>   Configuration conf = new Configuration();
> >> > >>>
> >> > >>>   DistanceMeasure measure = new EuclideanDistanceMeasure();
> >> > >>>
> >> > >>>   Path vectorsFolder = new Path(outputDir, "vectors");
> >> > >>>   Path clusterCenters = new Path(outputDir + "-canopy/centriods");
> >> > >>>   Path clusterOutput = new Path(outputDir + "-canopy/clusters");
> >> > >>>
> >> > >>>   // create canopies instead of initial vectors
> >> > >>>   CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
> >> > >>>       Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
> >> > >>>
> >> > >>>   // kmeans cluster operation
> >> > >>>   KMeansDriver.run(conf, vectorsFolder,
> >> > >>>       new Path(clusterCenters, "clusters-0-final/part-r-00000"),
> >> > >>>       clusterOutput, measure, 0.01, Integer.parseInt(itMax), true,
> >> > >>>       0.0, false);
> >> > >>>
> >> > >>>   // post process by putting completed clusters into their own files.
> >> > >>>   ClusterOutputPostProcessorDriver.run(clusterOutput,
> >> > >>>       new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
> >> > >>> }
> >> > >>>
> >> > >>> What do you think?
> >> > >>>
> >> > >>> On another but related note: Is there a plan to have a method -- say
> >> > >>> ClusterOutputPostProcessorDriver -- which when run outputs the
> >> > >>> vectors within clusters as well as a separate folder containing
> >> > >>> pruned outliers?
> >> > >>>
> >> > >>> Thanks!
> >> > >>>
> >> > >>> Mattie
> >> > >>>
> >> > >>> -----Original Message-----
> >> > >>> From: Paritosh Ranjan [mailto:[email protected]]
> >> > >>> Sent: Friday, August 17, 2012 12:16 PM
> >> > >>> To: [email protected]
> >> > >>> Subject: Re: Mahout-279/kmeans++
> >> > >>>
> >> > >>> The clustering algorithm has also changed internally. So, expect the
> >> > >>> results to be different (and better).
> >> > >>>
> >> > >>> I can think of one reason for this behavior. Maybe lots of clusters
> >> > >>> have only one vector inside them, and, AFAIK, clusterdumper will not
> >> > >>> output any cluster with a single vector.
> >> > >>> So, I think, it's clusterdumper which is doing the invisible
> >> > >>> "pruning" (by not outputting clusters with single vectors).
> >> > >>>
> >> > >>> Can you cross check the output once with
> >> > >>> ClusterOutputPostProcessorDriver?
> >> > >>>
> >> > >>> No, no tool can output the pruned vectors. The only way to see all
> >> > >>> vectors assigned to any cluster is to set
> >> > >>> clusterClassificationThreshold to 0.
> >> > >>>
> >> > >>> If you still face the problem, then please provide the parameters
> >> > >>> with which you are calling kmeans.
> >> > >>>
> >> > >>> Regarding "I should also mention I have vectors which are exactly the
> >> > >>> same (even their names), perhaps they are the ones being pruned, is
> >> > >>> that possible?"
> >> > >>>
> >> > >>> The name of the vector has nothing to do with clustering, I am not
> >> > >>> sure whether it will have any effect when clusterdumper is in action.
> >> > >>> So, crosschecking with ClusterOutputPostProcessorDriver will answer
> >> > >>> this.
> >> > >>>
> >> > >>> Good luck.
> >> > >>> Paritosh
> >> > >>>
> >> > >>> On 17-08-2012 21:07, Whitmore, Mattie wrote:
> >> > >>>> Sure, I have a dataset which I wish to cluster using Kmeans.
> >> > >>>> Previously (v0.5) when I did a clusterdump the total amount of
> >> > >>>> vectors within the resultant clusters was the same as the total
> >> > >>>> amount fed to the algorithm. I wish this to be the case when
> >> > >>>> clustering with v0.7. The only change in the algorithm is
> >> > >>>> clusterClassificationThreshold, I set this value to be 0 so that it
> >> > >>>> will in fact cluster all vectors in the dataset.
> >> > >>>>
> >> > >>>> My logic here was no vector should have a probability of being in
> >> > >>>> some cluster less than 0 and therefore all vectors should cluster.
> >> > >>>>
> >> > >>>> However after running a clusterdump I find that vectors (1/3
> >> > >>>> roughly) have been pruned.
> >> > >>>>
> >> > >>>> Is this a bug, or me just not understanding the new capabilities?
> >> > >>>>
> >> > >>>> I should also mention I have vectors which are exactly the same
> >> > >>>> (even their names), perhaps they are the ones being pruned, is that
> >> > >>>> possible?
> >> > >>>>
> >> > >>>> Another question if I may: I will eventually want to use the pruning
> >> > >>>> capabilities, does the ClusterOutputPostProcessorDriver method (or a
> >> > >>>> similar method) have the capability of outputting the pruned vectors
> >> > >>>> into a folder?
> >> > >>>>
> >> > >>>> Thanks! Please let me know if I'm still not being clear enough.
> >> > >>>>
> >> > >>>> Mattie
> >> > >>>>
> >> > >>>> -----Original Message-----
> >> > >>>> From: Paritosh Ranjan [mailto:[email protected]]
> >> > >>>> Sent: Friday, August 17, 2012 11:20 AM
> >> > >>>> To: [email protected]
> >> > >>>> Subject: Re: Mahout-279/kmeans++
> >> > >>>>
> >> > >>>> clusterClassificationThreshold is for outlier removal, and this is
> >> > >>>> the way it should be used.
> >> > >>>>
> >> > >>>> Can you provide some more information about your job and the way you
> >> > >>>> are calling it?
> >> > >>>>
> >> > >>>> And if I look at the code, the vector should be clustered even if
> >> > >>>> the pdf is 0. The method which decides whether the vector should be
> >> > >>>> assigned to a particular cluster or not -
> >> > >>>>
> >> > >>>> /**
> >> > >>>>  * Decides whether the vector should be classified or not based on
> >> > >>>>  * the max pdf value of the clusters and threshold value.
> >> > >>>>  *
> >> > >>>>  * @return whether the vector should be classified or not.
> >> > >>>>  */
> >> > >>>> private static boolean shouldClassify(Vector pdfPerCluster,
> >> > >>>>     Double clusterClassificationThreshold) {
> >> > >>>>   return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
> >> > >>>> }
> >> > >>>>
> >> > >>>> On 17-08-2012 20:06, Whitmore, Mattie wrote:
> >> > >>>>
> >> > >>>>> Hi Ted,
> >> > >>>>>
> >> > >>>>> Yes this is great! I hope to start working with this algorithm in
> >> > >>>>> the next couple weeks.
> >> > >>>>>
> >> > >>>>> I have a question about the 0.7 implementation of kmeans and the
> >> > >>>>> clusterClassificationThreshold, I have this value set at zero, but
> >> > >>>>> the output is still showing that about 1/3 of my data is not
> >> > >>>>> assigned to a cluster in my output. Am I using this value
> >> > >>>>> incorrectly? I did a kmeansdriver.run with the 0.5 and 0.7 api, and
> >> > >>>>> had the data pruned despite the clusterClassificationThreshold = 0.
> >> > >>>>>
> >> > >>>>> Thanks,
> >> > >>>>>
> >> > >>>>> Mattie
> >> > >>>>>
> >> > >>>>> -----Original Message-----
> >> > >>>>> From: Ted Dunning [mailto:[email protected]]
> >> > >>>>> Sent: Wednesday, August 15, 2012 5:20 PM
> >> > >>>>> To: [email protected]
> >> > >>>>> Subject: Re: Mahout-279/kmeans++
> >> > >>>>>
> >> > >>>>> Mattie,
> >> > >>>>>
> >> > >>>>> Would this help?
> >> > >>>>>
> >> > >>>>> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
> >> > >>>>>
> >> > >>>>> and
> >> > >>>>>
> >> > >>>>> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
> >> > >>>>>
> >> > >>>>> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <[email protected]> wrote:
> >> > >>>>>
> >> > >>>>>> Hi!
> >> > >>>>>>
> >> > >>>>>> I have been using RandomSeedGenerator, and was hoping it had a
> >> > >>>>>> patch like that described in Mahout-279 since I want only 10
> >> > >>>>>> vectors out of a set of more than 100,000,000. I have been using
> >> > >>>>>> canopy clustering for better results, but still need to do a few
> >> > >>>>>> passes of kmeans to determine my T, and the random seed does take
> >> > >>>>>> a long time.
> >> > >>>>>>
> >> > >>>>>> The comments say that you are working on a kmeans++, I searched
> >> > >>>>>> around but couldn't confirm any more information about it. Is a
> >> > >>>>>> scalable kmeans++ in the works? (I know research on the subject is
> >> > >>>>>> quite new)
> >> > >>>>>>
> >> > >>>>>> Thanks!
> >> > >>>>>>
> >> > >>>>>> Mattie Whitmore
> >> > >>>>>> Mathematician/IR&D Software Engineer
> >> > >>>>>> HARRIS Corporation - Advanced Information Solutions
> >> > >>>>>> 301.837.5278
> >> > >>>>>> [email protected]
