Re: Mahout-279/kmeans++

Ted Dunning Thu, 30 Aug 2012 12:56:47 -0700

The names are outside the vector or matrix data.  Vectors and matrices
store numbers, not strings.
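
A minimal sketch of that separation, in plain Java.  The NamedRow class and
the label() helper are hypothetical stand-ins written for illustration; they
are not Mahout classes.  The point is only that the label travels beside the
numeric data rather than occupying a column of it.

```java
import java.util.ArrayList;
import java.util.List;

public class NamedRowSketch {

    /** Hypothetical stand-in for a named vector: the label sits beside the numbers. */
    static final class NamedRow {
        final String name;
        final double[] values;

        NamedRow(String name, double[] values) {
            this.name = name;
            this.values = values;
        }
    }

    /** Pairs each row of numeric data with its label; the numbers are left untouched. */
    static List<NamedRow> label(String[] names, double[][] data) {
        if (names.length != data.length) {
            throw new IllegalArgumentException("need exactly one name per row");
        }
        List<NamedRow> rows = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            rows.add(new NamedRow(names[i], data[i]));
        }
        return rows;
    }

    public static void main(String[] args) {
        String[] names = {"point-a", "point-b"};
        double[][] data = {{1.0, 2.0}, {3.0, 4.0}};
        // A List is an Iterable, so a consumer can walk the labeled rows in order.
        for (NamedRow row : label(names, data)) {
            System.out.println(row.name + " has " + row.values.length + " numeric entries");
        }
    }
}
```

With real Mahout types the same idea would presumably be an Iterable of
NamedVector objects, each wrapping a purely numeric vector.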


On Thu, Aug 30, 2012 at 3:25 PM, Whitmore, Mattie <[email protected]> wrote:

> I was thinking that one column would be the name for each row -- like a
> "name column" for each vector in a matrix.  I probably mistyped somewhere
> in there :).  Would the algorithm be better implemented if it were given a matrix?
> I'm thinking of work done on extending matrix multiplication to tensor
> multiplication I suppose. That is neither here nor there for this current
> project.
>
> Thanks for the guidance!
>
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Thursday, August 30, 2012 2:52 PM
> To: [email protected]
> Subject: Re: Mahout-279/kmeans++
>
> But columns aren't what I would expect you to want labeled.  I think that
> row labels might be nicer.  Happily, each named vector has a name for the
> entire vector as well.
>
> On Thu, Aug 30, 2012 at 2:48 PM, Ted Dunning <[email protected]>
> wrote:
>
> > The input to the BallKmeans is actually not a matrix.  It is an
> > Iterable<MatrixSlice>.  This can be a matrix since a matrix implements
> > this.
> >
> > So one way to deal with this is to build your own Iterable and put
> > NamedVectors into it.  NamedVectors retain labels, as you want.
> >
> >
> > On Thu, Aug 30, 2012 at 12:53 PM, Whitmore, Mattie <[email protected]> wrote:
> >
> >> I need to be using the matrices for BallKmeans.  Can matrices be named?
> >> By this I mean can I assign a column of my matrix to be the "name" of each
> >> row?
> >>
> >> Thanks!
> >>
> >> -----Original Message-----
> >> From: Ted Dunning [mailto:[email protected]]
> >> Sent: Wednesday, August 29, 2012 12:17 PM
> >> To: [email protected]
> >> Subject: Re: Mahout-279/kmeans++
> >>
> >> Yes.  The ball k-means implementation does use weights to indicate multiple
> >> vectors.
> >>
> >> The implementation is definitely ready to test.  I would be slightly
> >> surprised if it has absolutely zero issues, but your feedback on such
> >> issues would help them get fixed much sooner than others.
> >>
> >> On Wed, Aug 29, 2012 at 10:37 AM, Whitmore, Mattie <[email protected]> wrote:
> >>
> >> > I re-ran the canopy-kmeans analytic, this time with unique names, and I lost
> >> > more points in the resulting clusters (total points in the clusters =
> >> > 745490, vs. previously 1599154 for v0.7 and 45901885 for v0.5).  The total
> >> > number of data points fed into the algorithm is 53365862 -- so even v0.5 is
> >> > missing 14% of the data.
> >> >
> >> > I'm thinking that weighting these dense vectors with a weight equal to the
> >> > number of identical vectors in the set could work -- Ball Kmeans seems
> >> > to do this.  Is this a correct interpretation of how to use weights in Ball
> >> > Kmeans, and is Ball Kmeans ready enough to be used/tested?
> >> >
> >> > Thanks
> >> >
> >> > -----Original Message-----
> >> > From: Paritosh Ranjan [mailto:[email protected]]
> >> > Sent: Thursday, August 23, 2012 12:34 PM
> >> > To: [email protected]
> >> > Subject: Re: Mahout-279/kmeans++
> >> >
> >> > clusterDump works in memory, and there are no plans yet to make it
> >> > distributed (or not in memory). See
> >> > https://issues.apache.org/jira/browse/MAHOUT-940
> >> >
> >> > clusterpp has an option for distributed processing, so you can process any
> >> > amount of data with it.
> >> >
> >> > On 23-08-2012 19:55, Whitmore, Mattie wrote:
> >> > > Yes, unique names will be my next plan -- I just can't kick off that job
> >> > > until after the weekend.  If this makes no difference I will also try the
> >> > > noise idea, and I'll follow up about both.
> >> > >
> >> > > My next question is regarding clusterDump.  Is there a way to run this
> >> > > in parallel? I have found some code to execute in Java (the preferable
> >> > > method for me) but I would like the method to be faster and not in memory.
> >> > > Is this a possibility? Or in the works?
> >> > >
> >> > > Thanks!
> >> > >
> >> > > -----Original Message-----
> >> > > From: Paritosh Ranjan [mailto:[email protected]]
> >> > > Sent: Wednesday, August 22, 2012 9:09 PM
> >> > > To: [email protected]
> >> > > Subject: Re: Mahout-279/kmeans++
> >> > >
> >> > > Can you also try to provide distinct names to vectors and then cluster?
> >> > > It should not have any effect, but would be good to know the behavior.
> >> > >
> >> > > On 22-08-2012 23:10, Whitmore, Mattie wrote:
> >> > >> Yes, I have data which is exactly the same.  If I give every vector a
> >> > >> name which is distinct (albeit the data point is the same as other points
> >> > >> in the set) will this keep the algorithm from dropping non-distinct
> >> > >> vectors/data points (which is what I THINK but have yet to verify is what
> >> > >> is going on)?
> >> > >>
> >> > >> Thanks,
> >> > >>
> >> > >> Mattie
> >> > >>
> >> > >> -----Original Message-----
> >> > >> From: Ted Dunning [mailto:[email protected]]
> >> > >> Sent: Wednesday, August 22, 2012 1:18 PM
> >> > >> To: [email protected]
> >> > >> Subject: Re: Mahout-279/kmeans++
> >> > >>
> >> > >> Just an off thought, do you have duplicate input points?
> >> > >>
> >> > >> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <[email protected]> wrote:
> >> > >>
> >> > >>> ... I have also verified by running canopy multiple times with 0.5 and 0.7
> >> > >>> that there is a continual discrepancy between the two clustering versions.
> >> > >>> The max/min vectors in a cluster using 0.5 is: 19192158/215 and 0.7 is:
> >> > >>> 921998/5.  They should not necessarily be the same, since I am using canopy
> >> > >>> clustering to find initial centroids, however I would think they would have
> >> > >>> the same sum, which they do not (45901885 vs 1599154).
> >> > >>>
> >> > >>> Here is the method I am running:
> >> > >>>
> >> > >>> public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
> >> > >>>         throws IOException, InterruptedException, ClassNotFoundException,
> >> > >>>         InstantiationException, IllegalAccessException {
> >> > >>>
> >> > >>>     Configuration conf = new Configuration();
> >> > >>>
> >> > >>>     DistanceMeasure measure = new EuclideanDistanceMeasure();
> >> > >>>
> >> > >>>     Path vectorsFolder = new Path(outputDir, "vectors");
> >> > >>>     Path clusterCenters = new Path(outputDir + "-canopy/centriods");
> >> > >>>     Path clusterOutput = new Path(outputDir + "-canopy/clusters");
> >> > >>>
> >> > >>>     // create canopies instead of initial vectors
> >> > >>>     CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
> >> > >>>             Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
> >> > >>>
> >> > >>>     // kmeans cluster operation
> >> > >>>     KMeansDriver.run(conf, vectorsFolder,
> >> > >>>             new Path(clusterCenters, "clusters-0-final/part-r-00000"),
> >> > >>>             clusterOutput, measure, 0.01,
> >> > >>>             Integer.parseInt(itMax), true, 0.0, false);
> >> > >>>
> >> > >>>     // post process by putting completed clusters into their own files.
> >> > >>>     ClusterOutputPostProcessorDriver.run(clusterOutput,
> >> > >>>             new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
> >> > >>> }
> >> > >>>
> >> > >>> What do you think?
> >> > >>>
> >> > >>> On another but related note: Is there a plan to have a method -- say
> >> > >>> ClusterOutputPostProcessorDriver -- which when run outputs the vectors
> >> > >>> within clusters as well as a separate folder containing pruned outliers?
> >> > >>>
> >> > >>> Thanks!
> >> > >>>
> >> > >>> Mattie
> >> > >>>
> >> > >>> -----Original Message-----
> >> > >>> From: Paritosh Ranjan [mailto:[email protected]]
> >> > >>> Sent: Friday, August 17, 2012 12:16 PM
> >> > >>> To: [email protected]
> >> > >>> Subject: Re: Mahout-279/kmeans++
> >> > >>>
> >> > >>> The clustering algorithm has also changed internally. So, expect the
> >> > >>> results to be different (and better).
> >> > >>>
> >> > >>> I can think of one reason for this behavior. Maybe lots of clusters
> >> > >>> have only one vector inside them, and, AFAIK, clusterdumper will not
> >> > >>> output any cluster with a single vector.
> >> > >>> So, I think it's clusterdumper which is doing the invisible "pruning"
> >> > >>> (by not outputting clusters with single vectors).
> >> > >>>
> >> > >>> Can you cross check the output once with ClusterOutputPostProcessorDriver?
> >> > >>>
> >> > >>> No, no tool can output the pruned vectors. The only way to see all
> >> > >>> vectors assigned to any cluster is to set clusterClassificationThreshold
> >> > >>> to 0.
> >> > >>>
> >> > >>> If you still face the problem, then please provide the parameters with
> >> > >>> which you are calling kmeans.
> >> > >>>
> >> > >>> Regarding "I should also mention I have vectors which are exactly the
> >> > >>> same (even their names), perhaps they are the ones being pruned, is that
> >> > >>> possible? "
> >> > >>>
> >> > >>> The name of the vector has nothing to do with clustering. I am not sure
> >> > >>> whether it will have any effect when clusterdumper is in action. So,
> >> > >>> crosschecking with ClusterOutputPostProcessorDriver will answer this.
> >> > >>>
> >> > >>> Good luck.
> >> > >>> Paritosh
> >> > >>>
> >> > >>> On 17-08-2012 21:07, Whitmore, Mattie wrote:
> >> > >>>> Sure, I have a dataset which I wish to cluster using Kmeans.  Previously
> >> > >>>> (v0.5) when I did a clusterdump the total number of vectors within the
> >> > >>>> resultant clusters was the same as the total number fed to the algorithm.
> >> > >>>> I wish this to be the case when clustering with v0.7.  The only change in
> >> > >>>> the algorithm is clusterClassificationThreshold; I set this value to be 0
> >> > >>>> so that it will in fact cluster all vectors in the dataset.
> >> > >>>> My logic here was no vector should have a probability of being in some
> >> > >>>> cluster less than 0 and therefore all vectors should cluster.
> >> > >>>> However after running a clusterdump I find that vectors (1/3 roughly)
> >> > >>>> have been pruned.
> >> > >>>> Is this a bug, or me just not understanding the new capabilities?
> >> > >>>>
> >> > >>>> I should also mention I have vectors which are exactly the same (even
> >> > >>>> their names), perhaps they are the ones being pruned, is that possible?
> >> > >>>> Another question if I may: I will eventually want to use the pruning
> >> > >>>> capabilities, does the ClusterOutputPostProcessorDriver method (or a
> >> > >>>> similar method) have the capability of outputting the pruned vectors into a
> >> > >>>> folder?
> >> > >>>> Thanks! Please let me know if I'm still not being clear enough.
> >> > >>>>
> >> > >>>> Mattie
> >> > >>>>
> >> > >>>> -----Original Message-----
> >> > >>>> From: Paritosh Ranjan [mailto:[email protected]]
> >> > >>>> Sent: Friday, August 17, 2012 11:20 AM
> >> > >>>> To: [email protected]
> >> > >>>> Subject: Re: Mahout-279/kmeans++
> >> > >>>>
> >> > >>>> clusterClassificationThreshold is for outlier removal, and this is the
> >> > >>>> way it should be used.
> >> > >>>> Can you provide some more information about your job and the way you are
> >> > >>>> calling it?
> >> > >>>> And if I look at the code, the vector should be clustered even if the
> >> > >>>> pdf is 0. The method which decides whether the vector should be assigned to
> >> > >>>> a particular cluster or not -
> >> > >>>> /**
> >> > >>>>  * Decides whether the vector should be classified or not based on the max pdf
> >> > >>>>  * value of the clusters and threshold value.
> >> > >>>>  *
> >> > >>>>  * @return whether the vector should be classified or not.
> >> > >>>>  */
> >> > >>>> private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
> >> > >>>>   return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
> >> > >>>> }
> >> > >>>>
> >> > >>>> On 17-08-2012 20:06, Whitmore, Mattie wrote:
> >> > >>>>
> >> > >>>>> Hi Ted,
> >> > >>>>>
> >> > >>>>> Yes this is great!  I hope to start working with this algorithm in the
> >> > >>>>> next couple weeks.
> >> > >>>>> I have a question about the 0.7 implementation of kmeans and the
> >> > >>>>> clusterClassificationThreshold.  I have this value set at zero, but the
> >> > >>>>> output is still showing that about 1/3 of my data is not assigned to a
> >> > >>>>> cluster in my output.  Am I using this value incorrectly?  I did a
> >> > >>>>> kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned despite
> >> > >>>>> the clusterClassificationThreshold = 0.
> >> > >>>>> Thanks,
> >> > >>>>>
> >> > >>>>> Mattie
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> -----Original Message-----
> >> > >>>>> From: Ted Dunning [mailto:[email protected]]
> >> > >>>>> Sent: Wednesday, August 15, 2012 5:20 PM
> >> > >>>>> To: [email protected]
> >> > >>>>> Subject: Re: Mahout-279/kmeans++
> >> > >>>>>
> >> > >>>>> Mattie,
> >> > >>>>>
> >> > >>>>> Would this help?
> >> > >>>>>
> >> > >>>>> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
> >> > >>>>>
> >> > >>>>> and
> >> > >>>>>
> >> > >>>>> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
> >> > >>>>>
> >> > >>>>> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <[email protected]> wrote:
> >> > >>>>>> Hi!
> >> > >>>>>>
> >> > >>>>>> I have been using RandomSeedGenerator, and was hoping it had a patch like
> >> > >>>>>> that described in Mahout-279 since I want only 10 vectors out of a set of
> >> > >>>>>> more than 100,000,000.  I have been using canopy clustering for better
> >> > >>>>>> results, but still need to do a few passes of kmeans to determine my T, and
> >> > >>>>>> the random seed does take a long time.
> >> > >>>>>>
> >> > >>>>>> The comments say that you are working on a kmeans++.  I searched around but
> >> > >>>>>> couldn't confirm any more information about it.  Is a scalable kmeans++ in
> >> > >>>>>> the works? (I know research on the subject is quite new)
> >> > >>>>>>
> >> > >>>>>> Thanks!
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>> Mattie Whitmore
> >> > >>>>>> Mathematician/IR&D Software Engineer
> >> > >>>>>> HARRIS  Corporation - Advanced Information Solutions
> >> > >>>>>> 301.837.5278
> >> > >>>>>> [email protected]<mailto:[email protected]>
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>>
> >> > >
> >> >
> >> >
> >> >
> >>
> >
> >
>
