Yes, I do have data points that are exactly the same. If I give every vector a distinct name (even though the data point itself is identical to other points in the set), will that keep the algorithm from dropping the non-distinct vectors/data points? (That dropping is what I THINK is going on, but I have yet to verify it.)
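For reference, a distinct name can be attached by wrapping each vector in a NamedVector before the input SequenceFile is written. The sketch below is only illustrative: it assumes the usual Text/VectorWritable input format for k-means, the DistinctNameWriter class and "v-" keys are made up for the example, and whether distinct names actually stop the dropping is exactly what still needs verifying:

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class DistinctNameWriter {

      // Writes each vector under a unique name ("v-0", "v-1", ...) so that
      // duplicate data points still carry distinct identities in the input file.
      public static void writeWithDistinctNames(Configuration conf, Path output,
                                                List<Vector> vectors) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, output,
            Text.class, VectorWritable.class);
        try {
          int i = 0;
          for (Vector v : vectors) {
            String name = "v-" + i++;                      // distinct name per point
            NamedVector named = new NamedVector(v, name);  // wrap; data is unchanged
            writer.append(new Text(name), new VectorWritable(named));
          }
        } finally {
          writer.close();
        }
      }
    }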
Thanks,

Mattie

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Wednesday, August 22, 2012 1:18 PM
To: [email protected]
Subject: Re: Mahout-279/kmeans++

Just an off thought: do you have duplicate input points?

On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <[email protected]> wrote:

> ... I have also verified, by running canopy multiple times with 0.5 and
> 0.7, that there is a continual discrepancy between the two clustering
> versions. The max/min vectors in a cluster using 0.5 are 19192158/215,
> and using 0.7 they are 921998/5. They should not necessarily be the
> same, since I am using canopy clustering to find the initial centroids;
> however, I would expect them to have the same sum, which they do not
> (45901885 vs. 1599154).
>
> Here is the method I am running:
>
>     public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
>             throws IOException, InterruptedException, ClassNotFoundException,
>             InstantiationException, IllegalAccessException {
>
>         Configuration conf = new Configuration();
>
>         DistanceMeasure measure = new EuclideanDistanceMeasure();
>
>         Path vectorsFolder = new Path(outputDir, "vectors");
>         Path clusterCenters = new Path(outputDir + "-canopy/centriods");
>         Path clusterOutput = new Path(outputDir + "-canopy/clusters");
>
>         // create canopies instead of initial vectors
>         CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
>                 Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
>
>         // kmeans cluster operation
>         KMeansDriver.run(conf, vectorsFolder,
>                 new Path(clusterCenters, "clusters-0-final/part-r-00000"),
>                 clusterOutput, measure, 0.01, Integer.parseInt(itMax), true, 0.0, false);
>
>         // post process by putting completed clusters into their own files
>         ClusterOutputPostProcessorDriver.run(clusterOutput,
>                 new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
>     }
>
> What do you think?
>
> On another but related note: is there a plan to have a method -- say,
> ClusterOutputPostProcessorDriver -- which, when run, outputs the vectors
> within clusters as well as a separate folder containing the pruned
> outliers?
>
> Thanks!
>
> Mattie
>
> -----Original Message-----
> From: Paritosh Ranjan [mailto:[email protected]]
> Sent: Friday, August 17, 2012 12:16 PM
> To: [email protected]
> Subject: Re: Mahout-279/kmeans++
>
> The clustering algorithm has also changed internally, so expect the
> results to be different (and better).
>
> I can think of one reason for this behavior: maybe lots of clusters have
> only one vector inside them, and, AFAIK, clusterdumper will not output
> any cluster with a single vector. So I think it's clusterdumper which is
> doing the invisible "pruning" (by not outputting clusters with single
> vectors).
>
> Can you cross-check the output once with ClusterOutputPostProcessorDriver?
>
> No, no tool can output the pruned vectors. The only way to see all
> vectors assigned to any cluster is to set clusterClassificationThreshold
> to 0.
>
> If you still face the problem, then please provide the parameters with
> which you are calling kmeans.
>
> Regarding "I should also mention I have vectors which are exactly the
> same (even their names), perhaps they are the ones being pruned, is that
> possible?":
>
> The name of the vector has nothing to do with clustering. I am not sure
> whether it has any effect when clusterdumper is in action, so
> cross-checking with ClusterOutputPostProcessorDriver will answer this.
>
> Good luck.
>
> Paritosh
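A rough sketch of the cross-check suggested above -- counting how many vectors actually land in each cluster so the totals can be compared against the input -- assuming the clustered points are written as Hadoop SequenceFiles under the k-means output (for example clusterOutput/clusteredPoints, or the per-cluster folders produced by ClusterOutputPostProcessorDriver). The key and value classes are instantiated reflectively so the same count works regardless of the exact record types, and the ClusterSizeCheck class name is made up for the example:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class ClusterSizeCheck {

      // Tallies how many records each cluster key received, plus the grand total,
      // so the sum can be compared with the number of input vectors.
      public static void countVectorsPerCluster(Configuration conf, Path clusteredPoints)
          throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Map<String, Long> counts = new HashMap<String, Long>();
        long total = 0;
        for (FileStatus status : fs.listStatus(clusteredPoints)) {
          if (!status.getPath().getName().startsWith("part-")) {
            continue; // skip _SUCCESS, _logs and similar entries
          }
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
          try {
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
              String clusterId = key.toString();
              Long c = counts.get(clusterId);
              counts.put(clusterId, c == null ? 1L : c + 1L);
              total++;
            }
          } finally {
            reader.close();
          }
        }
        System.out.println("clusters: " + counts.size() + ", total vectors: " + total);
      }
    }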
>
> On 17-08-2012 21:07, Whitmore, Mattie wrote:
> > Sure, I have a dataset which I wish to cluster using k-means. Previously
> > (v0.5), when I did a clusterdump, the total number of vectors within the
> > resultant clusters was the same as the total number fed to the algorithm.
> > I want this to be the case when clustering with v0.7. The only change in
> > the algorithm is clusterClassificationThreshold; I set this value to 0 so
> > that it will in fact cluster all vectors in the dataset.
> >
> > My logic here was that no vector should have a probability of being in
> > some cluster less than 0, and therefore all vectors should cluster.
> >
> > However, after running a clusterdump I find that roughly 1/3 of the
> > vectors have been pruned.
> >
> > Is this a bug, or am I just not understanding the new capabilities?
> >
> > I should also mention I have vectors which are exactly the same (even
> > their names); perhaps they are the ones being pruned. Is that possible?
> >
> > Another question, if I may: I will eventually want to use the pruning
> > capabilities. Does the ClusterOutputPostProcessorDriver method (or a
> > similar method) have the capability of outputting the pruned vectors
> > into a folder?
> >
> > Thanks! Please let me know if I'm still not being clear enough.
> >
> > Mattie
> >
> > -----Original Message-----
> > From: Paritosh Ranjan [mailto:[email protected]]
> > Sent: Friday, August 17, 2012 11:20 AM
> > To: [email protected]
> > Subject: Re: Mahout-279/kmeans++
> >
> > clusterClassificationThreshold is for outlier removal, and this is the
> > way it should be used.
> >
> > Can you provide some more information about your job and the way you are
> > calling it?
> >
> > And if I look at the code, the vector should be clustered even if the
> > pdf is 0. The method which decides whether the vector should be assigned
> > to a particular cluster or not is:
> >
> >     /**
> >      * Decides whether the vector should be classified or not based on the max pdf
> >      * value of the clusters and the threshold value.
> >      *
> >      * @return whether the vector should be classified or not.
> >      */
> >     private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
> >       return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
> >     }
> >
> > On 17-08-2012 20:06, Whitmore, Mattie wrote:
> >
> >> Hi Ted,
> >>
> >> Yes, this is great! I hope to start working with this algorithm in the
> >> next couple of weeks.
> >>
> >> I have a question about the 0.7 implementation of k-means and the
> >> clusterClassificationThreshold. I have this value set at zero, but the
> >> output still shows that about 1/3 of my data is not assigned to a
> >> cluster. Am I using this value incorrectly? I ran KMeansDriver.run with
> >> both the 0.5 and 0.7 APIs, and the data was pruned despite
> >> clusterClassificationThreshold = 0.
> >>
> >> Thanks,
> >>
> >> Mattie
> >>
> >> -----Original Message-----
> >> From: Ted Dunning [mailto:[email protected]]
> >> Sent: Wednesday, August 15, 2012 5:20 PM
> >> To: [email protected]
> >> Subject: Re: Mahout-279/kmeans++
> >>
> >> Mattie,
> >>
> >> Would this help?
> >>
> >> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
> >>
> >> and
> >>
> >> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
> >>
> >> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <[email protected]> wrote:
> >>
> >>> Hi!
> >>>
> >>> I have been using RandomSeedGenerator, and was hoping it had a patch
> >>> like that described in Mahout-279, since I want only 10 vectors out of
> >>> a set of more than 100,000,000. I have been using canopy clustering for
> >>> better results, but I still need to do a few passes of k-means to
> >>> determine my T, and the random seed does take a long time.
> >>>
> >>> The comments say that you are working on a kmeans++; I searched around
> >>> but couldn't confirm any more information about it. Is a scalable
> >>> kmeans++ in the works? (I know research on the subject is quite new.)
> >>>
> >>> Thanks!
> >>>
> >>> Mattie Whitmore
> >>> Mathematician/IR&D Software Engineer
> >>> HARRIS Corporation - Advanced Information Solutions
> >>> 301.837.5278
> >>> [email protected]
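For reference on the RandomSeedGenerator path mentioned above, here is a minimal sketch of seeding k-means with k = 10 randomly sampled vectors. It assumes the 0.7-era signature RandomSeedGenerator.buildRandom(conf, input, output, k, measure) returning the path of the written seed clusters; the class and directory names are illustrative only, and clusterClassificationThreshold is kept at 0.0 so no vector is pruned:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;
    import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
    import org.apache.mahout.common.distance.DistanceMeasure;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

    public class RandomSeedKMeans {

      // Samples k input vectors as initial centroids, then runs k-means on them.
      public static void run(String outputDir, int k, int itMax) throws Exception {
        Configuration conf = new Configuration();
        DistanceMeasure measure = new EuclideanDistanceMeasure();

        Path vectorsFolder = new Path(outputDir, "vectors");
        Path seeds = new Path(outputDir + "-random/seeds");
        Path clusterOutput = new Path(outputDir + "-random/clusters");

        // pick k input vectors at random as the initial cluster centers
        Path seedClusters = RandomSeedGenerator.buildRandom(conf, vectorsFolder, seeds, k, measure);

        // k-means with clusterClassificationThreshold = 0.0 so nothing is pruned
        KMeansDriver.run(conf, vectorsFolder, seedClusters, clusterOutput, measure,
            0.01, itMax, true, 0.0, false);
      }
    }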
