One way to test this is to add a small amount of noise to all of your data
points, so that no two points are exactly identical. This won't be easy from
the command line, but it is easy from Java. For instance:
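// needs org.apache.mahout.math.Vector, org.apache.mahout.math.DenseVector
// and org.apache.mahout.math.function.Functions
Vector v = ...; // read one data point as a vector
Vector u = new DenseVector(v.size()).assign(Functions.random());
// v = v + 0.1 * u, so exact duplicates end up slightly different
v.assign(u, Functions.plusMult(0.1));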
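
If you want to try the idea without Mahout on the classpath first, here is a
rough, dependency-free sketch. It assumes the points are plain double[] arrays
and that uniform noise in [0, scale) is acceptable; the class and method names
(JitterSketch, jitter, countDistinct) are made up for illustration and are not
part of Mahout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

class JitterSketch {

    // Returns a noisy copy of every point: each coordinate gets uniform
    // noise in [0, scale) added. The input list is left untouched.
    public static List<double[]> jitter(List<double[]> points, double scale, Random rng) {
        List<double[]> out = new ArrayList<>();
        for (double[] p : points) {
            double[] q = new double[p.length];
            for (int i = 0; i < p.length; i++) {
                q[i] = p[i] + scale * rng.nextDouble();
            }
            out.add(q);
        }
        return out;
    }

    // Counts distinct points by exact coordinate equality.
    public static int countDistinct(List<double[]> points) {
        Set<String> seen = new HashSet<>();
        for (double[] p : points) {
            seen.add(Arrays.toString(p));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {1.0, 2.0});
        points.add(new double[] {1.0, 2.0}); // exact duplicate of the first point
        points.add(new double[] {3.0, 4.0});

        List<double[]> noisy = jitter(points, 0.1, new Random(42));
        System.out.println("distinct before: " + countDistinct(points));
        System.out.println("distinct after:  " + countDistinct(noisy));
    }
}
```

If the number of clustered vectors matches the input count once the points are
jittered, that would suggest the duplicates are what is being dropped.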
On Wed, Aug 22, 2012 at 10:40 AM, Whitmore, Mattie <[email protected]> wrote:
> Yes, I have data which is exactly the same. If I give every vector a name
> which is distinct (albeit the data point is the same as other points in the
> set) will this keep the algorithm from dropping non-distinct vectors/data
> points (which is what I THINK but have yet to verify is what is going on)?
>
> Thanks,
>
> Mattie
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Wednesday, August 22, 2012 1:18 PM
> To: [email protected]
> Subject: Re: Mahout-279/kmeans++
>
> Just an off thought, do you have duplicate input points?
>
> On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <[email protected]> wrote:
>
> > ... I have also verified by running canopy multiple times with 0.5 and
> > 0.7 that there is a continual discrepancy between the two clustering
> > versions. The max/min vectors in a cluster using 0.5 is: 19192158/215
> > and 0.7 is: 921998/5. They should not necessarily be the same, since I
> > am using canopy clustering to find initial centroids, however I would
> > think they would have the same sum, which they do not (45901885 vs
> > 1599154).
> >
> > Here is the method I am running:
> >
> > public static void KmeansClusteringCanopy(String outputDir, String T,
> >         String itMax)
> >         throws IOException, InterruptedException, ClassNotFoundException,
> >         InstantiationException, IllegalAccessException {
> >
> >     Configuration conf = new Configuration();
> >     DistanceMeasure measure = new EuclideanDistanceMeasure();
> >
> >     Path vectorsFolder = new Path(outputDir, "vectors");
> >     Path clusterCenters = new Path(outputDir + "-canopy/centriods");
> >     Path clusterOutput = new Path(outputDir + "-canopy/clusters");
> >
> >     // create canopies instead of initial vectors
> >     CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
> >             Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
> >
> >     // kmeans cluster operation
> >     KMeansDriver.run(conf, vectorsFolder,
> >             new Path(clusterCenters, "clusters-0-final/part-r-00000"),
> >             clusterOutput, measure, 0.01, Integer.parseInt(itMax), true,
> >             0.0, false);
> >
> >     // post process by putting completed clusters into their own files
> >     ClusterOutputPostProcessorDriver.run(clusterOutput,
> >             new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
> > }
> >
> > What do you think?
> >
> > On another but related note: Is there a plan to have a method -- say
> > ClusterOutputPostProcessorDriver -- which when run outputs the vectors
> > within clusters as well as a separate folder containing pruned outliers?
> >
> > Thanks!
> >
> > Mattie
> >
> > -----Original Message-----
> > From: Paritosh Ranjan [mailto:[email protected]]
> > Sent: Friday, August 17, 2012 12:16 PM
> > To: [email protected]
> > Subject: Re: Mahout-279/kmeans++
> >
> > The clustering algorithm has also changed internally. So, expect the
> > results to be different (and better).
> >
> > I can think of one reason for this behavior. Maybe lots of clusters
> > have only one vector in them, and, AFAIK, clusterdumper will not
> > output any cluster with a single vector.
> > So, I think it's clusterdumper which is doing the invisible "pruning"
> > (by not outputting clusters with single vectors).
> >
> > Can you cross check the output once with
> > ClusterOutputPostProcessorDriver?
> >
> > No, no tool can output the pruned vectors. The only way to see all
> > vectors assigned to any cluster is to set clusterClassificationThreshold
> > to 0.
> >
> > If you still face the problem, then please provide the parameters with
> > which you are calling kmeans.
> >
> > Regarding "I should also mention I have vectors which are exactly the
> > same (even their names), perhaps they are the ones being pruned, is that
> > possible? "
> >
> > The name of the vector has nothing to do with clustering, I am not sure
> > whether it will have any effect when clusterdumper is in action. So,
> > crosschecking with ClusterOutputPostProcessorDriver will answer this.
> >
> > Good luck.
> > Paritosh
> >
> > On 17-08-2012 21:07, Whitmore, Mattie wrote:
> > > Sure, I have a dataset which I wish to cluster using Kmeans. Previously
> > > (v0.5) when I did a clusterdump the total number of vectors within the
> > > resultant clusters was the same as the total number fed to the algorithm.
> > > I wish this to be the case when clustering with v0.7. The only change in
> > > the algorithm is clusterClassificationThreshold; I set this value to 0
> > > so that it will in fact cluster all vectors in the dataset.
> > >
> > > My logic here was no vector should have a probability of being in some
> > > cluster less than 0 and therefore all vectors should cluster.
> > >
> > > However after running a clusterdump I find that vectors (1/3 roughly)
> > > have been pruned.
> > >
> > > Is this a bug, or me just not understanding the new capabilities?
> > >
> > > I should also mention I have vectors which are exactly the same (even
> > > their names), perhaps they are the ones being pruned, is that possible?
> > >
> > > Another question if I may: I will eventually want to use the pruning
> > > capabilities, does the ClusterOutputPostProcessorDriver method (or a
> > > similar method) have the capability of outputting the pruned vectors
> > > into a folder?
> > >
> > > Thanks! Please let me know if I'm still not being clear enough.
> > >
> > > Mattie
> > >
> > > -----Original Message-----
> > > From: Paritosh Ranjan [mailto:[email protected]]
> > > Sent: Friday, August 17, 2012 11:20 AM
> > > To: [email protected]
> > > Subject: Re: Mahout-279/kmeans++
> > >
> > > clusterClassificationThreshold is for outlier removal, and this is the
> > > way it should be used.
> > >
> > > Can you provide some more information about your job and the way you are
> > > calling it?
> > >
> > > And if I look at the code, the vector should be clustered even if the
> > > pdf is 0. The method which decides whether the vector should be assigned
> > > to a particular cluster or not -
> > >
> > > /**
> > >  * Decides whether the vector should be classified or not based on
> > >  * the max pdf value of the clusters and threshold value.
> > >  *
> > >  * @return whether the vector should be classified or not.
> > >  */
> > > private static boolean shouldClassify(Vector pdfPerCluster,
> > >         Double clusterClassificationThreshold) {
> > >     return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
> > > }
> > >
> > > On 17-08-2012 20:06, Whitmore, Mattie wrote:
> > >
> > >> Hi Ted,
> > >>
> > >> Yes this is great! I hope to start working with this algorithm in the
> > >> next couple weeks.
> > >>
> > >> I have a question about the 0.7 implementation of kmeans and the
> > >> clusterClassificationThreshold, I have this value set at zero, but the
> > >> output is still showing that about 1/3 of my data is not assigned to a
> > >> cluster in my output. Am I using this value incorrectly? I did a
> > >> kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned
> > >> despite the clusterClassificationThreshold = 0.
> > >>
> > >>
> > >> Thanks,
> > >>
> > >> Mattie
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: Ted Dunning [mailto:[email protected]]
> > >> Sent: Wednesday, August 15, 2012 5:20 PM
> > >> To: [email protected]
> > >> Subject: Re: Mahout-279/kmeans++
> > >>
> > >> Mattie,
> > >>
> > >> Would this help?
> > >>
> > >>
> > >> https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
> > >>
> > >> and
> > >>
> > >>
> > >> https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
> > >>
> > >> On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <[email protected]> wrote:
> > >>
> > >>> Hi!
> > >>>
> > >>> I have been using RandomSeedGenerator, and was hoping it had a patch
> > >>> like that described in Mahout-279 since I want only 10 vectors out of a
> > >>> set of more than 100,000,000. I have been using canopy clustering for
> > >>> better results, but still need to do a few passes of kmeans to determine
> > >>> my T, and the random seed does take a long time.
> > >>>
> > >>> The comments say that you are working on a kmeans++, I searched around
> > >>> but couldn't confirm any more information about it. Is a scalable
> > >>> kmeans++ in the works? (I know research on the subject is quite new)
> > >>>
> > >>> Thanks!
> > >>>
> > >>>
> > >>>
> > >>> Mattie Whitmore
> > >>> Mathematician/IR&D Software Engineer
> > >>> HARRIS Corporation - Advanced Information Solutions
> > >>> 301.837.5278
> > >>> [email protected]
> > >>>
> > >>>
> > >>>
> > >>>
> > >
> >
>