... I have also verified by running canopy multiple times with 0.5 and 0.7
that there is a consistent discrepancy between the two clustering versions.
The max/min number of vectors in a cluster is 19192158/215 with 0.5 and
921998/5 with 0.7. They need not be the same, since I am using canopy
clustering to find the initial centroids. However, I would expect the cluster
sizes to have the same sum, which they do not (45901885 vs 1599154).
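As a rough illustration of the sanity check described above, the sketch below adds up per-cluster vector counts and compares the total with the number of input vectors. It does not use the Mahout API: the class name, the cluster ids, and every number in it are made up for illustration, and in practice the sizes would have to be read from the clusterdump output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ClusterCountCheck {
    /** Sums the number of vectors reported for each cluster. */
    static long totalAssigned(Map<String, Long> clusterSizes) {
        long total = 0;
        for (long size : clusterSizes.values()) {
            total += size;
        }
        return total;
    }

    /** Number of input vectors that do not appear in any reported cluster. */
    static long missing(long inputVectors, Map<String, Long> clusterSizes) {
        return inputVectors - totalAssigned(clusterSizes);
    }

    public static void main(String[] args) {
        // Hypothetical cluster ids and sizes, for illustration only.
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("VL-0", 600L);
        sizes.put("VL-1", 300L);
        sizes.put("VL-2", 50L);
        long inputVectors = 1000L;

        System.out.println("assigned = " + totalAssigned(sizes));
        System.out.println("missing  = " + missing(inputVectors, sizes));
    }
}
```

If the same input is clustered by both versions, a non-zero "missing" value for only one of them would point at where vectors are being dropped.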
Here is the method I am running:
public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
    throws IOException, InterruptedException, ClassNotFoundException,
           InstantiationException, IllegalAccessException {
  Configuration conf = new Configuration();
  DistanceMeasure measure = new EuclideanDistanceMeasure();
  Path vectorsFolder = new Path(outputDir, "vectors");
  Path clusterCenters = new Path(outputDir + "-canopy/centriods");
  Path clusterOutput = new Path(outputDir + "-canopy/clusters");

  // create canopies instead of initial vectors
  CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
      Double.parseDouble(T), Double.parseDouble(T), false, 0, false);

  // kmeans cluster operation
  KMeansDriver.run(conf, vectorsFolder,
      new Path(clusterCenters, "clusters-0-final/part-r-00000"),
      clusterOutput, measure, 0.01,
      Integer.parseInt(itMax), true, 0.0, false);

  // post process by putting completed clusters into their own files.
  ClusterOutputPostProcessorDriver.run(clusterOutput,
      new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
}
What do you think?
On another but related note: Is there a plan to have a method -- say
ClusterOutputPostProcessorDriver -- which when run outputs the vectors
within clusters as well as a separate folder containing pruned outliers?
Thanks!
Mattie
-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Friday, August 17, 2012 12:16 PM
To: [email protected]
Subject: Re: Mahout-279/kmeans++
The clustering algorithm has also changed internally, so expect the
results to be different (and better).
I can think of one reason for this behavior. Maybe lots of clusters
have only one vector in them and, AFAIK, clusterdumper will not
output any cluster with a single vector.
So I think it's clusterdumper that is doing the invisible "pruning"
(by not outputting clusters with single vectors).
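One way to test this hypothesis, sketched below without any Mahout calls: count the clusters that hold exactly one vector and see whether that count matches the gap between the input size and what clusterdumper printed. The class, the method names, and the sizes are all hypothetical, and the assumption that clusterdumper skips single-vector clusters is exactly the thing being checked, not an established fact.

```java
import java.util.Arrays;

class SingletonClusterCheck {
    /** Counts clusters that hold exactly one vector. */
    static long singletonClusters(long[] clusterSizes) {
        return Arrays.stream(clusterSizes).filter(size -> size == 1).count();
    }

    /** True when the singleton clusters alone account for the whole gap. */
    static boolean explainsGap(long[] clusterSizes, long inputVectors, long dumpedVectors) {
        return singletonClusters(clusterSizes) == inputVectors - dumpedVectors;
    }

    public static void main(String[] args) {
        // Hypothetical sizes: 11 vectors in total, 3 of them alone in a cluster.
        long[] sizes = {5, 1, 1, 3, 1};
        long inputVectors = 11;
        long dumpedVectors = 8; // what a dump would show if singletons were skipped

        System.out.println("singletons = " + singletonClusters(sizes));
        System.out.println("explains gap = " + explainsGap(sizes, inputVectors, dumpedVectors));
    }
}
```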
Can you cross check the output once with ClusterOutputPostProcessorDriver?
No, no tool can output the pruned vectors. The only way to see all
vectors assigned to any cluster is to set clusterClassificationThreshold
to 0.
If you still face the problem, then please provide the parameters with
which you are calling kmeans.
Regarding "I should also mention I have vectors which are exactly the
same (even their names), perhaps they are the ones being pruned, is that
possible? "
The name of the vector has nothing to do with clustering. I am not sure
whether it will have any effect when clusterdumper is in action, so
cross-checking with ClusterOutputPostProcessorDriver will answer this.
Good luck.
Paritosh
On 17-08-2012 21:07, Whitmore, Mattie wrote:
Sure, I have a dataset which I wish to cluster using Kmeans. Previously
(v0.5), when I did a clusterdump, the total number of vectors within the
resultant clusters was the same as the total number fed to the algorithm.
I want this to be the case when clustering with v0.7. The only change in
the algorithm is clusterClassificationThreshold; I set this value to 0
so that it will in fact cluster all vectors in the dataset.
My logic here was that no vector should have a probability of less than 0
of being in some cluster, and therefore all vectors should be clustered.
However, after running a clusterdump I find that roughly 1/3 of the
vectors have been pruned.
Is this a bug, or me just not understanding the new capabilities?
I should also mention I have vectors which are exactly the same (even
their names), perhaps they are the ones being pruned, is that possible?
Another question if I may: I will eventually want to use the pruning
capabilities, does the ClusterOutputPostProcessorDriver method (or a
similar method) have the capability of outputting the pruned vectors into a
folder?
Thanks! Please let me know if I'm still not being clear enough.
Mattie
-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Friday, August 17, 2012 11:20 AM
To: [email protected]
Subject: Re: Mahout-279/kmeans++
clusterClassificationThreshold is for outlier removal, and this is the
way it should be used.
Can you provide some more information about your job and the way you are
calling it?
And if I look at the code, the vector should be clustered even if the
pdf is 0. The method which decides whether the vector should be assigned to
a particular cluster or not -
/**
 * Decides whether the vector should be classified or not based on the max pdf
 * value of the clusters and threshold value.
 *
 * @return whether the vector should be classified or not.
 */
private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
  return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
}
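To make the comparison above concrete, here is a small stand-alone sketch of the same test. It is not the Mahout code: it swaps Mahout's Vector for a plain double[] and uses a hypothetical class name, so it only illustrates the ">=" logic. With a threshold of 0, any vector whose pdf values are non-negative passes, even when every pdf is exactly 0.

```java
import java.util.Arrays;

class ThresholdCheck {
    // Stand-in for shouldClassify, using double[] instead of Mahout's Vector.
    static boolean shouldClassify(double[] pdfPerCluster, double threshold) {
        double max = Arrays.stream(pdfPerCluster).max().orElse(Double.NEGATIVE_INFINITY);
        return max >= threshold;
    }

    public static void main(String[] args) {
        double[] allZero = {0.0, 0.0, 0.0};   // pdf of 0 for every cluster
        double[] typical = {0.05, 0.6, 0.35}; // made-up pdf values

        System.out.println(shouldClassify(allZero, 0.0)); // true: 0.0 >= 0.0
        System.out.println(shouldClassify(typical, 0.0)); // true
        System.out.println(shouldClassify(typical, 0.7)); // false: max pdf 0.6 < 0.7
    }
}
```

By this logic alone, a threshold of 0 should not drop any vector, which suggests looking for the missing vectors in a later step rather than in this check.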
On 17-08-2012 20:06, Whitmore, Mattie wrote:
Hi Ted,
Yes this is great! I hope to start working with this algorithm in the
next couple weeks.
I have a question about the 0.7 implementation of kmeans and the
clusterClassificationThreshold. I have this value set to zero, but the
output still shows that about 1/3 of my data is not assigned to a
cluster. Am I using this value incorrectly? I did a
KMeansDriver.run with the 0.5 and 0.7 API, and had the data pruned despite
clusterClassificationThreshold = 0.
Thanks,
Mattie
-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Wednesday, August 15, 2012 5:20 PM
To: [email protected]
Subject: Re: Mahout-279/kmeans++
Mattie,
Would this help?
https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java
and
https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <[email protected]> wrote:
Hi!
I have been using RandomSeedGenerator, and was hoping it had a patch like
that described in Mahout-279 since I want only 10 vectors out of a set of
more than 100,000,000. I have been using canopy clustering for better
results, but still need to do a few passes of kmeans to determine my T, and
the random seed does take a long time.
The comments say that you are working on a kmeans++. I searched around but
couldn't confirm any more information about it. Is a scalable kmeans++ in
the works? (I know research on the subject is quite new.)
Thanks!
Mattie Whitmore
Mathematician/IR&D Software Engineer
HARRIS Corporation - Advanced Information Solutions
301.837.5278
[email protected]<mailto:[email protected]>