Re: Mahout-279/kmeans++

Paritosh Ranjan Thu, 23 Aug 2012 09:34:17 -0700

clusterDump works in memory, and there are no plans yet to make it distributed 
(or not in memory). See https://issues.apache.org/jira/browse/MAHOUT-940


clusterpp has an option for distributed processing, so you can process any 
amount of data with it.

On 23-08-2012 19:55, Whitmore, Mattie wrote:

Yes, unique names will be my next plan -- I just can't kick off that job until 
after the weekend.  If this makes no difference I will also try the noise idea, 
and I'll follow up about both.

My next question is regarding clusterDump.  Is there a way to run this in 
parallel? I have found some code to execute in java (the preferable method for 
me) but I would like the method to be faster and not in memory.  Is this a 
possibility? Or in the works?

Thanks!

-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Wednesday, August 22, 2012 9:09 PM
To: [email protected]
Subject: Re: Mahout-279/kmeans++

Can you also try to provide distinct names to vectors and then cluster?
It should not have any effect, but it would be good to know the behavior.

On 22-08-2012 23:10, Whitmore, Mattie wrote:

Yes, I have data which is exactly the same.  If I give every vector a name 
which is distinct (albeit the data point is the same as other points in the 
set) will this keep the algorithm from dropping non-distinct vectors/data 
points (which is what I THINK but have yet to verify is what is going on)?

Thanks,

Mattie

-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Wednesday, August 22, 2012 1:18 PM
To: [email protected]
Subject: Re: Mahout-279/kmeans++

Just an off thought, do you have duplicate input points?
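
One way to answer this before clustering is to count exact duplicates in a sample of the input. The following is only a minimal sketch in plain Java, not Mahout code: the class name DuplicatePointCount and the use of double[] in place of Mahout vectors are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Mahout.
final class DuplicatePointCount {

    // Returns how many points are exact repeats of an earlier point.
    static int countDuplicates(List<double[]> points) {
        Set<List<Double>> seen = new HashSet<>();
        int duplicates = 0;
        for (double[] point : points) {
            // Box the coordinates so the point can be compared by value.
            List<Double> key = new ArrayList<>(point.length);
            for (double v : point) {
                key.add(v);
            }
            if (!seen.add(key)) {
                duplicates++;
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<double[]> sample = Arrays.asList(
                new double[] {1.0, 2.0},
                new double[] {1.0, 2.0},
                new double[] {3.0, 4.0});
        System.out.println(countDuplicates(sample));
    }
}
```

This keeps every distinct point in memory, so it is only practical on a sample, not on the full input.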

On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <[email protected]> wrote:

... I have also verified by running canopy multiple times with 0.5 and 0.7
that there is a continual discrepancy between the two clustering versions.
The max/min number of vectors in a cluster is 19192158/215 with 0.5 and
921998/5 with 0.7.  They should not necessarily be the same, since I am using
canopy clustering to find initial centroids; however, I would think they would
have the same sum, which they do not (45901885 vs 1599154).
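
The sum check described above can be sketched as follows. This is a minimal illustration in plain Java with made-up cluster sizes; the class and method names are hypothetical and not part of Mahout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout.
final class ClusterCountCheck {

    // Adds up the number of vectors reported for every cluster.
    static long totalAssigned(Map<Integer, Long> clusterSizes) {
        long total = 0;
        for (long size : clusterSizes.values()) {
            total += size;
        }
        return total;
    }

    // Number of input vectors that do not appear in any cluster.
    static long unassigned(long inputCount, Map<Integer, Long> clusterSizes) {
        return inputCount - totalAssigned(clusterSizes);
    }

    public static void main(String[] args) {
        // Made-up sizes: cluster id -> number of vectors in that cluster.
        Map<Integer, Long> sizes = new LinkedHashMap<>();
        sizes.put(0, 4L);
        sizes.put(1, 3L);
        sizes.put(2, 1L);
        System.out.println(unassigned(10, sizes));
    }
}
```

A non-zero result means that some input vectors are missing from the dumped clusters.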

Here is the method I am running:

public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
        throws IOException, InterruptedException, ClassNotFoundException,
               InstantiationException, IllegalAccessException {

    Configuration conf = new Configuration();

    DistanceMeasure measure = new EuclideanDistanceMeasure();

    Path vectorsFolder = new Path(outputDir, "vectors");
    Path clusterCenters = new Path(outputDir + "-canopy/centriods");
    Path clusterOutput = new Path(outputDir + "-canopy/clusters");

    // create canopies instead of initial vectors
    CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
            Double.parseDouble(T), Double.parseDouble(T), false, 0, false);

    // kmeans cluster operation
    KMeansDriver.run(conf, vectorsFolder,
            new Path(clusterCenters, "clusters-0-final/part-r-00000"),
            clusterOutput, measure, 0.01,
            Integer.parseInt(itMax), true, 0.0, false);

    // post process by putting completed clusters into their own files
    ClusterOutputPostProcessorDriver.run(clusterOutput,
            new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
}

What do you think?

On another but related note: Is there a plan to have a method -- say
ClusterOutputPostProcessorDriver -- which when run outputs the vectors
within clusters as well as a separate folder containing pruned outliers?

Thanks!

Mattie

-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Friday, August 17, 2012 12:16 PM
To: [email protected]
Subject: Re: Mahout-279/kmeans++

The clustering algorithm has also changed internally. So, expect the
results to be different ( and better ).

I can think of one reason for this behavior. Maybe lots of clusters
contain only one vector, and, AFAIK, clusterdumper will not
output any cluster with a single vector.
So, I think, it's clusterdumper which is doing the invisible "pruning"
(by not outputting clusters with single vectors).

Can you cross check the output once with ClusterOutputPostProcessorDriver?

No, no tool can output the pruned vectors. The only way to see all
vectors assigned to any cluster is to set clusterClassificationThreshold
to 0.

If you still face the problem, then please provide the parameters with
which you are calling kmeans.

Regarding "I should also mention I have vectors which are exactly the
same (even their names), perhaps they are the ones being pruned, is that
possible? "

The name of the vector has nothing to do with clustering, but I am not sure
whether it will have any effect when clusterdumper is in action. So,
crosschecking with ClusterOutputPostProcessorDriver will answer this.

Good luck.
Paritosh

On 17-08-2012 21:07, Whitmore, Mattie wrote:

Sure, I have a dataset which I wish to cluster using Kmeans.  Previously
(v0.5), when I did a clusterdump, the total number of vectors within the
resultant clusters was the same as the total number fed to the algorithm.
I wish this to be the case when clustering with v0.7.  The only change in
the algorithm is clusterClassificationThreshold; I set this value to 0
so that it will in fact cluster all vectors in the dataset.

My logic here was that no vector should have a probability of being in some
cluster less than 0, and therefore all vectors should cluster.

However, after running a clusterdump I find that vectors (roughly 1/3)
have been pruned.

Is this a bug, or me just not understanding the new capabilities?

I should also mention I have vectors which are exactly the same (even
their names), perhaps they are the ones being pruned, is that possible?

Another question, if I may: I will eventually want to use the pruning
capabilities.  Does the ClusterOutputPostProcessorDriver method (or a
similar method) have the capability of outputting the pruned vectors into a
folder?

Thanks! Please let me know if I'm still not being clear enough.

Mattie

-----Original Message-----
From: Paritosh Ranjan [mailto:[email protected]]
Sent: Friday, August 17, 2012 11:20 AM
To: [email protected]
Subject: Re: Mahout-279/kmeans++

clusterClassificationThreshold is for outlier removal, and this is the
way it should be used.

Can you provide some more information about your job and the way you are
calling it?

And if I look at the code, the vector should be clustered even if the
pdf is 0. The method which decides whether the vector should be assigned to
a particular cluster or not:

/**
 * Decides whether the vector should be classified or not based on the max pdf
 * value of the clusters and threshold value.
 *
 * @return whether the vector should be classified or not.
 */
private static boolean shouldClassify(Vector pdfPerCluster,
                                      Double clusterClassificationThreshold) {
  return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
}
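
To see why a threshold of 0 should not reject a vector under this comparison, here is a minimal stand-alone sketch of the same check. It is not the Mahout source: it assumes a plain double[] in place of Mahout's Vector, and the class name ThresholdSketch is made up for illustration.

```java
import java.util.Arrays;

// Hypothetical stand-in for the quoted method; uses double[] instead of a Mahout Vector.
final class ThresholdSketch {

    // Same comparison as the quoted shouldClassify: max pdf >= threshold.
    static boolean shouldClassify(double[] pdfPerCluster, double threshold) {
        double max = Arrays.stream(pdfPerCluster)
                .max()
                .orElse(Double.NEGATIVE_INFINITY); // no clusters: nothing to classify into
        return max >= threshold;
    }

    public static void main(String[] args) {
        // All pdf values are 0 and the threshold is 0: the vector is still classified.
        System.out.println(shouldClassify(new double[] {0.0, 0.0, 0.0}, 0.0));
        // Max pdf 0.3 is below a threshold of 0.5: the vector is treated as an outlier.
        System.out.println(shouldClassify(new double[] {0.1, 0.3}, 0.5));
    }
}
```

In this plain-Java version the comparison with threshold 0 fails only if the maximum is negative or NaN. Whether Mahout can produce such values is not established here.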

On 17-08-2012 20:06, Whitmore, Mattie wrote:

Hi Ted,

Yes, this is great!  I hope to start working with this algorithm in the
next couple of weeks.

I have a question about the 0.7 implementation of kmeans and the
clusterClassificationThreshold.  I have this value set at zero, but the
output is still showing that about 1/3 of my data is not assigned to a
cluster.  Am I using this value incorrectly?  I did a
kmeansdriver.run with the 0.5 and 0.7 api, and had the data pruned despite
the clusterClassificationThreshold = 0.

Thanks,

Mattie


-----Original Message-----
From: Ted Dunning [mailto:[email protected]]
Sent: Wednesday, August 15, 2012 5:20 PM
To: [email protected]
Subject: Re: Mahout-279/kmeans++

Mattie,

Would this help?

https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java

and

https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf

On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie <[email protected]> wrote:

Hi!

I have been using RandomSeedGenerator, and was hoping it had a patch like
that described in Mahout-279 since I want only 10 vectors out of a set of
more than 100,000,000.  I have been using canopy clustering for better
results, but still need to do a few passes of kmeans to determine my T, and
the random seed does take a long time.

The comments say that you are working on a kmeans++; I searched around but
couldn't confirm any more information about it.  Is a scalable kmeans++ in
the works? (I know research on the subject is quite new)

Thanks!



Mattie Whitmore
Mathematician/IR&D Software Engineer
HARRIS  Corporation - Advanced Information Solutions
301.837.5278
[email protected]
