Re: Understanding Mahout KMeans

Jeff Eastman Wed, 15 Aug 2012 18:15:16 -0700

1. True, the KMeansCombiner was removed and the new clustering implementations don't use combiners. Instead, all of the points assigned to a cluster by the mapper are observed() by that cluster, and the clusters with their raw observation statistics are passed through to each reducer. The number of clusters has to fit in memory in each mapper anyway, and counting the observations there is a lot less plumbing than with a combiner (which might or might not be run at all). All the clusters are output (k records) at the end of each mapper's cleanup() method, keyed by the clusterId.

1*. Each reducer then receives #mappers Clusters for each key. It takes the first one, with its observation statistics, and then observes all of the remaining Clusters with that distinguished Cluster. That observe(Cluster) method does the summing of the observation statistics. At the end of processing each key, a new ClusterClassifier is created on the one distinguished cluster and its close() method calls computeParameters() before it is output.

2. No, I don't think so. Observing a vector with an empty cluster will add its observation statistics and then computeParameters() will properly set its centroid before it is output.
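
To make points 1, 1* and 2 concrete, here is a minimal sketch of the observe/merge/compute flow. ToyCluster and everything in it are hypothetical stand-ins written for this illustration, not Mahout's AbstractCluster; only the roles of s0 (count), s1 (sum), observe() and computeParameters() are taken from the discussion above. Resetting the statistics inside computeParameters() is an assumption of the sketch.

```java
import java.util.Arrays;

/** Toy stand-in for a Mahout cluster. Hypothetical class, not Mahout's AbstractCluster. */
class ToyCluster {
    private final int id;
    private double[] center;
    private double s0;          // number of points observed since the last update
    private final double[] s1;  // per-dimension sum of those points

    public ToyCluster(int id, double[] center) {
        this.id = id;
        this.center = center.clone();
        this.s1 = new double[center.length];
    }

    /** Mapper side: fold one assigned point into the raw statistics. */
    public void observe(double[] point) {
        s0 += 1;
        for (int i = 0; i < s1.length; i++) {
            s1[i] += point[i];
        }
    }

    /** Reducer side: fold in another mapper's partial statistics for the same clusterId. */
    public void observe(ToyCluster other) {
        s0 += other.s0;
        for (int i = 0; i < s1.length; i++) {
            s1[i] += other.s1[i];
        }
    }

    /** Centroid = s1 / s0. Clearing the statistics afterwards is an assumption of this sketch. */
    public void computeParameters() {
        if (s0 == 0) {
            return; // nothing observed: keep the old center
        }
        double[] updated = new double[s1.length];
        for (int i = 0; i < s1.length; i++) {
            updated[i] = s1[i] / s0;
        }
        center = updated;
        s0 = 0;
        Arrays.fill(s1, 0.0);
    }

    public int getId() { return id; }
    public double getS0() { return s0; }
    public double[] getCenter() { return center.clone(); }

    public static void main(String[] args) {
        // Two mappers each hold their own copy of cluster 0 and observe their local points.
        ToyCluster fromMapperA = new ToyCluster(0, new double[] {0.0, 0.0});
        fromMapperA.observe(new double[] {1.0, 1.0});
        fromMapperA.observe(new double[] {3.0, 3.0});
        ToyCluster fromMapperB = new ToyCluster(0, new double[] {0.0, 0.0});
        fromMapperB.observe(new double[] {5.0, 5.0});

        // The reducer keeps the first cluster for the key and observes the rest with it.
        fromMapperA.observe(fromMapperB);
        fromMapperA.computeParameters();
        System.out.println(Arrays.toString(fromMapperA.getCenter())); // prints [3.0, 3.0]

        // Seeding: an empty cluster observes its own seed point exactly once.
        ToyCluster seed = new ToyCluster(1, new double[] {2.0, 4.0});
        seed.observe(new double[] {2.0, 4.0});
        seed.computeParameters();
        System.out.println(Arrays.toString(seed.getCenter())); // prints [2.0, 4.0]
    }
}
```

In this toy version the seed point ends up as the centroid with a count of one, and because the statistics are cleared in computeParameters() it does not carry over into the next iteration's sums.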



On 8/15/12 8:50 PM, Lance Norskog wrote:

It is possible to run the M/R jobs inside Eclipse or another IDE with
small datasets. I learned a lot from single-stepping through some of
the more complex code.

On Wed, Aug 15, 2012 at 10:08 AM, Aniruddha Basak <[email protected]> wrote:

Hi,
I am trying to understand the KMeans implementation in Mahout.
A few questions come to mind:

  1.  In the ClusterIteration.IterateMR(), no combiner class has been declared.
Looking at CIMapper and CIReducer, I could not find where the new centroids
are computed at the end of each iteration.
     *   I expected at some point the "SUM" (as in Cluster.S1) of the points 
assigned to a cluster will be divided by the number of points (Cluster.S0). The 
computeCentroid() method in AbstractCluster class does that but I could not find whether 
it was called or not.
  2.  While generating the cluster centroids as an initial guess, i.e. in
RandomSeedGenerator.buildRandom(), why is the observe() method called for
each cluster? I noticed this observe() method records the sum of points
assigned to that cluster. Then, is that point (which was chosen as the
cluster center) not counted twice?

Can someone please help me answer these questions?

Regards,
Aniruddha
