Hello,
is there any update on this?
Does the answer I got here
http://stackoverflow.com/questions/12606701/mahout-k-means-has-different-behavior-based-on-the-number-of-mapping-tasks
sound reasonable to you? If it does, it seems there is a rather
serious implementation error in k-means. What do you think?
Nikos
On 09/27/12 13:17, nikos wrote:
Thank you for the answers,
so how could we check whether there is a problem in the reducer? And
if there is, could it also explain why some users experience slow
executions of K-means (
http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%[email protected]%3E)?
I should also mention that for a bigger k, near 100, on the same
dataset with the same parameters and the same initial centroids,
k-means converges in two iterations when it runs on one mapper; but
when I split the dataset across two mappers it never converges and
runs through all the iterations until it finishes (even if I set -x 100).
On 09/26/12 23:51, Jeff Eastman wrote:
Very odd indeed. Each mapper will start with the same set of clusters
and assign points to clusters (clusters observe the points) based
upon the cluster centers (identical) and the chosen distance measure
(also identical). At the end of the map step, each mapper sends its
trained clusters (with observation statistics s0, s1 & s2) to the
reducer(s) keyed by clusterId.
In the reducer, the trained clusters are accumulated by taking the
first and observing all the subsequent clusters (with the same
clusterId) with it. This is done by adding the s0, s1 and s2 values
from each observed cluster.
Finally, each cluster is closed and a new center & radius is
calculated before it is output to begin the next iteration. If there
is a problem in the implementation, it would be in the reducer where
the accumulations occur.
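The accumulation Jeff describes can be sketched in plain Java. This is a self-contained illustration, not Mahout's actual CIMapper/CIReducer code; the class and method names here are hypothetical stand-ins for a cluster carrying the observation statistics s0 (point count), s1 (sum of points), and s2 (sum of squares):

```java
import java.util.Arrays;

// Hypothetical stand-in for a trained cluster and its statistics.
class TrainedCluster {
    double s0;        // number of observed points
    double[] s1;      // componentwise sum of observed points
    double[] s2;      // componentwise sum of squared components

    TrainedCluster(int dim) {
        s1 = new double[dim];
        s2 = new double[dim];
    }

    // Mapper side: the cluster observes a point.
    void observe(double[] point) {
        s0 += 1;
        for (int i = 0; i < point.length; i++) {
            s1[i] += point[i];
            s2[i] += point[i] * point[i];
        }
    }

    // Reducer side: accumulate another mapper's partial statistics
    // for the same clusterId by adding its s0, s1 and s2.
    void observe(TrainedCluster other) {
        s0 += other.s0;
        for (int i = 0; i < s1.length; i++) {
            s1[i] += other.s1[i];
            s2[i] += other.s2[i];
        }
    }

    // Close: the new center is the mean, s1 / s0.
    double[] computeCenter() {
        double[] center = new double[s1.length];
        for (int i = 0; i < s1.length; i++) {
            center[i] = s1[i] / s0;
        }
        return center;
    }
}

public class KMeansAccumulation {
    public static void main(String[] args) {
        // One mapper sees all four points...
        TrainedCluster single = new TrainedCluster(1);
        single.observe(new double[]{1.0});
        single.observe(new double[]{2.0});
        single.observe(new double[]{3.0});
        single.observe(new double[]{6.0});

        // ...versus two mappers each seeing half, merged in the reducer.
        TrainedCluster mapper1 = new TrainedCluster(1);
        mapper1.observe(new double[]{1.0});
        mapper1.observe(new double[]{2.0});
        TrainedCluster mapper2 = new TrainedCluster(1);
        mapper2.observe(new double[]{3.0});
        mapper2.observe(new double[]{6.0});
        mapper1.observe(mapper2);

        // If the accumulation is correct, both paths yield the same center.
        System.out.println(Arrays.toString(single.computeCenter())); // [3.0]
        System.out.println(Arrays.toString(mapper1.computeCenter())); // [3.0]
    }
}
```

If the single-mapper and merged-mapper paths disagree on the closed centers, the bug would indeed be in this accumulation step, as Jeff suggests.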
On 9/26/12 3:16 PM, paritosh ranjan wrote:
Each input split (containing vectors, in this case) goes to a different
mapper task; the clusters (models) are trained using the vectors
present in each mapper task, and the models are updated in the reducer.
This process is repeated until convergence or the maximum iteration
count. Since different vectors went to different mapper tasks when two
mapper tasks were used, it took more iterations to converge, and the
results after the first iteration were also different.
Look into the CIMapper and CIReducer classes for a more detailed explanation.
On Thu, Sep 27, 2012 at 12:03 AM, paritosh ranjan <[email protected]> wrote:
And the same set of centroids was used for both executions?
On Wed, Sep 26, 2012 at 11:22 PM, nikos <[email protected]> wrote:
The centroids have been selected in a previous execution of Mahout
K-means via randomSeed generator.
On 09/26/2012 08:43 PM, paritosh ranjan wrote:
By saying "Using a pre-selected set of initial centroids", do you mean
that the initial centroids were the same in both executions?
In other words, how are you choosing your initial centroids?
On Wed, Sep 26, 2012 at 10:40 PM, nikos <[email protected]> wrote:
I experience a strange situation when running Mahout K-means: using a
pre-selected set of initial centroids, I run K-means on a SequenceFile
generated by lucene.vector. The run is for testing purposes, so the
file is small (around 10MB, ~10000 vectors).
When K-means is executed with a single mapper (the default, considering
the Hadoop split size, which in my cluster is 128MB), it reaches a given
clustering result in 2 iterations (Case A). However, I wanted to test
whether there would be any improvement or deterioration in the
algorithm's execution speed by firing more mapping tasks (the Hadoop
cluster has 6 nodes in total). I therefore set the
-Dmapred.max.split.size parameter to 5242880 bytes, in order to make
Mahout fire 2 mapping tasks (Case B). I indeed succeeded in starting
two mappers, but the strange thing was that the job finished after 5
iterations instead of 2, and that even at the first assignment of
points to clusters, the mappers made different choices compared to the
single-mapper execution. What I mean is that after close inspection of
the clusterDump for the first iteration in both cases, I found that in
Case B some points were not assigned to their closest cluster.
Could this behavior be justified by the existing K-means Mahout
implementation?
Thanks in advance.
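For reference, the two runs described above could be reproduced along these lines. This is a sketch only: the input/output paths and the distance measure are placeholders, and the flags follow the Mahout 0.x kmeans driver as I understand it:

```shell
# Case A: default split size, one mapper for a ~10MB input
bin/mahout kmeans \
  -i /path/to/vectors \
  -c /path/to/initial-centroids \
  -o /path/to/output-caseA \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -x 100 -cl

# Case B: cap the split size at 5242880 bytes so Hadoop fires two mappers
bin/mahout kmeans \
  -Dmapred.max.split.size=5242880 \
  -i /path/to/vectors \
  -c /path/to/initial-centroids \
  -o /path/to/output-caseB \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -x 100 -cl
```

With identical initial centroids and distance measure, the first-iteration assignments in the two cases should in principle match.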