Hello,
is there any update on this?
Does the answer I got here
http://stackoverflow.com/questions/12606701/mahout-k-means-has-different-behavior-based-on-the-number-of-mapping-tasks
sound reasonable to you? If it does, it seems there is a rather
serious implementation error in k-means. What do you think?
Nikos
On 09/27/12 13:17, nikos wrote:
Thank you for the answers,
so how could we check whether there is a problem in the reducer? And
if there is, could it also explain why some users experience slow
executions of K-means (
http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%[email protected]%3E)?
I should also mention that for a bigger k, near 100, on the same
dataset with the same parameters and the same initial centroids,
k-means converges in two iterations when it runs on one mapper; but
when I split the dataset across two mappers it never converges and
runs through all the iterations until it finishes (even if I set -x 100).
On 09/26/12 23:51, Jeff Eastman wrote:
Very odd indeed. Each mapper will start with the same set of clusters
and assign points to clusters (clusters observe the points) based
upon the cluster centers (identical) and the chosen distance measure
(also identical). At the end of the map step, each mapper sends its
trained clusters (with observation statistics s0, s1 & s2) to the
reducer(s) keyed by clusterId.
In the reducer, the trained clusters are accumulated by taking the
first and observing all the subsequent clusters (with the same
clusterId) with it. This is done by adding the s0, s1 and s2 values
from each observed cluster.
Finally, each cluster is closed and a new center & radius is
calculated before it is output to begin the next iteration. If there
is a problem in the implementation, it would be in the reducer where
the accumulations occur.
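The accumulation Jeff describes can be sketched in plain Java. This is a self-contained illustration, not Mahout's actual CIMapper/CIReducer code; the class and method names here are hypothetical stand-ins for a cluster carrying the observation statistics s0 (point count), s1 (sum of points), and s2 (sum of squares):

```java
import java.util.Arrays;

// Hypothetical stand-in for a trained cluster and its statistics.
class TrainedCluster {
    double s0;        // number of observed points
    double[] s1;      // componentwise sum of observed points
    double[] s2;      // componentwise sum of squared components

    TrainedCluster(int dim) {
        s1 = new double[dim];
        s2 = new double[dim];
    }

    // Mapper side: the cluster observes a point.
    void observe(double[] point) {
        s0 += 1;
        for (int i = 0; i < point.length; i++) {
            s1[i] += point[i];
            s2[i] += point[i] * point[i];
        }
    }

    // Reducer side: accumulate another mapper's partial statistics
    // for the same clusterId by adding its s0, s1 and s2.
    void observe(TrainedCluster other) {
        s0 += other.s0;
        for (int i = 0; i < s1.length; i++) {
            s1[i] += other.s1[i];
            s2[i] += other.s2[i];
        }
    }

    // Close: the new center is the mean, s1 / s0.
    double[] computeCenter() {
        double[] center = new double[s1.length];
        for (int i = 0; i < s1.length; i++) {
            center[i] = s1[i] / s0;
        }
        return center;
    }
}

public class KMeansAccumulation {
    public static void main(String[] args) {
        // One mapper sees all four points...
        TrainedCluster single = new TrainedCluster(1);
        single.observe(new double[]{1.0});
        single.observe(new double[]{2.0});
        single.observe(new double[]{3.0});
        single.observe(new double[]{6.0});

        // ...versus two mappers each seeing half, merged in the reducer.
        TrainedCluster mapper1 = new TrainedCluster(1);
        mapper1.observe(new double[]{1.0});
        mapper1.observe(new double[]{2.0});
        TrainedCluster mapper2 = new TrainedCluster(1);
        mapper2.observe(new double[]{3.0});
        mapper2.observe(new double[]{6.0});
        mapper1.observe(mapper2);

        // If the accumulation is correct, both paths yield the same center.
        System.out.println(Arrays.toString(single.computeCenter())); // [3.0]
        System.out.println(Arrays.toString(mapper1.computeCenter())); // [3.0]
    }
}
```

If the single-mapper and merged-mapper paths disagree on the closed centers, the bug would indeed be in this accumulation step, as Jeff suggests.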
On 9/26/12 3:16 PM, paritosh ranjan wrote:
Each input split (containing vectors, in this case) goes to a different
mapper task; the clusters (models) are trained using the vectors
present in each mapper task, and the models are updated in the reducer.
This process is repeated until convergence or the maximum iteration
count. Since different vectors went to different mapper tasks when two
mapper tasks were used, it took more iterations to converge, and the
results after the first iteration were also different.
Look into the CIMapper and CIReducer classes for a more detailed explanation.
On Thu, Sep 27, 2012 at 12:03 AM, paritosh ranjan <[email protected]> wrote:
And the same set of centroids was used for both executions?
On Wed, Sep 26, 2012 at 11:22 PM, nikos <[email protected]> wrote:
The centroids have been selected in a previous execution of Mahout
K-means via randomSeed generator.
On 09/26/2012 08:43 PM, paritosh ranjan wrote:
By saying "Using a pre-selected set of initial centroids", do you mean
that the initial centroids were the same in both executions?
In other words, how are you choosing your initial centroids?
On Wed, Sep 26, 2012 at 10:40 PM, nikos <[email protected]> wrote:
I experience a strange situation when running Mahout K-means: using a
pre-selected set of initial centroids, I run K-means on a SequenceFile
generated by lucene.vector. The run is for testing purposes, so the
file is small (around 10MB, ~10000 vectors).
When K-means is executed with a single mapper (the default, considering
the Hadoop split size, which in my cluster is 128MB), it reaches a given
clustering result in 2 iterations (Case A). However, I wanted to test
whether there would be any improvement or deterioration in the
algorithm's execution speed by firing more mapping tasks (the Hadoop
cluster has 6 nodes in total). I therefore set the
-Dmapred.max.split.size parameter to 5242880 bytes, in order to make
Mahout fire 2 mapping tasks (Case B). I indeed succeeded in starting
two mappers, but the strange thing was that the job finished after 5
iterations instead of 2, and that even at the first assignment of
points to clusters, the mappers made different choices compared to the
single-mapper execution. What I mean is that after close inspection of
the clusterDump for the first iteration in both cases, I found that in
Case B some points were not assigned to their closest cluster.
Could this behavior be justified by the existing K-means Mahout
implementation?
Thanks in advance.
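For reference, the two runs described above could be reproduced along these lines. This is a sketch only: the input/output paths and the distance measure are placeholders, and the flags follow the Mahout 0.x kmeans driver as I understand it:

```shell
# Case A: default split size, one mapper for a ~10MB input
bin/mahout kmeans \
  -i /path/to/vectors \
  -c /path/to/initial-centroids \
  -o /path/to/output-caseA \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -x 100 -cl

# Case B: cap the split size at 5242880 bytes so Hadoop fires two mappers
bin/mahout kmeans \
  -Dmapred.max.split.size=5242880 \
  -i /path/to/vectors \
  -c /path/to/initial-centroids \
  -o /path/to/output-caseB \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -x 100 -cl
```

With identical initial centroids and distance measure, the first-iteration assignments in the two cases should in principle match.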