And was the same set of centroids used for both executions?

On Wed, Sep 26, 2012 at 11:22 PM, nikos <[email protected]> wrote:
> The centroids have been selected in a previous execution of Mahout K-means
> via randomSeed generator.
>
> On 09/26/2012 08:43 PM, paritosh ranjan wrote:
>
>> By saying "Using the a pre-selected set of initial centroids" do you mean
>> that the initial centroids were same in both executions?
>> In other words, how are you choosing your initial centroids?
>>
>> On Wed, Sep 26, 2012 at 10:40 PM, nikos <[email protected]> wrote:
>>
>>> I experience a strange situation when running Mahout K-means: Using the a
>>> pre-selected set of initial centroids, I run K-means on a SequenceFile
>>> generated by lucene.vector. The run is for testing purposes, so the file is
>>> small (around 10MB~10000 vectors).
>>>
>>> When K-means is executed with a single mapper (the default considering the
>>> Hadoop split size which in my cluster is 128MB), it reaches a given
>>> clustering result in 2 iterations (Case A). However, I wanted to test if
>>> there would be any improvement/deterioration in the algorithm's execution
>>> speed by firing more mapping tasks (the Hadoop cluster has in total 6
>>> nodes). I therefore set the -Dmapred.max.split.size parameter to 5242880
>>> bytes, in order to make mahout fire 2 mapping tasks (Case B). I indeed
>>> succeeded in starting two mappers, but the strange thing was that the job
>>> finished after 5 iterations instead of 2, and that even at the first
>>> assignment of points to clusters, the mappers made different choices
>>> compared to the single-map execution. What I mean is that after close
>>> inspection of the clusterDump for the first iteration for both two cases, I
>>> found that in case B some points were not assigned to their closest
>>> cluster.
>>>
>>> Could this behavior be justified by the existing K-means Mahout
>>> implementation?
>>>
>>> Thanks in advance.
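
For reference, here is a minimal sketch of the textbook K-means assignment step, written in plain Python rather than taken from Mahout. It illustrates why, given identical initial centroids, the first assignment of points to clusters would be expected not to depend on the number of splits: each point is compared only against the shared centroids, never against other points. The function names and the way the data is cut into splits are hypothetical stand-ins for mappers, and the random 2-D points are made-up test data, not the vectors from this thread.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def assign_in_splits(points, centroids, num_splits):
    """Hypothetical stand-in for mappers: each split is assigned on its own."""
    size = -(-len(points) // num_splits)  # ceiling division
    result = []
    for start in range(0, len(points), size):
        result.extend(assign(points[start:start + size], centroids))
    return result


# Made-up data: 1000 random 2-D points, 5 of them reused as initial centroids.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(1000)]
centroids = rng.sample(points, 5)

one_split = assign_in_splits(points, centroids, 1)   # analogue of Case A
two_splits = assign_in_splits(points, centroids, 2)  # analogue of Case B
print(one_split == two_splits)  # prints True
```

Under these assumptions the two runs agree on every point, so a difference in the first-iteration assignments would have to come from something other than the split count alone, for example the centroids actually read by each run.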
