Sorry, I think that more commonly, if aggregating transpose is to be used,
the centroid assignments are better kept as the row keys of the matrix D
(so D := A), and the aggregating transpose is performed on the matrix
(1 | D)', i.e., (1 cbind D).t, so that the first row of the result contains
the counts of points per cluster and we can finish up the centroid
recomputation via

M = (1 | D)'
C = M(2:,:) with each row Hadamard-divided by the first row of counts M(1,:)
(using Golub-Van Loan notation for subblocking; column j of C is then the
centroid of cluster j)
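
To make the arithmetic concrete, here is a small in-memory sketch in plain
Python. It is not Mahout Samsara code and does not use any Mahout API; the
function names (aggregating_transpose, recompute_centroids) and the handling
of empty clusters are assumptions made for illustration only. It emulates an
aggregating transpose by summing rows that share a key into the same column of
the result, then divides by the counts row.

```python
def aggregating_transpose(keys, rows, k):
    """Emulate an aggregating transpose in memory (illustrative helper).

    Rows that share the same integer key are summed into the same
    column of the transposed result.
    """
    width = len(rows[0])
    # One result row per input column, one result column per key (cluster).
    result = [[0.0] * k for _ in range(width)]
    for key, row in zip(keys, rows):
        for j, value in enumerate(row):
            result[j][key] += value
    return result


def recompute_centroids(assignments, points, k):
    """Compute M = (1 | D)' and divide M(2:,:) by the counts row M(1,:)."""
    # (1 | D): prepend a column of ones so the first row of M holds counts.
    augmented = [[1.0] + list(p) for p in points]
    m = aggregating_transpose(assignments, augmented, k)
    counts, sums = m[0], m[1:]
    # Assumption: an empty cluster is left at the origin rather than re-seeded.
    return [[s / n if n else 0.0 for s, n in zip(row, counts)] for row in sums]


# Three 2-d points; the first two belong to cluster 0, the last to cluster 1.
points = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
assignments = [0, 0, 1]

# Column j of the result is the centroid of cluster j.
centroids_t = recompute_centroids(assignments, points, k=2)
print(centroids_t)  # [[2.0, 10.0], [3.0, 10.0]]
```

In the distributed setting the same sums would come out of the transpose of a
matrix keyed by cluster assignment, and only the small matrix M would need to
be collected to finish the division.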

On Wed, Mar 29, 2017 at 9:02 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> The simplest scheme is to initialize a distributed matrix of the shape D :=
> (0 | A), where A is your dataset and 0 is a single column holding the
> current centroid assignment, and to distribute the current centroid matrix C
> via matrix broadcast (assuming there are few enough centers).
>
> Then alternately run cluster assignment within the mapBlock() operator on D
> and recompute the new centroids C afterwards. Recomputation of
> centroids can be done via aggregating transpose.
>
> Of course, a better scheme includes pre-sketching (k-means||) and use of
> the triangle inequality during recomputations.
>
> On Wed, Mar 29, 2017 at 8:30 AM, KHATWANI PARTH BHARAT <
> h2016...@pilani.bits-pilani.ac.in> wrote:
>
>> Sir,
>> I am trying to write the k-means clustering algorithm using Mahout Samsara,
>> but I am a bit confused about how to leverage the Distributed Row Matrix
>> for it. Can anybody help me with this?
>>
>>
>>
>>
>>
>> Thanks
>> Parth Khatwani
>>
>
>
