Stuti,

I'm answering your questions in order:

1. From the MLlib source
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L159
you can see that clustering stops when we have reached *maxIterations* or
there are no more *activeRuns*.

KMeans is executed *runs* times in parallel, and the best clustering found
over all *runs* is returned. Each run stops when either:

- the number of iterations reaches *maxIterations*, or
- every cluster center moved less than *epsilon* in the last iteration.

So in your example, a run that meets the *epsilon* condition after 2
iterations stops there; it does not go on to complete all 10.
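
To make the per-run stopping rule concrete, here is a minimal plain-Scala sketch of a single run. It is not MLlib's actual implementation: the names EpsilonSketch and runOnce, the sample points and the starting centers are all made up for illustration, and it only mirrors the two stopping conditions described above.

```scala
object EpsilonSketch {
  type Point = Array[Double]

  // Euclidean distance between two points of the same dimension.
  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // One k-means run. Returns the final centers and the number of
  // iterations actually performed.
  def runOnce(points: Seq[Point], initial: Seq[Point],
              maxIterations: Int, epsilon: Double): (Seq[Point], Int) = {
    var centers: Seq[Point] = initial
    var iteration = 0
    var converged = false
    // Stop on whichever comes first: maxIterations or convergence.
    while (iteration < maxIterations && !converged) {
      // Group the points by the index of their nearest center.
      val assigned = points.groupBy { p =>
        centers.indices.minBy(i => distance(p, centers(i)))
      }
      // Move each center to the mean of its points; keep it if it has none.
      val updated: Seq[Point] = centers.indices.map { i =>
        assigned.get(i) match {
          case Some(ps) =>
            Array.tabulate(ps.head.length)(d => ps.map(_(d)).sum / ps.size)
          case None => centers(i)
        }
      }
      // Converged when every center moved less than epsilon.
      converged = centers.zip(updated).forall {
        case (oldC, newC) => distance(oldC, newC) < epsilon
      }
      centers = updated
      iteration += 1
    }
    (centers, iteration)
  }

  def main(args: Array[String]): Unit = {
    val points = Seq(Array(1.0), Array(2.0), Array(10.0), Array(11.0))
    val initial = Seq(Array(0.0), Array(5.0))
    val (_, iterations) = runOnce(points, initial, maxIterations = 10, epsilon = 0.1)
    println(iterations) // prints 2: the run stops well before maxIterations
  }
}
```

With this toy data the centers settle at 1.5 and 10.5 after the first iteration, the second iteration moves them by 0, and the run ends there even though maxIterations is 10.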

2. I can't find the Mahout source code that refers to the "Convergence
Threshold", but I suspect that threshold and MLlib's *epsilon* are the same
concept. There is no concept of parallel runs in Mahout.

Ref: https://mahout.apache.org/users/clustering/k-means-clustering.html

3. To set MLlib's KMeans to use an *epsilon* of 0.1 and then train the model,
you can do the following:

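import org.apache.spark.mllib.clustering.KMeans

// k, maxIterations, runs, initializationMode: the same values you would pass to KMeans.train()
val model = new KMeans().setK(k).setMaxIterations(maxIterations).setRuns(runs)
  .setInitializationMode(initializationMode)
  .setEpsilon(0.1)  // a run stops once every center moves less than 0.1
  .run(data)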

Enjoy,

Long Pham
Software Engineer at Adatao, Inc.
longp...@adatao.com
On May 15, 2014 7:29 PM, "Stuti Awasthi" <stutiawas...@hcl.com> wrote:

>  Hi All,
>
>
>
> Any ideas on this ??
>
>
>
> Thanks
>
> Stuti Awasthi
>
>
>
> *From:* Stuti Awasthi
> *Sent:* Wednesday, May 14, 2014 6:20 PM
> *To:* user@spark.apache.org
> *Subject:* Understanding epsilon in KMeans
>
>
>
> Hi All,
>
>
>
> I wanted to understand the functionality of epsilon in KMeans in Spark
> MLlib.
>
>
>
> As per documentation :
>
> distance threshold within which we've consider centers to have
> converged. If all centers move less than this *Euclidean* distance, we
> stop iterating one run.
>
>
>
> I had assumed that if the centers move less than the epsilon value then
> clustering stops, but then what is meant by “we stop iterating one run”?
>
>
> Now suppose I have given maxIterations=10 and epsilon=0.1, and assume
> that after only 2 iterations the epsilon condition is met, i.e. the
> centers are now moving less than 0.1.
>
>
>
> Now what happens ?? The whole 10 iterations are completed OR the
> Clustering stops ??
>
>
>
> My 2nd query is about Mahout. There is a configuration param, “Convergence
> Threshold (cd)”, which states: “in an iteration, the centroids don’t move
> more than this distance, no further iterations are done and clustering
> stops.”
>
>
>
> So are epsilon and cd similar?
>
>
>
> 3rd query :
>
> How do I pass epsilon as a configurable param? KMeans.train() does not
> provide a way, but in the code I can see “setEpsilon” as a method. So if I
> want to pass epsilon=0.1, how may I do that?
>
>
>
> Pardon my ignorance
>
>
>
> Thanks
>
> Stuti Awasthi
