subject:"\[jira\] \[Commented\] \(SPARK\-5056\) Implementing Clara k\-medoids clustering algorithm for large datasets"

[jira] [Commented] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets

2015-01-12 Thread Xiangrui Meng (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274568#comment-14274568
]

Xiangrui Meng commented on SPARK-5056:
--

This is along the same direction with our discussion in SPARK-4510. If we
choose a sample, is there any theoretical guarantee on the convergence? If we
have 1 billion instances, what sample size would be proper? The original paper
https://lirias.kuleuven.be/handle/123456789/426399, if I found the correct one,
hasn't received many citations.

In general, I think this algorithm is out of MLlib's scope. If someone is
interested in implementing this algorithm, it would be best maintained outside
Spark as a 3rd-party package. I'm going to mark it as Won't Fix, but feel
free to reopen it if there are things I missed.

Implementing Clara k-medoids clustering algorithm for large datasets

Key: SPARK-5056
URL: https://issues.apache.org/jira/browse/SPARK-5056
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Tomislav Milinovic
Priority: Minor
Labels: features

There is a specific k-medoids clustering algorithm for large datasets. The
algorithm is called Clara in R, and is fully described in chapter 3 of
Finding Groups in Data: An Introduction to Cluster Analysis. by Kaufman, L
and Rousseeuw, PJ (1990).
The algorithm considers sub-datasets of fixed size (sampsize) such that the
time and storage requirements become linear in n rather than quadratic. Each
sub-dataset is partitioned into k clusters using the same algorithm as in
Partinioning around Medoids (PAM).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets

2015-01-10 Thread Tomislav Milinovic (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272458#comment-14272458
]

Tomislav Milinovic commented on SPARK-5056:
---

Partitioning Around Medoids (PAM) complexity of each iteration is O(k(n-k)2)
For large values of n and k, such computation becomes very costly.

Clustering Large Applications (CLARA) is fully described in chapter 3 of
Kaufman, L. and Rousseeuw, P.J. (1990) Finding Groups in Data: An Introduction
to Cluster Analysis.Wiley, New York.
Compared to other partitioning methods such as PAM, it can deal with much
larger datasets. Internally, this is achieved by considering sub-datasets of
fixed size (sampsize) such that the time and storage requirements become linear
in n rather than quadratic.
CLARA complexity of each Iteration is: O(ks2 + k(n-k))
s: the size of the sample
k: number of clusters
n: number of objects

Implementing Clara k-medoids clustering algorithm for large datasets

Key: SPARK-5056
URL: https://issues.apache.org/jira/browse/SPARK-5056
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Tomislav Milinovic
Priority: Minor
Labels: features

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets

2015-01-09 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272277#comment-14272277
 ] 

Xiangrui Meng commented on SPARK-5056:
--

[~tmilinovic] We had some discussion in SPARK-4510 about the complexity of 
k-medoids possible solutions. Does the proposed algorithm have better 
complexity?

 Implementing Clara k-medoids clustering algorithm for large datasets
 

 Key: SPARK-5056
 URL: https://issues.apache.org/jira/browse/SPARK-5056
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Tomislav Milinovic
Priority: Minor
  Labels: features

 There is a specific k-medoids clustering algorithm for large datasets. The 
 algorithm is called Clara in R, and is fully described in chapter 3 of 
 Finding Groups in Data: An Introduction to Cluster Analysis. by Kaufman, L 
 and Rousseeuw, PJ (1990). 
 The algorithm considers sub-datasets of fixed size (sampsize) such that the 
 time and storage requirements become linear in n rather than quadratic. Each 
 sub-dataset is partitioned into k clusters using the same algorithm as in 
 Partinioning around Medoids (PAM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets

[jira] [Commented] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets

[jira] [Commented] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets

3 matches

Site Navigation

Mail list logo

Footer information