[jira] [Commented] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets

2015-01-12 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274568#comment-14274568
 ] 

Xiangrui Meng commented on SPARK-5056:
--

This is along the same direction with our discussion in SPARK-4510. If we 
choose a sample, is there any theoretical guarantee on the convergence? If we 
have 1 billion instances, what sample size would be proper? The original paper 
https://lirias.kuleuven.be/handle/123456789/426399, if I found the correct one, 
hasn't received many citations. 

In general, I think this algorithm is out of MLlib's scope. If someone is 
interested in implementing this algorithm, it would be best maintained outside 
Spark as a 3rd-party package. I'm going to mark it as Won't Fix, but feel 
free to reopen it if there are things I missed.

 Implementing Clara k-medoids clustering algorithm for large datasets
 

 Key: SPARK-5056
 URL: https://issues.apache.org/jira/browse/SPARK-5056
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Tomislav Milinovic
Priority: Minor
  Labels: features

 There is a specific k-medoids clustering algorithm for large datasets. The 
 algorithm is called Clara in R, and is fully described in chapter 3 of 
 Finding Groups in Data: An Introduction to Cluster Analysis. by Kaufman, L 
 and Rousseeuw, PJ (1990). 
 The algorithm considers sub-datasets of fixed size (sampsize) such that the 
 time and storage requirements become linear in n rather than quadratic. Each 
 sub-dataset is partitioned into k clusters using the same algorithm as in 
 Partinioning around Medoids (PAM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets

2015-01-10 Thread Tomislav Milinovic (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272458#comment-14272458
 ] 

Tomislav Milinovic commented on SPARK-5056:
---

Partitioning Around Medoids (PAM) complexity of each iteration is O(k(n-k)2)
 For large values of n and k, such computation becomes very costly.

Clustering Large Applications (CLARA) is fully described in chapter 3 of 
Kaufman, L. and Rousseeuw, P.J. (1990) Finding Groups in Data: An Introduction 
to Cluster Analysis.Wiley, New York. 
Compared to other partitioning methods such as PAM, it can deal with much 
larger datasets. Internally, this is achieved by considering sub-datasets of 
fixed size (sampsize) such that the time and storage requirements become linear 
in n rather than quadratic.
CLARA complexity of each Iteration is: O(ks2 + k(n-k))
s: the size of the sample
k: number of clusters
n: number of objects


 Implementing Clara k-medoids clustering algorithm for large datasets
 

 Key: SPARK-5056
 URL: https://issues.apache.org/jira/browse/SPARK-5056
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Tomislav Milinovic
Priority: Minor
  Labels: features

 There is a specific k-medoids clustering algorithm for large datasets. The 
 algorithm is called Clara in R, and is fully described in chapter 3 of 
 Finding Groups in Data: An Introduction to Cluster Analysis. by Kaufman, L 
 and Rousseeuw, PJ (1990). 
 The algorithm considers sub-datasets of fixed size (sampsize) such that the 
 time and storage requirements become linear in n rather than quadratic. Each 
 sub-dataset is partitioned into k clusters using the same algorithm as in 
 Partinioning around Medoids (PAM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets

2015-01-09 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272277#comment-14272277
 ] 

Xiangrui Meng commented on SPARK-5056:
--

[~tmilinovic] We had some discussion in SPARK-4510 about the complexity of 
k-medoids possible solutions. Does the proposed algorithm have better 
complexity?

 Implementing Clara k-medoids clustering algorithm for large datasets
 

 Key: SPARK-5056
 URL: https://issues.apache.org/jira/browse/SPARK-5056
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Tomislav Milinovic
Priority: Minor
  Labels: features

 There is a specific k-medoids clustering algorithm for large datasets. The 
 algorithm is called Clara in R, and is fully described in chapter 3 of 
 Finding Groups in Data: An Introduction to Cluster Analysis. by Kaufman, L 
 and Rousseeuw, PJ (1990). 
 The algorithm considers sub-datasets of fixed size (sampsize) such that the 
 time and storage requirements become linear in n rather than quadratic. Each 
 sub-dataset is partitioned into k clusters using the same algorithm as in 
 Partinioning around Medoids (PAM).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org