[jira] [Commented] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets
[ https://issues.apache.org/jira/browse/SPARK-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274568#comment-14274568 ] Xiangrui Meng commented on SPARK-5056: -- This is along the same direction with our discussion in SPARK-4510. If we choose a sample, is there any theoretical guarantee on the convergence? If we have 1 billion instances, what sample size would be proper? The original paper https://lirias.kuleuven.be/handle/123456789/426399, if I found the correct one, hasn't received many citations. In general, I think this algorithm is out of MLlib's scope. If someone is interested in implementing this algorithm, it would be best maintained outside Spark as a 3rd-party package. I'm going to mark it as Won't Fix, but feel free to reopen it if there are things I missed. Implementing Clara k-medoids clustering algorithm for large datasets Key: SPARK-5056 URL: https://issues.apache.org/jira/browse/SPARK-5056 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Tomislav Milinovic Priority: Minor Labels: features There is a specific k-medoids clustering algorithm for large datasets. The algorithm is called Clara in R, and is fully described in chapter 3 of Finding Groups in Data: An Introduction to Cluster Analysis. by Kaufman, L and Rousseeuw, PJ (1990). The algorithm considers sub-datasets of fixed size (sampsize) such that the time and storage requirements become linear in n rather than quadratic. Each sub-dataset is partitioned into k clusters using the same algorithm as in Partinioning around Medoids (PAM). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets
[ https://issues.apache.org/jira/browse/SPARK-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272458#comment-14272458 ] Tomislav Milinovic commented on SPARK-5056: --- Partitioning Around Medoids (PAM) complexity of each iteration is O(k(n-k)2) For large values of n and k, such computation becomes very costly. Clustering Large Applications (CLARA) is fully described in chapter 3 of Kaufman, L. and Rousseeuw, P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis.Wiley, New York. Compared to other partitioning methods such as PAM, it can deal with much larger datasets. Internally, this is achieved by considering sub-datasets of fixed size (sampsize) such that the time and storage requirements become linear in n rather than quadratic. CLARA complexity of each Iteration is: O(ks2 + k(n-k)) s: the size of the sample k: number of clusters n: number of objects Implementing Clara k-medoids clustering algorithm for large datasets Key: SPARK-5056 URL: https://issues.apache.org/jira/browse/SPARK-5056 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Tomislav Milinovic Priority: Minor Labels: features There is a specific k-medoids clustering algorithm for large datasets. The algorithm is called Clara in R, and is fully described in chapter 3 of Finding Groups in Data: An Introduction to Cluster Analysis. by Kaufman, L and Rousseeuw, PJ (1990). The algorithm considers sub-datasets of fixed size (sampsize) such that the time and storage requirements become linear in n rather than quadratic. Each sub-dataset is partitioned into k clusters using the same algorithm as in Partinioning around Medoids (PAM). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5056) Implementing Clara k-medoids clustering algorithm for large datasets
[ https://issues.apache.org/jira/browse/SPARK-5056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272277#comment-14272277 ] Xiangrui Meng commented on SPARK-5056: -- [~tmilinovic] We had some discussion in SPARK-4510 about the complexity of k-medoids possible solutions. Does the proposed algorithm have better complexity? Implementing Clara k-medoids clustering algorithm for large datasets Key: SPARK-5056 URL: https://issues.apache.org/jira/browse/SPARK-5056 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Tomislav Milinovic Priority: Minor Labels: features There is a specific k-medoids clustering algorithm for large datasets. The algorithm is called Clara in R, and is fully described in chapter 3 of Finding Groups in Data: An Introduction to Cluster Analysis. by Kaufman, L and Rousseeuw, PJ (1990). The algorithm considers sub-datasets of fixed size (sampsize) such that the time and storage requirements become linear in n rather than quadratic. Each sub-dataset is partitioned into k clusters using the same algorithm as in Partinioning around Medoids (PAM). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org