[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers
[ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15568591#comment-15568591 ]

Apache Spark commented on SPARK-3261:
-------------------------------------

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15450

> KMeans clusterer can return duplicate cluster centers
> -----------------------------------------------------
>
>                 Key: SPARK-3261
>                 URL: https://issues.apache.org/jira/browse/SPARK-3261
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Derrick Burns
>            Priority: Minor
>              Labels: clustering
>
> This is a bad design choice. I think that it is preferable to produce no
> duplicate cluster centers. So instead of forcing the number of clusters to be
> K, return at most K clusters.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers
[ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545011#comment-15545011 ]

Apache Spark commented on SPARK-3261:
-------------------------------------

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15342
[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers
[ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15467564#comment-15467564 ]

Apache Spark commented on SPARK-3261:
-------------------------------------

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14948
[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers
[ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15461156#comment-15461156 ]

Apache Spark commented on SPARK-3261:
-------------------------------------

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14948

> KMeans clusterer can return duplicate cluster centers
> -----------------------------------------------------
>
>                 Key: SPARK-3261
>                 URL: https://issues.apache.org/jira/browse/SPARK-3261
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Derrick Burns
>            Assignee: Sean Owen
>            Priority: Minor
>              Labels: clustering
[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers
[ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336169#comment-14336169 ]

Derrick Burns commented on SPARK-3261:
--------------------------------------

One solution is to run KMeansParallel or KMeansRandom after each Lloyd's round to "replenish" empty clusters. I have implemented the former in https://github.com/derrickburns/generalized-kmeans-clustering.

Performance is reasonable. Inspection reveals that the slow part of the KMeansParallel computation is summing the weights of the points in each cluster. However, this cost can be reduced by sampling the points and summing the contributions of only the sampled points. For large data sets, this approach is appropriate.

> KMeans clusterer can return duplicate cluster centers
> -----------------------------------------------------
>
>                 Key: SPARK-3261
>                 URL: https://issues.apache.org/jira/browse/SPARK-3261
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Derrick Burns
>            Assignee: Derrick Burns
>              Labels: clustering
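The "replenish" idea in the comment above can be illustrated with a minimal single-machine sketch. It is not the code from the linked repository or from Spark: it uses the simpler random re-seeding variant rather than KMeansParallel, and the function names `squared_distance` and `lloyd_round_with_replenish` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_round_with_replenish(points, centers, rng):
    """Run one Lloyd's round, then re-seed any cluster that ended up empty.

    Assumption: empty clusters are replaced by random points that are not
    already centers (the "KMeansRandom" flavor, not KMeansParallel).
    """
    # Assignment step: group every point under its nearest center.
    members = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(p, centers[i]))
        members[nearest].append(tuple(p))

    # Update step: each non-empty cluster's center becomes the mean of its points.
    new_centers = []
    for group in members:
        if group:
            dim = len(group[0])
            new_centers.append(
                tuple(sum(p[d] for p in group) / len(group) for d in range(dim)))

    # Replenish step: draw one replacement per empty cluster, without
    # replacement, from distinct points that are not already centers.
    taken = set(new_centers)
    candidates = [p for p in dict.fromkeys(map(tuple, points)) if p not in taken]
    missing = len(centers) - len(new_centers)
    new_centers.extend(rng.sample(candidates, min(missing, len(candidates))))
    return new_centers
```

For example, starting from two identical centers, one of them attracts no points and is replaced by a sampled point, so the round still returns distinct centers. If there are too few distinct points to fill every empty cluster, the function returns fewer than K centers.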
[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers
[ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161380#comment-14161380 ]

Derrick Burns commented on SPARK-3261:
--------------------------------------

Another possible source of duplicate cluster centers is the random initialization algorithm that samples with replacement. It needs to sample without replacement.

> KMeans clusterer can return duplicate cluster centers
> -----------------------------------------------------
>
>                 Key: SPARK-3261
>                 URL: https://issues.apache.org/jira/browse/SPARK-3261
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Derrick Burns
>            Assignee: Derrick Burns
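As a rough illustration of the point about initialization, the sketch below picks initial centers by sampling without replacement. It is plain Python rather than Spark's implementation, and the function name `init_random_centers` is hypothetical. The extra step of deduplicating points by value is an assumption added here, since two distinct rows can hold the same coordinates.

```python
import random


def init_random_centers(points, k, seed=None):
    """Pick up to k distinct initial centers by sampling without replacement."""
    rng = random.Random(seed)
    # Deduplicate by value first (an added assumption): distinct rows can
    # still contain identical coordinates.
    unique_points = list(dict.fromkeys(tuple(p) for p in points))
    # random.sample draws without replacement, so no point is chosen twice.
    return rng.sample(unique_points, min(k, len(unique_points)))
```

When the data contain fewer than k distinct points, this returns fewer than k centers, which matches the "at most K clusters" behavior requested in the issue.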
[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers
[ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157115#comment-14157115 ]

Apache Spark commented on SPARK-3261:
-------------------------------------

User 'derrickburns' has created a pull request for this issue:
https://github.com/apache/spark/pull/2634

> KMeans clusterer can return duplicate cluster centers
> -----------------------------------------------------
>
>                 Key: SPARK-3261
>                 URL: https://issues.apache.org/jira/browse/SPARK-3261
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Derrick Burns
[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers
[ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136445#comment-14136445 ]

Apache Spark commented on SPARK-3261:
-------------------------------------

User 'derrickburns' has created a pull request for this issue:
https://github.com/apache/spark/pull/2419
[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers
[ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113095#comment-14113095 ]

Derrick Burns commented on SPARK-3261:
--------------------------------------

This choice also adversely affects performance. I just ran clustering on 1.3M points, asking for 10,000 clusters. This clustering run resulted in 1019 unique cluster centers. The original algorithm ran for 4.5 hours. The algorithm that does not allow duplicate cluster centers completed in 45 minutes, a 6x speedup on this dataset.
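The "return at most K clusters" behavior can be sketched as a post-processing step that drops duplicate centers. This is only an illustration in plain Python, not the approach taken in any of the linked pull requests; the function name `distinct_centers` and the `tol` parameter are hypothetical.

```python
def distinct_centers(centers, tol=0.0):
    """Return the centers with duplicates removed, preserving order.

    Two centers count as duplicates when their Euclidean distance is at
    most tol (tol=0.0 means exact equality).
    """
    kept = []
    for c in centers:
        c = tuple(c)
        is_duplicate = any(
            sum((x - y) ** 2 for x, y in zip(c, other)) <= tol * tol
            for other in kept
        )
        if not is_duplicate:
            kept.append(c)
    return kept
```

The pairwise comparison is quadratic in the number of centers, which is acceptable for a sketch but would likely need a hash-based or spatial approach for something like the 10,000 requested clusters mentioned above.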