[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2016-10-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15568591#comment-15568591
 ] 

Apache Spark commented on SPARK-3261:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15450

> KMeans clusterer can return duplicate cluster centers
> -
>
> Key: SPARK-3261
> URL: https://issues.apache.org/jira/browse/SPARK-3261
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>Priority: Minor
>  Labels: clustering
>
> This is a bad design choice.  I think that it is preferable to produce no 
> duplicate cluster centers. So instead of forcing the number of clusters to be 
> K, return at most K clusters.
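
As an illustration of the "return at most K clusters" proposal above, here is a minimal sketch that drops exact duplicate centers after clustering. It is not Spark's implementation; the function name `distinct_centers` and the sample data are hypothetical, and only exact duplicates are handled (near-equal floating-point centers are kept).

```python
from typing import List, Sequence, Tuple


def distinct_centers(centers: Sequence[Sequence[float]]) -> List[Tuple[float, ...]]:
    """Remove exact duplicate centers, keeping first-seen order.

    Hypothetical helper: the result has at most len(centers) entries,
    i.e. "at most K" rather than exactly K.
    """
    seen = set()
    unique = []
    for center in centers:
        key = tuple(center)  # tuples are hashable, so they can go in a set
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Illustrative data only: K = 4 requested, one center is repeated.
centers = [(0.0, 0.0), (1.0, 1.0), (0.0, 0.0), (2.0, 2.0)]
result = distinct_centers(centers)
```

Callers would then have to accept a model whose number of centers can be smaller than the requested K.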



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2016-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545011#comment-15545011
 ] 

Apache Spark commented on SPARK-3261:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15342

> KMeans clusterer can return duplicate cluster centers
> -
>
> Key: SPARK-3261
> URL: https://issues.apache.org/jira/browse/SPARK-3261
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>Priority: Minor
>  Labels: clustering
>
> This is a bad design choice.  I think that it is preferable to produce no 
> duplicate cluster centers. So instead of forcing the number of clusters to be 
> K, return at most K clusters.






[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2016-09-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15467564#comment-15467564
 ] 

Apache Spark commented on SPARK-3261:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14948

> KMeans clusterer can return duplicate cluster centers
> -
>
> Key: SPARK-3261
> URL: https://issues.apache.org/jira/browse/SPARK-3261
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>Priority: Minor
>  Labels: clustering
>
> This is a bad design choice.  I think that it is preferable to produce no 
> duplicate cluster centers. So instead of forcing the number of clusters to be 
> K, return at most K clusters.






[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2016-09-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15461156#comment-15461156
 ] 

Apache Spark commented on SPARK-3261:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14948

> KMeans clusterer can return duplicate cluster centers
> -
>
> Key: SPARK-3261
> URL: https://issues.apache.org/jira/browse/SPARK-3261
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>Assignee: Sean Owen
>Priority: Minor
>  Labels: clustering
>
> This is a bad design choice.  I think that it is preferable to produce no 
> duplicate cluster centers. So instead of forcing the number of clusters to be 
> K, return at most K clusters.






[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2015-02-24 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336169#comment-14336169
 ] 

Derrick Burns commented on SPARK-3261:
--

One solution is to run KMeansParallel or KMeansRandom after each round of 
Lloyd's algorithm to "replenish" empty clusters.

I have implemented the former in 
https://github.com/derrickburns/generalized-kmeans-clustering.

Performance is reasonable. 

Inspection reveals that the slow part of the KMeansParallel computation is the 
computation of the sum of the weights of the points in each cluster.  

However, this cost can be reduced by sampling the points and summing the 
contributions of only the sampled points. For large data sets, this 
approximation is appropriate.  
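
The sampling idea above can be sketched as follows. This is only an illustration under stated assumptions, not the code in generalized-kmeans-clustering: the comment does not say how the sample is drawn, so this sketch assumes independent (Bernoulli) sampling at a fixed `fraction` and scales each sampled weight by `1 / fraction` to keep the estimate unbiased. All names here are hypothetical.

```python
import random
from typing import List, Sequence


def closest(point: Sequence[float], centers: Sequence[Sequence[float]]) -> int:
    """Index of the center nearest to `point` by squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def estimate_cluster_weights(
    points: Sequence[Sequence[float]],
    weights: Sequence[float],
    centers: Sequence[Sequence[float]],
    fraction: float,
    seed: int = 0,
) -> List[float]:
    """Estimate the total weight assigned to each center from a sample.

    Assumption: each point is kept independently with probability
    `fraction`, and its weight is divided by `fraction` so the expected
    value of each total equals the exact sum.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(centers)
    for point, weight in zip(points, weights):
        if rng.random() < fraction:
            totals[closest(point, centers)] += weight / fraction
    return totals


# Illustrative data only. With fraction=1.0 every point is kept, so the
# result is the exact per-cluster weight sum.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
weights = [1.0, 2.0, 3.0, 4.0]
centers = [(0.0, 0.0), (5.0, 5.0)]
exact = estimate_cluster_weights(points, weights, centers, fraction=1.0)
approx = estimate_cluster_weights(points, weights, centers, fraction=0.5)
```

A smaller `fraction` does less work per pass at the price of a noisier estimate, which is why the approach suits large data sets.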

> KMeans clusterer can return duplicate cluster centers
> -
>
> Key: SPARK-3261
> URL: https://issues.apache.org/jira/browse/SPARK-3261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>Assignee: Derrick Burns
>  Labels: clustering
>
> This is a bad design choice.  I think that it is preferable to produce no 
> duplicate cluster centers. So instead of forcing the number of clusters to be 
> K, return at most K clusters.






[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2014-10-06 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161380#comment-14161380
 ] 

Derrick Burns commented on SPARK-3261:
--

Another possible source of duplicate cluster centers is the random 
initialization algorithm that samples with replacement.  It needs to sample 
without replacement.
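
A minimal sketch of the difference, using Python's standard library rather than Spark's sampling code (the function name and data are hypothetical). Sampling without replacement never picks the same input position twice, although duplicate centers are still possible if the data itself contains repeated points.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def init_centers_without_replacement(
    points: Sequence[Point], k: int, seed: int = 0
) -> List[Point]:
    """Pick up to k initial centers, each from a distinct input position.

    random.Random.sample draws without replacement; random.Random.choices
    (not used here) draws with replacement and can repeat a point.
    """
    rng = random.Random(seed)
    k = min(k, len(points))  # cannot draw more distinct positions than exist
    return rng.sample(list(points), k)


# Illustrative data only: five distinct points, three centers requested.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 2.0)]
initial = init_centers_without_replacement(points, k=3)
capped = init_centers_without_replacement(points, k=10)
```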

> KMeans clusterer can return duplicate cluster centers
> -
>
> Key: SPARK-3261
> URL: https://issues.apache.org/jira/browse/SPARK-3261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>Assignee: Derrick Burns
>
> This is a bad design choice.  I think that it is preferable to produce no 
> duplicate cluster centers. So instead of forcing the number of clusters to be 
> K, return at most K clusters.






[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2014-10-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157115#comment-14157115
 ] 

Apache Spark commented on SPARK-3261:
-

User 'derrickburns' has created a pull request for this issue:
https://github.com/apache/spark/pull/2634

> KMeans clusterer can return duplicate cluster centers
> -
>
> Key: SPARK-3261
> URL: https://issues.apache.org/jira/browse/SPARK-3261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>
> This is a bad design choice.  I think that it is preferable to produce no 
> duplicate cluster centers. So instead of forcing the number of clusters to be 
> K, return at most K clusters.






[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2014-09-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14136445#comment-14136445
 ] 

Apache Spark commented on SPARK-3261:
-

User 'derrickburns' has created a pull request for this issue:
https://github.com/apache/spark/pull/2419

> KMeans clusterer can return duplicate cluster centers
> -
>
> Key: SPARK-3261
> URL: https://issues.apache.org/jira/browse/SPARK-3261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>
> This is a bad design choice.  I think that it is preferable to produce no 
> duplicate cluster centers. So instead of forcing the number of clusters to be 
> K, return at most K clusters.






[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2014-08-27 Thread Derrick Burns (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113095#comment-14113095
 ] 

Derrick Burns commented on SPARK-3261:
--

This choice also adversely affects performance.  I just ran clustering on 1.3M 
points, asking for 10,000 clusters.  The run produced only 1019 unique 
cluster centers.  The original algorithm ran for 4.5 hours.  The algorithm 
that does not allow duplicate cluster centers completed in 45 minutes, a 6x 
speedup on this dataset. 

> KMeans clusterer can return duplicate cluster centers
> -
>
> Key: SPARK-3261
> URL: https://issues.apache.org/jira/browse/SPARK-3261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>
> This is a bad design choice.  I think that it is preferable to produce no 
> duplicate cluster centers. So instead of forcing the number of clusters to be 
> K, return at most K clusters.


