[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1407#issuecomment-49130099 QA results for PR 1407:br- This patch FAILED unit tests.br- This patch merges cleanlybr- This patch adds no public classesbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16712/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1407#discussion_r14925960 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LocalKMeans.scala --- @@ -59,6 +59,11 @@ private[mllib] object LocalKMeans extends Logging { cumulativeScore += weights(j) * KMeans.pointCost(curCenters, points(j)) j += 1 } + if (j == 0) { +logWarning(kMeansPlusPlus initialization ran out of distinct points for centers. + + s Using duplicate point for center k = $i.) +j = 1 --- End diff -- The code may be clearer if written in this way ~~~ centers(i) = if (j == 0) { logWarning(...) points(0).toDense } else { points(j - 1).toDense } ~~~ or ~~~ if (j == 0) { logWarning(...) centers(i) = points(0).toDense } else { centers(i) = points(j - 1).toDense } ~~~ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1407#discussion_r14926024 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala --- @@ -61,6 +61,30 @@ class KMeansSuite extends FunSuite with LocalSparkContext { assert(model.clusterCenters.head === center) } + test(no distinct points) { +val data = sc.parallelize(Array( + Vectors.dense(1.0, 2.0, 3.0), + Vectors.dense(1.0, 2.0, 3.0), + Vectors.dense(1.0, 2.0, 3.0) +)) +val center = Vectors.dense(1.0, 2.0, 3.0) + +// Make sure code runs. +var model = KMeans.train(data, k=2, maxIterations=1) +assert(model.clusterCenters.size === 2) + } + + test(more clusters than points) { +val data = sc.parallelize(Array( + Vectors.dense(1.0, 2.0, 3.0), + Vectors.dense(1.0, 3.0, 4.0) +)) --- End diff -- ditto --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1407#discussion_r14926012 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala --- @@ -61,6 +61,30 @@ class KMeansSuite extends FunSuite with LocalSparkContext { assert(model.clusterCenters.head === center) } + test(no distinct points) { +val data = sc.parallelize(Array( + Vectors.dense(1.0, 2.0, 3.0), + Vectors.dense(1.0, 2.0, 3.0), + Vectors.dense(1.0, 2.0, 3.0) +)) --- End diff -- add `, 2` to test two partitions --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1407#issuecomment-49012803 @jkbradley The fix looks good to me except some minor style issues. Thanks for fixing it! Btw, please add `[MLLIB]` to the title so this is easy to find. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1407#issuecomment-49098158 QA results for PR 1407:br- This patch PASSES unit tests.br- This patch merges cleanlybr- This patch adds no public classesbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16689/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...
GitHub user jkbradley opened a pull request: https://github.com/apache/spark/pull/1407 SPARK-1215: Clustering: Index out of bounds error Bug fix for JIRA SPARK 1215: Clustering: Index out of bounds error https://issues.apache.org/jira/browse/SPARK-1215 Solution: Print warning, and use duplicate cluster centers so that exactly k centers are returned. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jkbradley/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1407.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1407 commit 97f2104bac2ab864c2a03f9a12a4b936557ae6d6 Author: Joseph Bradley joseph.kurata.brad...@gmail.com Date: 2014-05-20T01:35:53Z added RDD::stratifiedSample method and associated unit tests in RDDSuite. Method is built off of RDD::takeSample method. commit 91e83338820158b96cda492668dbed5fff33f19b Author: Joseph Bradley joseph.kurata.brad...@gmail.com Date: 2014-05-20T01:36:30Z added RDD::stratifiedSample method documentation commit d6f8913b7e370a82138b9c623754b32a59c21cf6 Author: Joseph Bradley joseph.kurata.brad...@gmail.com Date: 2014-05-21T04:58:12Z updated stratifiedSample to be more scalable, keeping data in RDDs instead of collecting to the driver commit 21eead6a412508b536358f4e557e2fab23c9c696 Author: Joseph Bradley joseph.kurata.brad...@gmail.com Date: 2014-05-23T19:51:07Z updated stratifiedSample to use selection-rejection to select samples on each partition in 1 pass, rather than pre-selecting indices commit 91f4b19702bc58a77d28316674eace881a81165f Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com Date: 2014-07-09T21:37:25Z merging with new spark commit c0cb5f0d8c6104e3eb6cfa44820ba00b81bc7262 Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com Date: 2014-07-11T18:43:17Z merging with updated spark commit 7d1b812a720cffdefe78ddb6e641930e7ae4975b Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com Date: 2014-07-11T18:47:41Z removed my coding test updates commit 18e5c8ad740871be92c6d7b73f5d35e25641a734 Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com Date: 2014-07-12T01:12:44Z Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle case with fewer distinct data points than clusters k. Added two related unit tests to KMeansSuite. commit e2bf638c6b3e8cc9cec3362caddb2305109d4c0a Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com Date: 2014-07-14T17:52:33Z Merge remote-tracking branch 'upstream/master' --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1407#issuecomment-48934486 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1407#issuecomment-48948316 Jenkins, add to whitelist and test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1407#issuecomment-48949060 QA tests have started for PR 1407. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16633/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1407#issuecomment-48960591 QA results for PR 1407:br- This patch PASSES unit tests.br- This patch merges cleanlybr- This patch adds no public classesbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16633/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---