[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1407#issuecomment-49130099
  
QA results for PR 1407:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16712/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-15 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1407#discussion_r14925960
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LocalKMeans.scala ---
@@ -59,6 +59,11 @@ private[mllib] object LocalKMeans extends Logging {
         cumulativeScore += weights(j) * KMeans.pointCost(curCenters, points(j))
         j += 1
       }
+      if (j == 0) {
+        logWarning("kMeansPlusPlus initialization ran out of distinct points for centers." +
+          s" Using duplicate point for center k = $i.")
+        j = 1
--- End diff --

The code may be clearer if written in this way

~~~
centers(i) =
  if (j == 0) {
    logWarning(...)
    points(0).toDense
  } else {
    points(j - 1).toDense
  }
~~~

or 

~~~
if (j == 0) {
  logWarning(...)
  centers(i) = points(0).toDense
} else {
  centers(i) = points(j - 1).toDense
}
~~~




[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-15 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1407#discussion_r14926024
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -61,6 +61,30 @@ class KMeansSuite extends FunSuite with LocalSparkContext {
     assert(model.clusterCenters.head === center)
   }
 
+  test("no distinct points") {
+    val data = sc.parallelize(Array(
+      Vectors.dense(1.0, 2.0, 3.0),
+      Vectors.dense(1.0, 2.0, 3.0),
+      Vectors.dense(1.0, 2.0, 3.0)
+    ))
+    val center = Vectors.dense(1.0, 2.0, 3.0)
+
+    // Make sure code runs.
+    var model = KMeans.train(data, k=2, maxIterations=1)
+    assert(model.clusterCenters.size === 2)
+  }
+
+  test("more clusters than points") {
+    val data = sc.parallelize(Array(
+      Vectors.dense(1.0, 2.0, 3.0),
+      Vectors.dense(1.0, 3.0, 4.0)
+    ))
--- End diff --

Ditto: add `, 2` here as well to test two partitions.




[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-15 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1407#discussion_r14926012
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -61,6 +61,30 @@ class KMeansSuite extends FunSuite with LocalSparkContext {
     assert(model.clusterCenters.head === center)
   }
 
+  test("no distinct points") {
+    val data = sc.parallelize(Array(
+      Vectors.dense(1.0, 2.0, 3.0),
+      Vectors.dense(1.0, 2.0, 3.0),
+      Vectors.dense(1.0, 2.0, 3.0)
+    ))
--- End diff --

add `, 2` to test two partitions




[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-15 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1407#issuecomment-49012803
  
@jkbradley The fix looks good to me except for some minor style issues. Thanks
for fixing it! Btw, please add `[MLLIB]` to the title so this is easy to find.




[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1407#issuecomment-49098158
  
QA results for PR 1407:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16689/consoleFull




[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-14 Thread jkbradley
GitHub user jkbradley opened a pull request:

https://github.com/apache/spark/pull/1407

SPARK-1215: Clustering: Index out of bounds error

Bug fix for JIRA SPARK-1215: Clustering: Index out of bounds error

 https://issues.apache.org/jira/browse/SPARK-1215

Solution: Print warning, and use duplicate cluster centers so that exactly 
k centers are returned.
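
The fallback described above can be sketched in plain Scala, without Spark. This is only an illustration, not the code in the patch: the object name `DuplicateCenterSketch`, the `initCenters` helper, the use of `Vector[Double]` for points, and the uniform (unweighted) costs are all assumptions made to keep the example self-contained. It selects each new center with probability proportional to its squared distance from the centers chosen so far, and reuses an existing point when no distinct point is left.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical, Spark-free sketch of k-means++ style seeding with a
// duplicate-center fallback. Names and types here are illustrative only.
object DuplicateCenterSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Cost of a point: squared distance to its nearest already-chosen center.
  def pointCost(centers: Seq[Point], p: Point): Double =
    centers.map(sqDist(_, p)).min

  def initCenters(points: IndexedSeq[Point], k: Int, rand: Random): IndexedSeq[Point] = {
    require(points.nonEmpty && k > 0, "need at least one point and k > 0")
    // First center: a uniformly random point.
    val centers = ArrayBuffer(points(rand.nextInt(points.length)))
    for (i <- 1 until k) {
      val current = centers.toSeq
      val costs = points.map(p => pointCost(current, p))
      val r = rand.nextDouble() * costs.sum
      var cumulativeScore = 0.0
      var j = 0
      while (j < points.length && cumulativeScore < r) {
        cumulativeScore += costs(j)
        j += 1
      }
      if (j == 0) {
        // The loop never advanced because r is 0, which is what happens when
        // every point already coincides with a chosen center. Indexing
        // points(j - 1) here would be out of bounds, so reuse points(0).
        Console.err.println(
          s"Ran out of distinct points for centers. Using duplicate point for center k = $i.")
        centers += points(0)
      } else {
        centers += points(j - 1)
      }
    }
    centers.toIndexedSeq
  }
}
```

With three identical input points and `k = 2`, this sketch returns two identical centers rather than failing, which mirrors the behavior the pull request describes: exactly k centers are returned, some of them duplicates.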

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jkbradley/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1407.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1407


commit 97f2104bac2ab864c2a03f9a12a4b936557ae6d6
Author: Joseph Bradley joseph.kurata.brad...@gmail.com
Date:   2014-05-20T01:35:53Z

added RDD::stratifiedSample method and associated unit tests in RDDSuite.  
Method is built off of RDD::takeSample method.

commit 91e83338820158b96cda492668dbed5fff33f19b
Author: Joseph Bradley joseph.kurata.brad...@gmail.com
Date:   2014-05-20T01:36:30Z

added RDD::stratifiedSample method documentation

commit d6f8913b7e370a82138b9c623754b32a59c21cf6
Author: Joseph Bradley joseph.kurata.brad...@gmail.com
Date:   2014-05-21T04:58:12Z

updated stratifiedSample to be more scalable, keeping data in RDDs instead 
of collecting to the driver

commit 21eead6a412508b536358f4e557e2fab23c9c696
Author: Joseph Bradley joseph.kurata.brad...@gmail.com
Date:   2014-05-23T19:51:07Z

updated stratifiedSample to use selection-rejection to select samples on 
each partition in 1 pass, rather than pre-selecting indices

commit 91f4b19702bc58a77d28316674eace881a81165f
Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com
Date:   2014-07-09T21:37:25Z

merging with new spark

commit c0cb5f0d8c6104e3eb6cfa44820ba00b81bc7262
Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com
Date:   2014-07-11T18:43:17Z

merging with updated spark

commit 7d1b812a720cffdefe78ddb6e641930e7ae4975b
Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com
Date:   2014-07-11T18:47:41Z

removed my coding test updates

commit 18e5c8ad740871be92c6d7b73f5d35e25641a734
Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com
Date:   2014-07-12T01:12:44Z

Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle 
case with fewer distinct data points than clusters k.  Added two related unit 
tests to KMeansSuite.

commit e2bf638c6b3e8cc9cec3362caddb2305109d4c0a
Author: Joseph K. Bradley joseph.kurata.brad...@gmail.com
Date:   2014-07-14T17:52:33Z

Merge remote-tracking branch 'upstream/master'






[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1407#issuecomment-48934486
  
Can one of the admins verify this patch?




[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-14 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1407#issuecomment-48948316
  
Jenkins, add to whitelist and test this please




[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1407#issuecomment-48949060
  
QA tests have started for PR 1407. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16633/consoleFull




[GitHub] spark pull request: SPARK-1215: Clustering: Index out of bounds er...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1407#issuecomment-48960591
  
QA results for PR 1407:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16633/consoleFull

