[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15450


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-25 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84957082
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -257,11 +246,6 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 model = KMeans.train(rdd, k = 5, maxIterations = 10)
 assert(model.clusterCenters.sortBy(VectorWithCompare(_))
   .zip(points.sortBy(VectorWithCompare(_))).forall(x => x._1 ~== 
(x._2) absTol 1E-5))
-
-// Neither should more runs
--- End diff --

Let's fix the other comments that reference runs, e.g. "No matter how many 
iterations or runs we use".





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-25 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84923158
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -64,30 +66,55 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(model.clusterCenters.head ~== center absTol 1E-5)
   }
 
-  test("no distinct points") {
+  test("fewer distinct points than clusters") {
 val data = sc.parallelize(
   Array(
 Vectors.dense(1.0, 2.0, 3.0),
 Vectors.dense(1.0, 2.0, 3.0),
 Vectors.dense(1.0, 2.0, 3.0)),
   2)
-val center = Vectors.dense(1.0, 2.0, 3.0)
 
-// Make sure code runs.
-var model = KMeans.train(data, k = 2, maxIterations = 1)
-assert(model.clusterCenters.size === 2)
-  }
+var model = KMeans.train(data, k = 2, maxIterations = 1, 
initializationMode = "random")
+assert(model.clusterCenters.length === 1)
 
-  test("more clusters than points") {
-val data = sc.parallelize(
-  Array(
-Vectors.dense(1.0, 2.0, 3.0),
-Vectors.dense(1.0, 3.0, 4.0)),
-  2)
+model = KMeans.train(data, k = 2, maxIterations = 1, 
initializationMode = "k-means||")
+assert(model.clusterCenters.length === 1)
+  }
 
-// Make sure code runs.
-var model = KMeans.train(data, k = 3, maxIterations = 1)
-assert(model.clusterCenters.size === 3)
+test("unique cluster centers") {
+val rng = new Random(seed)
+val numDistinctPoints = 10
+val points = (0 until numDistinctPoints).map(i => 
Vectors.dense(Array.fill(3)(rng.nextDouble)))
+val data = sc.parallelize(points.flatMap(Array.fill(1 + 
rng.nextInt(3))(_)), 2)
+val normedData = data.map(new VectorWithNorm(_))
+
+// less centers than k
+val km = new KMeans().setK(50)
+  .setMaxIterations(5)
+  .setInitializationMode("k-means||")
+  .setInitializationSteps(10)
+  .setSeed(seed)
+val initialCenters = km.initKMeansParallel(normedData).map(_.vector)
+assert(initialCenters.length === initialCenters.distinct.length)
+assert(initialCenters.length <= numDistinctPoints)
+
+val model = km.run(data)
+val finalCenters = model.clusterCenters
+assert(finalCenters.length === finalCenters.distinct.length)
+
+// run local k-means
+val km2 = new KMeans().setK(10)
+  .setMaxIterations(5)
+  .setInitializationMode("k-means||")
+  .setInitializationSteps(10)
+  .setSeed(seed)
+val initialCenters2 = km2.initKMeansParallel(normedData).map(_.vector)
+assert(initialCenters2.length === initialCenters2.distinct.length)
+assert(initialCenters2.length === 10)
--- End diff --

minor/nit: maybe make `k` a val here and use that instead. Since 10 is also used 
for something else above, the intent could get obscured in the future.
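The nit above can be sketched in plain Scala (a hypothetical stand-in, not the Spark test itself; `initialCenters2` here is fabricated data, not the output of `initKMeansParallel`): binding `k` once means the assertion tracks the configured value instead of repeating the literal 10.

```scala
// Hypothetical sketch: bind k once so the assertion follows the setting.
val k = 10
// Stand-in for initKMeansParallel's output: k distinct initial centers.
val initialCenters2 = (0 until k).map(i => Vector(i.toDouble, i.toDouble))
assert(initialCenters2.length == initialCenters2.distinct.length)
assert(initialCenters2.length == k) // previously `=== 10`, ambiguous with the other 10
```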





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-25 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84924203
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -64,30 +66,55 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(model.clusterCenters.head ~== center absTol 1E-5)
   }
 
-  test("no distinct points") {
+  test("fewer distinct points than clusters") {
 val data = sc.parallelize(
   Array(
 Vectors.dense(1.0, 2.0, 3.0),
 Vectors.dense(1.0, 2.0, 3.0),
 Vectors.dense(1.0, 2.0, 3.0)),
   2)
-val center = Vectors.dense(1.0, 2.0, 3.0)
 
-// Make sure code runs.
-var model = KMeans.train(data, k = 2, maxIterations = 1)
-assert(model.clusterCenters.size === 2)
-  }
+var model = KMeans.train(data, k = 2, maxIterations = 1, 
initializationMode = "random")
+assert(model.clusterCenters.length === 1)
 
-  test("more clusters than points") {
-val data = sc.parallelize(
-  Array(
-Vectors.dense(1.0, 2.0, 3.0),
-Vectors.dense(1.0, 3.0, 4.0)),
-  2)
+model = KMeans.train(data, k = 2, maxIterations = 1, 
initializationMode = "k-means||")
+assert(model.clusterCenters.length === 1)
+  }
 
-// Make sure code runs.
-var model = KMeans.train(data, k = 3, maxIterations = 1)
-assert(model.clusterCenters.size === 3)
+test("unique cluster centers") {
+val rng = new Random(seed)
+val numDistinctPoints = 10
+val points = (0 until numDistinctPoints).map(i => 
Vectors.dense(Array.fill(3)(rng.nextDouble)))
+val data = sc.parallelize(points.flatMap(Array.fill(1 + 
rng.nextInt(3))(_)), 2)
+val normedData = data.map(new VectorWithNorm(_))
+
+// less centers than k
--- End diff --

Can we also test the "random" method here? I was going to suggest putting 
the test cases inside `Seq("k-means||", "random").foreach { initMode =>`, but 
we also run the parallel-specific case. Maybe just add it manually?
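The suggested pattern can be sketched in plain Scala (a stand-in, not the Spark suite; `trainCenters` is a fabricated substitute for `KMeans.train(...).clusterCenters` that only models the dedup behavior): the same assertions run once per initialization mode.

```scala
// Hypothetical stand-in for KMeans.train(...).clusterCenters:
// duplicate points collapse to a single center regardless of init mode.
def trainCenters(points: Seq[Seq[Double]], k: Int, initMode: String): Seq[Seq[Double]] =
  points.distinct.take(k)

val points = Seq(Seq(1.0, 2.0, 3.0), Seq(1.0, 2.0, 3.0), Seq(1.0, 2.0, 3.0))
// Run the identical test body for both initialization modes.
for (initMode <- Seq("k-means||", "random")) {
  val centers = trainCenters(points, k = 2, initMode)
  assert(centers.length == 1, s"initMode=$initMode should collapse duplicates")
}
```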





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-25 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84924052
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -64,30 +66,55 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(model.clusterCenters.head ~== center absTol 1E-5)
   }
 
-  test("no distinct points") {
+  test("fewer distinct points than clusters") {
 val data = sc.parallelize(
   Array(
 Vectors.dense(1.0, 2.0, 3.0),
 Vectors.dense(1.0, 2.0, 3.0),
 Vectors.dense(1.0, 2.0, 3.0)),
   2)
-val center = Vectors.dense(1.0, 2.0, 3.0)
 
-// Make sure code runs.
-var model = KMeans.train(data, k = 2, maxIterations = 1)
-assert(model.clusterCenters.size === 2)
-  }
+var model = KMeans.train(data, k = 2, maxIterations = 1, 
initializationMode = "random")
+assert(model.clusterCenters.length === 1)
 
-  test("more clusters than points") {
-val data = sc.parallelize(
-  Array(
-Vectors.dense(1.0, 2.0, 3.0),
-Vectors.dense(1.0, 3.0, 4.0)),
-  2)
+model = KMeans.train(data, k = 2, maxIterations = 1, 
initializationMode = "k-means||")
+assert(model.clusterCenters.length === 1)
+  }
 
-// Make sure code runs.
-var model = KMeans.train(data, k = 3, maxIterations = 1)
-assert(model.clusterCenters.size === 3)
+test("unique cluster centers") {
--- End diff --

nit: indentation





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-25 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84899743
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -85,9 +101,50 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 Vectors.dense(1.0, 3.0, 4.0)),
   2)
 
-// Make sure code runs.
-var model = KMeans.train(data, k = 3, maxIterations = 1)
-assert(model.clusterCenters.size === 3)
+var model = KMeans.train(data, k = 3, maxIterations = 1, 
initializationMode = "random")
+assert(model.clusterCenters.length === 2)
+
+model = KMeans.train(data, k = 3, maxIterations = 1, 
initializationMode = "k-means||")
+assert(model.clusterCenters.length === 2)
+  }
+
+  test("unique cluster centers") {
+val rng = new scala.util.Random(42)
+val points = (0 until 10).map(i => 
Vectors.dense(Array.fill(3)(rng.nextDouble)))
+val data = sc.parallelize(
+  points.flatMap { point =>
+Array.fill(rng.nextInt(4))(point)
+  }, 2
+)
+val norms = data.map(Vectors.norm(_, 2.0))
+val zippedData = data.zip(norms).map { case (v, norm) =>
+  new VectorWithNorm(v, norm)
+}
+// less centers than k
+val km = new KMeans().setK(50)
+  .setMaxIterations(10)
+  .setInitializationMode("k-means||")
+  .setInitializationSteps(10)
+  .setSeed(42)
+val initialCenters = km.initKMeansParallel(zippedData).map(_.vector)
+assert(initialCenters.length === initialCenters.distinct.length)
+
+val model = km.run(data)
+val finalCenters = model.clusterCenters
+assert(finalCenters.length === finalCenters.distinct.length)
+
+// run local k-means
+val km2 = new KMeans().setK(10)
+  .setMaxIterations(10)
+  .setInitializationMode("k-means||")
+  .setInitializationSteps(10)
+  .setSeed(42)
+val initialCenters2 = km2.initKMeansParallel(zippedData).map(_.vector)
+assert(initialCenters2.length === initialCenters2.distinct.length)
--- End diff --

This condition failed, though it should be OK. The problem was that the 
data setup maps each of the 10 distinct points to 0-3 copies, meaning that there 
may be fewer than 10 distinct points in the end. I just made that 1-3 copies and 
it works.

Fixed the doc problem too, thanks.
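The fix described above can be checked in plain Scala (a sketch of the data setup only, outside Spark): `1 + rng.nextInt(3)` yields 1 to 3 copies, so every distinct point survives into the duplicated data, whereas `rng.nextInt(4)` can yield 0 copies and drop a point entirely.

```scala
import scala.util.Random

val rng = new Random(42)
// 10 distinct random 3-dimensional points.
val points = (0 until 10).map(_ => Seq.fill(3)(rng.nextDouble()))
// 1 + nextInt(3) duplicates each point 1-3 times, never 0 times.
val data = points.flatMap(p => Seq.fill(1 + rng.nextInt(3))(p))
assert(data.distinct.length == points.distinct.length) // no point lost
```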





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-24 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84776495
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -323,7 +324,10 @@ class KMeans private (
* Initialize a set of cluster centers at random.
*/
   private def initRandom(data: RDD[VectorWithNorm]): Array[VectorWithNorm] 
= {
-data.takeSample(true, k, new 
XORShiftRandom(this.seed).nextInt()).map(_.toDense)
+// Select without replacement; may still produce duplicates if the 
data has < k distinct
+// points, so deduplicate the centroids to match the behavior of 
k-means|| in the same situation
+data.takeSample(false, k, new XORShiftRandom(this.seed).nextInt()).
+  map(_.vector).distinct.map(new VectorWithNorm(_))
--- End diff --

OK, I think you're probably right about it being negligible.





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-24 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84767540
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -85,9 +101,50 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 Vectors.dense(1.0, 3.0, 4.0)),
   2)
 
-// Make sure code runs.
-var model = KMeans.train(data, k = 3, maxIterations = 1)
-assert(model.clusterCenters.size === 3)
+var model = KMeans.train(data, k = 3, maxIterations = 1, 
initializationMode = "random")
+assert(model.clusterCenters.length === 2)
+
+model = KMeans.train(data, k = 3, maxIterations = 1, 
initializationMode = "k-means||")
+assert(model.clusterCenters.length === 2)
+  }
+
+  test("unique cluster centers") {
+val rng = new scala.util.Random(42)
+val points = (0 until 10).map(i => 
Vectors.dense(Array.fill(3)(rng.nextDouble)))
+val data = sc.parallelize(
+  points.flatMap { point =>
+Array.fill(rng.nextInt(4))(point)
+  }, 2
+)
+val norms = data.map(Vectors.norm(_, 2.0))
+val zippedData = data.zip(norms).map { case (v, norm) =>
--- End diff --

Just make this `val normedData = data.map(new VectorWithNorm(_))`. The 
zipping was done for performance (not necessary here); I just copied the 
code originally.





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-24 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84720821
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -85,9 +101,50 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 Vectors.dense(1.0, 3.0, 4.0)),
   2)
 
-// Make sure code runs.
-var model = KMeans.train(data, k = 3, maxIterations = 1)
-assert(model.clusterCenters.size === 3)
+var model = KMeans.train(data, k = 3, maxIterations = 1, 
initializationMode = "random")
+assert(model.clusterCenters.length === 2)
+
+model = KMeans.train(data, k = 3, maxIterations = 1, 
initializationMode = "k-means||")
+assert(model.clusterCenters.length === 2)
+  }
+
+  test("unique cluster centers") {
+val rng = new scala.util.Random(42)
--- End diff --

Make `seed` a val and use it instead of the literal 42.





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-24 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84716994
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -48,7 +48,11 @@ private[clustering] trait KMeansParams extends Params 
with HasMaxIter with HasFe
   final val k = new IntParam(this, "k", "The number of clusters to create. 
" +
 "Must be > 1.", ParamValidators.gt(1))
 
-  /** @group getParam */
+  /**
+   * Number of clusters to create (k). Note that it is possible for fewer 
than k clusters to
--- End diff --

It seems the typical convention for this type of documentation is to attach 
it to the definition of the param, instead of in both the getter and setter. 
That way it only appears once. 





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-24 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84776084
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -85,9 +101,50 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 Vectors.dense(1.0, 3.0, 4.0)),
   2)
 
-// Make sure code runs.
-var model = KMeans.train(data, k = 3, maxIterations = 1)
-assert(model.clusterCenters.size === 3)
+var model = KMeans.train(data, k = 3, maxIterations = 1, 
initializationMode = "random")
+assert(model.clusterCenters.length === 2)
+
+model = KMeans.train(data, k = 3, maxIterations = 1, 
initializationMode = "k-means||")
+assert(model.clusterCenters.length === 2)
+  }
+
+  test("unique cluster centers") {
+val rng = new scala.util.Random(42)
+val points = (0 until 10).map(i => 
Vectors.dense(Array.fill(3)(rng.nextDouble)))
+val data = sc.parallelize(
+  points.flatMap { point =>
+Array.fill(rng.nextInt(4))(point)
+  }, 2
+)
+val norms = data.map(Vectors.norm(_, 2.0))
+val zippedData = data.zip(norms).map { case (v, norm) =>
+  new VectorWithNorm(v, norm)
+}
+// less centers than k
+val km = new KMeans().setK(50)
+  .setMaxIterations(10)
+  .setInitializationMode("k-means||")
+  .setInitializationSteps(10)
+  .setSeed(42)
+val initialCenters = km.initKMeansParallel(zippedData).map(_.vector)
+assert(initialCenters.length === initialCenters.distinct.length)
--- End diff --

Also check `initialCenters.length <= numDistinctPoints`, where 
`numDistinctPoints` is currently 10 (defined above).





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-24 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84721497
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -85,9 +101,50 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 Vectors.dense(1.0, 3.0, 4.0)),
   2)
 
-// Make sure code runs.
-var model = KMeans.train(data, k = 3, maxIterations = 1)
-assert(model.clusterCenters.size === 3)
+var model = KMeans.train(data, k = 3, maxIterations = 1, 
initializationMode = "random")
+assert(model.clusterCenters.length === 2)
+
+model = KMeans.train(data, k = 3, maxIterations = 1, 
initializationMode = "k-means||")
+assert(model.clusterCenters.length === 2)
+  }
+
+  test("unique cluster centers") {
+val rng = new scala.util.Random(42)
+val points = (0 until 10).map(i => 
Vectors.dense(Array.fill(3)(rng.nextDouble)))
+val data = sc.parallelize(
+  points.flatMap { point =>
+Array.fill(rng.nextInt(4))(point)
+  }, 2
+)
+val norms = data.map(Vectors.norm(_, 2.0))
+val zippedData = data.zip(norms).map { case (v, norm) =>
+  new VectorWithNorm(v, norm)
+}
+// less centers than k
+val km = new KMeans().setK(50)
+  .setMaxIterations(10)
--- End diff --

Maybe use 5 or fewer for max iterations.





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-24 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84776151
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -85,9 +101,50 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 Vectors.dense(1.0, 3.0, 4.0)),
   2)
 
-// Make sure code runs.
-var model = KMeans.train(data, k = 3, maxIterations = 1)
-assert(model.clusterCenters.size === 3)
+var model = KMeans.train(data, k = 3, maxIterations = 1, 
initializationMode = "random")
+assert(model.clusterCenters.length === 2)
+
+model = KMeans.train(data, k = 3, maxIterations = 1, 
initializationMode = "k-means||")
+assert(model.clusterCenters.length === 2)
+  }
+
+  test("unique cluster centers") {
+val rng = new scala.util.Random(42)
+val points = (0 until 10).map(i => 
Vectors.dense(Array.fill(3)(rng.nextDouble)))
+val data = sc.parallelize(
+  points.flatMap { point =>
+Array.fill(rng.nextInt(4))(point)
+  }, 2
+)
+val norms = data.map(Vectors.norm(_, 2.0))
+val zippedData = data.zip(norms).map { case (v, norm) =>
+  new VectorWithNorm(v, norm)
+}
+// less centers than k
+val km = new KMeans().setK(50)
+  .setMaxIterations(10)
+  .setInitializationMode("k-means||")
+  .setInitializationSteps(10)
+  .setSeed(42)
+val initialCenters = km.initKMeansParallel(zippedData).map(_.vector)
+assert(initialCenters.length === initialCenters.distinct.length)
+
+val model = km.run(data)
+val finalCenters = model.clusterCenters
+assert(finalCenters.length === finalCenters.distinct.length)
+
+// run local k-means
+val km2 = new KMeans().setK(10)
+  .setMaxIterations(10)
+  .setInitializationMode("k-means||")
+  .setInitializationSteps(10)
+  .setSeed(42)
+val initialCenters2 = km2.initKMeansParallel(zippedData).map(_.vector)
+assert(initialCenters2.length === initialCenters2.distinct.length)
--- End diff --

also check `initialCenters2.length === 2`





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-24 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84775879
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -64,18 +64,34 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(model.clusterCenters.head ~== center absTol 1E-5)
   }
 
-  test("no distinct points") {
+  test("fewer distinct points than clusters") {
 val data = sc.parallelize(
   Array(
 Vectors.dense(1.0, 2.0, 3.0),
 Vectors.dense(1.0, 2.0, 3.0),
 Vectors.dense(1.0, 2.0, 3.0)),
   2)
-val center = Vectors.dense(1.0, 2.0, 3.0)
 
-// Make sure code runs.
-var model = KMeans.train(data, k = 2, maxIterations = 1)
-assert(model.clusterCenters.size === 2)
+var model = KMeans.train(data, k = 2, maxIterations = 1, 
initializationMode = "random")
+assert(model.clusterCenters.length === 1)
+
+model = KMeans.train(data, k = 2, maxIterations = 1, 
initializationMode = "k-means||")
+assert(model.clusterCenters.length === 1)
+  }
+
+
+  test("fewer clusters than points") {
--- End diff --

I'd prefer to remove these two tests, since we've added a more thorough test 
below. We can check the "random" init method in that test as well, and then 
eliminate these two.





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-24 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84776242
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -85,9 +101,50 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 Vectors.dense(1.0, 3.0, 4.0)),
   2)
 
-// Make sure code runs.
-var model = KMeans.train(data, k = 3, maxIterations = 1)
-assert(model.clusterCenters.size === 3)
+var model = KMeans.train(data, k = 3, maxIterations = 1, 
initializationMode = "random")
+assert(model.clusterCenters.length === 2)
+
+model = KMeans.train(data, k = 3, maxIterations = 1, 
initializationMode = "k-means||")
+assert(model.clusterCenters.length === 2)
+  }
+
+  test("unique cluster centers") {
+val rng = new scala.util.Random(42)
+val points = (0 until 10).map(i => 
Vectors.dense(Array.fill(3)(rng.nextDouble)))
+val data = sc.parallelize(
+  points.flatMap { point =>
+Array.fill(rng.nextInt(4))(point)
+  }, 2
+)
+val norms = data.map(Vectors.norm(_, 2.0))
+val zippedData = data.zip(norms).map { case (v, norm) =>
+  new VectorWithNorm(v, norm)
+}
+// less centers than k
+val km = new KMeans().setK(50)
+  .setMaxIterations(10)
+  .setInitializationMode("k-means||")
+  .setInitializationSteps(10)
+  .setSeed(42)
+val initialCenters = km.initKMeansParallel(zippedData).map(_.vector)
+assert(initialCenters.length === initialCenters.distinct.length)
+
+val model = km.run(data)
+val finalCenters = model.clusterCenters
+assert(finalCenters.length === finalCenters.distinct.length)
+
+// run local k-means
+val km2 = new KMeans().setK(10)
+  .setMaxIterations(10)
+  .setInitializationMode("k-means||")
+  .setInitializationSteps(10)
+  .setSeed(42)
+val initialCenters2 = km2.initKMeansParallel(zippedData).map(_.vector)
+assert(initialCenters2.length === initialCenters2.distinct.length)
+
+val model2 = km.run(data)
--- End diff --

should be `val model2 = km2.run(data)`





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84455681
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -378,10 +382,10 @@ class KMeans private (
 costs.unpersist(blocking = false)
 bcNewCentersList.foreach(_.destroy(false))
 
-if (centers.size == k) {
+if (centers.size <= k) {
--- End diff --

That's right, although guaranteeing distinct centers isn't a goal; it's a 
nice-to-have where possible. My goal is improving the straightforward cases 
while maintaining as much consistency as possible. I agree with adding a test 
like this and adding a call to `.distinct`. Might as well take this to its 
logical conclusion.

It also raises an interesting question: if you have >= k distinct points, 
and happen to pick < k distinct centroids, should you go back and replenish the 
set of centroids? I am punting on that right now, but it's a legitimate point. 
It's quite a corner case, though.
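The dedup behavior under discussion can be sketched in plain Scala (a stand-in for `RDD.takeSample(false, k, seed)` followed by `.distinct`, not Spark's actual `initRandom`): sampling without replacement over *rows* can still return duplicate *values* when the data contains repeated points, so the sampled centers are deduplicated afterwards.

```scala
import scala.util.Random

// Hypothetical stand-in: sample k rows without replacement, then
// collapse value-level duplicates among the sampled centers.
def sampleDistinctCenters(data: Seq[Seq[Double]], k: Int, seed: Long): Seq[Seq[Double]] = {
  val rng = new Random(seed)
  rng.shuffle(data).take(k) // without replacement over rows
    .distinct               // rows may repeat as values; keep distinct centers
}

val data = Seq(Seq(1.0, 2.0), Seq(1.0, 2.0), Seq(1.0, 2.0), Seq(3.0, 4.0))
val centers = sampleDistinctCenters(data, k = 3, seed = 42L)
assert(centers.length <= data.distinct.length) // never more centers than distinct points
assert(centers.length == centers.distinct.length)
```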





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84247187
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -323,7 +324,10 @@ class KMeans private (
* Initialize a set of cluster centers at random.
*/
   private def initRandom(data: RDD[VectorWithNorm]): Array[VectorWithNorm] 
= {
-data.takeSample(true, k, new 
XORShiftRandom(this.seed).nextInt()).map(_.toDense)
+// Select without replacement; may still produce duplicates if the 
data has < k distinct
+// points, so deduplicate the centroids to match the behavior of 
k-means|| in the same situation
+data.takeSample(false, k, new XORShiftRandom(this.seed).nextInt()).
+  map(_.vector).distinct.map(new VectorWithNorm(_))
--- End diff --

I'll move the dot. Agree, though computing k norms once seems negligible in 
the context of the overall computation.
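The precomputed-norm idea behind this exchange can be sketched in plain Scala (a stand-in; MLlib's real `VectorWithNorm` wraps an MLlib `Vector`): caching the L2 norm alongside each vector avoids recomputing it in every distance calculation, so recomputing at most k norms after `.distinct` is indeed negligible next to the full clustering cost.

```scala
// Hypothetical stand-in for MLlib's VectorWithNorm: pair a vector
// with its precomputed L2 norm.
final case class VectorWithNorm(vector: Seq[Double], norm: Double)

def withNorm(v: Seq[Double]): VectorWithNorm =
  VectorWithNorm(v, math.sqrt(v.map(x => x * x).sum))

val v = withNorm(Seq(3.0, 4.0))
assert(v.norm == 5.0) // sqrt(9 + 16)
```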





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-20 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84247083
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -75,7 +75,7 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
 
     // Make sure code runs.
     var model = KMeans.train(data, k = 2, maxIterations = 1)
-    assert(model.clusterCenters.size === 2)
+    assert(model.clusterCenters.size === 1)
--- End diff --

I will add more tests, yes.





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-19 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84192775
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -75,7 +75,7 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
 
     // Make sure code runs.
--- End diff --

While we're here, can we make `var model = ...` into `val model = ...` and
delete L74 `val center = ...`, which is unused?





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-19 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84193444
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -75,7 +75,7 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
 
     // Make sure code runs.
     var model = KMeans.train(data, k = 2, maxIterations = 1)
-    assert(model.clusterCenters.size === 2)
+    assert(model.clusterCenters.size === 1)
--- End diff --

These tests don't cover `initRandom`. We should test both initialization
methods to be complete.





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-19 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15450#discussion_r84194269
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -323,7 +324,10 @@ class KMeans private (
    * Initialize a set of cluster centers at random.
    */
   private def initRandom(data: RDD[VectorWithNorm]): Array[VectorWithNorm] = {
-    data.takeSample(true, k, new XORShiftRandom(this.seed).nextInt()).map(_.toDense)
+    // Select without replacement; may still produce duplicates if the data has < k distinct
+    // points, so deduplicate the centroids to match the behavior of k-means|| in the same situation
+    data.takeSample(false, k, new XORShiftRandom(this.seed).nextInt()).
+      map(_.vector).distinct.map(new VectorWithNorm(_))
--- End diff --

I think we generally put the `.` on the next line, but I'm not sure it's a 
strict style requirement.

Also, we end up recomputing the norm of each sample. For large `k` this can 
be inefficient. I'm not sure of a non-ugly way to implement the distinct 
otherwise.
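One way to sidestep the recomputation is to deduplicate on the vector while keeping the whole wrapper, so the precomputed norm rides along. A minimal sketch with a simplified stand-in for `VectorWithNorm` (the case class below is an assumed simplification, not MLlib's actual type):

```scala
// Simplified stand-in for MLlib's VectorWithNorm; field names assumed here.
case class VectorWithNorm(vector: Seq[Double], norm: Double)

object DedupSketch {
  // Keep the first VectorWithNorm seen for each distinct vector, so the
  // precomputed norm is carried through instead of being recomputed.
  // LinkedHashMap preserves sample order while deduplicating by key.
  def dedup(sampled: Seq[VectorWithNorm]): Seq[VectorWithNorm] = {
    val seen = scala.collection.mutable.LinkedHashMap.empty[Seq[Double], VectorWithNorm]
    sampled.foreach(v => seen.getOrElseUpdate(v.vector, v))
    seen.values.toSeq
  }

  def main(args: Array[String]): Unit = {
    val a = VectorWithNorm(Seq(1.0, 0.0), 1.0)
    val b = VectorWithNorm(Seq(1.0, 0.0), 1.0) // duplicate vector
    val c = VectorWithNorm(Seq(0.0, 2.0), 2.0)
    val out = dedup(Seq(a, b, c))
    assert(out.size == 2)
    assert(out.map(_.norm) == Seq(1.0, 2.0)) // norms carried through, not recomputed
  }
}
```

Whether the extra map is less ugly than re-wrapping k vectors is debatable, which is the trade-off being discussed above.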





[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...

2016-10-12 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/15450

[SPARK-3261] [MLLIB] KMeans clusterer can return duplicate cluster centers

## What changes were proposed in this pull request?

Return potentially fewer than k cluster centers in cases where k distinct 
centroids aren't available or aren't selected.

## How was this patch tested?

Existing tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-3261

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15450.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15450


commit 42279b8e042aedaf54ae3900fc1b050d2a1dacef
Author: Sean Owen 
Date:   2016-10-12T12:30:02Z

Return potentially fewer than k cluster centers in cases where k distinct 
centroids aren't available or aren't selected
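In miniature, the proposed behavior looks like this (plain collections standing in for the RDD; `initRandom` here is a hypothetical simplification of the Spark method, not its real signature):

```scala
object FewerCentersSketch {
  // Miniature of the change: sample without replacement, then deduplicate,
  // so fewer than k centers come back when the data has < k distinct points.
  // A plain Seq stands in for the RDD (sketch only).
  def initRandom(data: Seq[Seq[Double]], k: Int, seed: Long): Seq[Seq[Double]] = {
    val rng = new scala.util.Random(seed)
    rng.shuffle(data).take(k).distinct
  }

  def main(args: Array[String]): Unit = {
    // Only one distinct point but k = 2: we get 1 center, not 2 duplicates.
    val data = Seq(Seq(1.0, 1.0), Seq(1.0, 1.0), Seq(1.0, 1.0))
    assert(initRandom(data, k = 2, seed = 0L).size == 1)
  }
}
```

Callers therefore cannot assume `model.clusterCenters.size == k` after this change, which is why the suite's assertions are updated.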



