[GitHub] spark pull request #15450: [SPARK-3261] [MLLIB] KMeans clusterer can return ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15450

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84957082

Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala

```diff
@@ -257,11 +246,6 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
     model = KMeans.train(rdd, k = 5, maxIterations = 10)
     assert(model.clusterCenters.sortBy(VectorWithCompare(_))
       .zip(points.sortBy(VectorWithCompare(_))).forall(x => x._1 ~== (x._2) absTol 1E-5))
-
-    // Neither should more runs
```

Comment: Let's fix the other comments that reference runs, e.g. "No matter how many iterations or runs we use".
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84923158

Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala

```diff
@@ -64,30 +66,55 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
     assert(model.clusterCenters.head ~== center absTol 1E-5)
   }

-  test("no distinct points") {
+  test("fewer distinct points than clusters") {
     val data = sc.parallelize(
       Array(
         Vectors.dense(1.0, 2.0, 3.0),
         Vectors.dense(1.0, 2.0, 3.0),
         Vectors.dense(1.0, 2.0, 3.0)),
       2)
-    val center = Vectors.dense(1.0, 2.0, 3.0)
-    // Make sure code runs.
-    var model = KMeans.train(data, k = 2, maxIterations = 1)
-    assert(model.clusterCenters.size === 2)
-  }
+    var model = KMeans.train(data, k = 2, maxIterations = 1, initializationMode = "random")
+    assert(model.clusterCenters.length === 1)

-  test("more clusters than points") {
-    val data = sc.parallelize(
-      Array(
-        Vectors.dense(1.0, 2.0, 3.0),
-        Vectors.dense(1.0, 3.0, 4.0)),
-      2)
+    model = KMeans.train(data, k = 2, maxIterations = 1, initializationMode = "k-means||")
+    assert(model.clusterCenters.length === 1)
+  }

-    // Make sure code runs.
-    var model = KMeans.train(data, k = 3, maxIterations = 1)
-    assert(model.clusterCenters.size === 3)
+  test("unique cluster centers") {
+    val rng = new Random(seed)
+    val numDistinctPoints = 10
+    val points = (0 until numDistinctPoints).map(i => Vectors.dense(Array.fill(3)(rng.nextDouble)))
+    val data = sc.parallelize(points.flatMap(Array.fill(1 + rng.nextInt(3))(_)), 2)
+    val normedData = data.map(new VectorWithNorm(_))
+
+    // less centers than k
+    val km = new KMeans().setK(50)
+      .setMaxIterations(5)
+      .setInitializationMode("k-means||")
+      .setInitializationSteps(10)
+      .setSeed(seed)
+    val initialCenters = km.initKMeansParallel(normedData).map(_.vector)
+    assert(initialCenters.length === initialCenters.distinct.length)
+    assert(initialCenters.length <= numDistinctPoints)
+
+    val model = km.run(data)
+    val finalCenters = model.clusterCenters
+    assert(finalCenters.length === finalCenters.distinct.length)
+
+    // run local k-means
+    val km2 = new KMeans().setK(10)
+      .setMaxIterations(5)
+      .setInitializationMode("k-means||")
+      .setInitializationSteps(10)
+      .setSeed(seed)
+    val initialCenters2 = km2.initKMeansParallel(normedData).map(_.vector)
+    assert(initialCenters2.length === initialCenters2.distinct.length)
+    assert(initialCenters2.length === 10)
```

Comment: minor/nit: maybe make `k` a val here and use that instead. Since we use 10 for something else above, this could become obscured in the future.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84924203

Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala (quoting the relevant lines of the @@ -64,30 +66,55 @@ hunk)

```diff
+    val normedData = data.map(new VectorWithNorm(_))
+
+    // less centers than k
```

Comment: Can we also test the "random" method here? I was going to suggest putting the test cases inside `Seq("k-means||", "random").foreach { initMode =>`, but we run the specific parallel case. Maybe just manually add it?
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84924052

Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala (quoting the relevant line of the @@ -64,30 +66,55 @@ hunk)

```diff
+test("unique cluster centers") {
```

Comment: nit: indentation
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84899743

Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala

```diff
@@ -85,9 +101,50 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
         Vectors.dense(1.0, 3.0, 4.0)),
       2)

-    // Make sure code runs.
-    var model = KMeans.train(data, k = 3, maxIterations = 1)
-    assert(model.clusterCenters.size === 3)
+    var model = KMeans.train(data, k = 3, maxIterations = 1, initializationMode = "random")
+    assert(model.clusterCenters.length === 2)
+
+    model = KMeans.train(data, k = 3, maxIterations = 1, initializationMode = "k-means||")
+    assert(model.clusterCenters.length === 2)
+  }
+
+  test("unique cluster centers") {
+    val rng = new scala.util.Random(42)
+    val points = (0 until 10).map(i => Vectors.dense(Array.fill(3)(rng.nextDouble)))
+    val data = sc.parallelize(
+      points.flatMap { point =>
+        Array.fill(rng.nextInt(4))(point)
+      }, 2
+    )
+    val norms = data.map(Vectors.norm(_, 2.0))
+    val zippedData = data.zip(norms).map { case (v, norm) =>
+      new VectorWithNorm(v, norm)
+    }
+    // less centers than k
+    val km = new KMeans().setK(50)
+      .setMaxIterations(10)
+      .setInitializationMode("k-means||")
+      .setInitializationSteps(10)
+      .setSeed(42)
+    val initialCenters = km.initKMeansParallel(zippedData).map(_.vector)
+    assert(initialCenters.length === initialCenters.distinct.length)
+
+    val model = km.run(data)
+    val finalCenters = model.clusterCenters
+    assert(finalCenters.length === finalCenters.distinct.length)
+
+    // run local k-means
+    val km2 = new KMeans().setK(10)
+      .setMaxIterations(10)
+      .setInitializationMode("k-means||")
+      .setInitializationSteps(10)
+      .setSeed(42)
+    val initialCenters2 = km2.initKMeansParallel(zippedData).map(_.vector)
+    assert(initialCenters2.length === initialCenters2.distinct.length)
```

Comment: This condition failed, though it should be OK. The problem was that the data setup maps each of 10 distinct points to 0-3 copies, meaning there may be fewer than 10 distinct points in the end. I just made that 1-3 copies and it works. Fixed the doc problem too, thanks.
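The copy-count fix described above can be sketched in plain Scala (a hypothetical standalone example, not Spark code; plain `Int`s stand in for the vectors): `rng.nextInt(4)` draws 0 to 3, so a point may receive zero copies and vanish from the data, while `1 + rng.nextInt(3)` draws 1 to 3, so every distinct point survives duplication.

```scala
import scala.util.Random

object CopyCountSketch {
  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    // 1 + rng.nextInt(3) yields 1, 2, or 3 copies, so each of the
    // 10 distinct "points" (modeled here as Ints) appears at least once.
    val duplicated = (0 until 10).flatMap(i => Seq.fill(1 + rng.nextInt(3))(i))
    assert(duplicated.distinct.length == 10)
    println(s"kept ${duplicated.distinct.length} distinct points")
  }
}
```

With `rng.nextInt(4)` instead, the same assertion can fail for an unlucky seed, which is exactly the flakiness the review comment describes.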
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84776495

Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

```diff
@@ -323,7 +324,10 @@ class KMeans private (
    * Initialize a set of cluster centers at random.
    */
   private def initRandom(data: RDD[VectorWithNorm]): Array[VectorWithNorm] = {
-    data.takeSample(true, k, new XORShiftRandom(this.seed).nextInt()).map(_.toDense)
+    // Select without replacement; may still produce duplicates if the data has < k distinct
+    // points, so deduplicate the centroids to match the behavior of k-means|| in the same situation
+    data.takeSample(false, k, new XORShiftRandom(this.seed).nextInt()).
+      map(_.vector).distinct.map(new VectorWithNorm(_))
```

Comment: Ok, I think you're probably right about it being negligible.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84767540

Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala (quoting the relevant lines of the @@ -85,9 +101,50 @@ hunk)

```diff
+    val norms = data.map(Vectors.norm(_, 2.0))
+    val zippedData = data.zip(norms).map { case (v, norm) =>
+      new VectorWithNorm(v, norm)
+    }
```

Comment: Just make this `val normedData = data.map(new VectorWithNorm(_))`. The zipping was done for performance (not necessary here), and I just copied the code originally.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84720821

Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala (quoting the relevant lines of the @@ -85,9 +101,50 @@ hunk)

```diff
+  test("unique cluster centers") {
+    val rng = new scala.util.Random(42)
```

Comment: make `seed` a val and use it instead of the literal 42.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84716994

Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala

```diff
@@ -48,7 +48,11 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe
   final val k = new IntParam(this, "k", "The number of clusters to create. " +
     "Must be > 1.", ParamValidators.gt(1))

-  /** @group getParam */
+  /**
+   * Number of clusters to create (k). Note that it is possible for fewer than k clusters to
```

Comment: It seems the typical convention for this type of documentation is to attach it to the definition of the param, instead of in both the getter and setter. That way it only appears once.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84776084

Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala (quoting the relevant lines of the @@ -85,9 +101,50 @@ hunk)

```diff
+    val initialCenters = km.initKMeansParallel(zippedData).map(_.vector)
+    assert(initialCenters.length === initialCenters.distinct.length)
```

Comment: also check `initialCenters.length <= numDistinctPoints`, where `numDistinctPoints` is 10 (defined above) at the moment.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84721497

Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala (quoting the relevant lines of the @@ -85,9 +101,50 @@ hunk)

```diff
+    // less centers than k
+    val km = new KMeans().setK(50)
+      .setMaxIterations(10)
```

Comment: maybe use 5 or less for max iterations.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84776151

Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala (quoting the relevant lines of the @@ -85,9 +101,50 @@ hunk)

```diff
+    val initialCenters2 = km2.initKMeansParallel(zippedData).map(_.vector)
+    assert(initialCenters2.length === initialCenters2.distinct.length)
```

Comment: also check `initialCenters2.length === 2`
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84775879

Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala

```diff
@@ -64,18 +64,34 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
     assert(model.clusterCenters.head ~== center absTol 1E-5)
   }

-  test("no distinct points") {
+  test("fewer distinct points than clusters") {
     val data = sc.parallelize(
       Array(
         Vectors.dense(1.0, 2.0, 3.0),
         Vectors.dense(1.0, 2.0, 3.0),
         Vectors.dense(1.0, 2.0, 3.0)),
       2)
-    val center = Vectors.dense(1.0, 2.0, 3.0)
-    // Make sure code runs.
-    var model = KMeans.train(data, k = 2, maxIterations = 1)
-    assert(model.clusterCenters.size === 2)
+    var model = KMeans.train(data, k = 2, maxIterations = 1, initializationMode = "random")
+    assert(model.clusterCenters.length === 1)
+
+    model = KMeans.train(data, k = 2, maxIterations = 1, initializationMode = "k-means||")
+    assert(model.clusterCenters.length === 1)
+  }
+
+  test("fewer clusters than points") {
```

Comment: I'd prefer to remove these two tests since we've added a more thorough test below. We can check the "random" init method in that test as well, and then we can eliminate these two.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84776242

Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala (quoting the relevant lines of the @@ -85,9 +101,50 @@ hunk)

```diff
+    val initialCenters2 = km2.initKMeansParallel(zippedData).map(_.vector)
+    assert(initialCenters2.length === initialCenters2.distinct.length)
+
+    val model2 = km.run(data)
```

Comment: should be `val model2 = km2.run(data)`
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84455681

Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

```diff
@@ -378,10 +382,10 @@ class KMeans private (
     costs.unpersist(blocking = false)
     bcNewCentersList.foreach(_.destroy(false))

-    if (centers.size == k) {
+    if (centers.size <= k) {
```

Comment: That's right, although it's not a goal to guarantee distinct centers, though that's a nice-to-have where possible. I think my goal is improving the straightforward cases and maintaining as much consistency as possible. I agree with adding a test like this and adding a call to `.distinct`. Might as well take this to a logical conclusion. It also raises an interesting question: if you have >= k distinct points and happen to pick < k distinct centroids, should you go back and replenish the set of centroids? I am punting on that right now, but it's a legitimate point. It's quite a corner case, though.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84247187

Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala (quoting the relevant lines of the @@ -323,7 +324,10 @@ hunk)

```diff
-    data.takeSample(true, k, new XORShiftRandom(this.seed).nextInt()).map(_.toDense)
+    data.takeSample(false, k, new XORShiftRandom(this.seed).nextInt()).
+      map(_.vector).distinct.map(new VectorWithNorm(_))
```

Comment: I'll move the dot. Agreed, though computing k norms once seems negligible in the context of the overall computation.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84247083

Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala

```diff
@@ -75,7 +75,7 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
     // Make sure code runs.
     var model = KMeans.train(data, k = 2, maxIterations = 1)
-    assert(model.clusterCenters.size === 2)
+    assert(model.clusterCenters.size === 1)
```

Comment: I will add more tests, yes.
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84192775

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -75,7 +75,7 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
     // Make sure code runs.
--- End diff --

While we're here, can we make `var model = ...` into `val model = ...` and delete L74 `val center = ...`, which is not used?
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84193444

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -75,7 +75,7 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
     // Make sure code runs.
     var model = KMeans.train(data, k = 2, maxIterations = 1)
-    assert(model.clusterCenters.size === 2)
+    assert(model.clusterCenters.size === 1)
--- End diff --

These tests don't cover `initRandom`. We should add both initialization methods to be complete.
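The property this extra coverage would assert — with fewer distinct points than k, random initialization without replacement plus dedup yields fewer than k centers — can be illustrated with a standalone sketch (no Spark dependency; `InitRandomSketch` and its method are hypothetical names for illustration, not Spark's API):

```scala
// Standalone sketch (hypothetical, not Spark's implementation) of the behavior
// under discussion: sample k points without replacement, then drop duplicate
// vectors, so the result can contain fewer than k centers.
object InitRandomSketch {
  def initRandom(data: Seq[Vector[Double]], k: Int, seed: Long): Array[Vector[Double]] = {
    val rng = new scala.util.Random(seed)
    // Shuffle stands in for sampling without replacement; distinct dedups vectors.
    rng.shuffle(data).take(k).distinct.toArray
  }
}
```

With three identical points and k = 2 this returns a single center, mirroring the `clusterCenters.size === 1` assertion in the diff above.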
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15450#discussion_r84194269

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -323,7 +324,10 @@ class KMeans private (
    * Initialize a set of cluster centers at random.
    */
   private def initRandom(data: RDD[VectorWithNorm]): Array[VectorWithNorm] = {
-    data.takeSample(true, k, new XORShiftRandom(this.seed).nextInt()).map(_.toDense)
+    // Select without replacement; may still produce duplicates if the data has < k distinct
+    // points, so deduplicate the centroids to match the behavior of k-means|| in the same situation
+    data.takeSample(false, k, new XORShiftRandom(this.seed).nextInt()).
+      map(_.vector).distinct.map(new VectorWithNorm(_))
--- End diff --

I think we generally put the `.` on the next line, but I'm not sure it's a strict style requirement. Also, we end up recomputing the norm of each sample. For large `k` this can be inefficient. I'm not sure of a non-ugly way to implement the distinct otherwise.
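One way to avoid the norm recomputation raised here would be to dedup on the vector while carrying the already-computed norm along. A rough standalone sketch (this `VectorWithNorm` is a simplified stand-in for Spark's private class, and `distinctByVector` is a hypothetical helper, not part of the PR):

```scala
// Rough sketch, not the PR's code: deduplicate sampled centers on their vectors
// while keeping the norms already paired with them, so no norm is recomputed.
// This VectorWithNorm is a simplified stand-in for Spark's private class.
class VectorWithNorm(val vector: Vector[Double], val norm: Double)

object DedupCenters {
  // Keep the first occurrence of each distinct vector; its existing norm rides along.
  def distinctByVector(samples: Seq[VectorWithNorm]): Array[VectorWithNorm] = {
    val seen = scala.collection.mutable.HashSet.empty[Vector[Double]]
    samples.filter(s => seen.add(s.vector)).toArray
  }
}
```

As the thread notes, though, recomputing at most k norms once is likely negligible next to the clustering itself, which is an argument for the simpler `map(_.vector).distinct` in the diff.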
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/15450

[SPARK-3261] [MLLIB] KMeans clusterer can return duplicate cluster centers

## What changes were proposed in this pull request?

Return potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.

## How was this patch tested?

Existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-3261

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15450.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15450

commit 42279b8e042aedaf54ae3900fc1b050d2a1dacef
Author: Sean Owen
Date: 2016-10-12T12:30:02Z

    Return potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected