[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20520 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/20520#discussion_r167308210 --- Diff: python/pyspark/ml/tests.py --- @@ -1620,6 +1621,23 @@ def test_kmeans_summary(self): self.assertEqual(s.k, 2) +class KMeansTests(SparkSessionTestCase): + +def test_kmeans_cosine_distance(self): +data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 10.0]),), +(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 4.4]),), +(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 90.0]),)] +df = self.spark.createDataFrame(data, ["features"]) +kmeans = KMeans(k=3, seed=1) +kmeans.setDistanceMeasure("cosine") +model = kmeans.fit(df) +result = model.transform(df).rdd.collectAsMap() +self.assertTrue(result[Vectors.dense([1.0, 1.0])] == result[Vectors.dense([10.0, 10.0])]) --- End diff -- yes, the order is preserved so you can do the same as doctest. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20520#discussion_r167165450 --- Diff: python/pyspark/ml/tests.py --- @@ -1620,6 +1621,23 @@ def test_kmeans_summary(self): self.assertEqual(s.k, 2) +class KMeansTests(SparkSessionTestCase): + +def test_kmeans_cosine_distance(self): +data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 10.0]),), +(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 4.4]),), +(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 90.0]),)] +df = self.spark.createDataFrame(data, ["features"]) +kmeans = KMeans(k=3, seed=1) +kmeans.setDistanceMeasure("cosine") +model = kmeans.fit(df) +result = model.transform(df).rdd.collectAsMap() +self.assertTrue(result[Vectors.dense([1.0, 1.0])] == result[Vectors.dense([10.0, 10.0])]) --- End diff -- @BryanCutler is it guaranteed that the ordering is not changed? And is the ordering deterministic? I did it like this and now how you suggest because I thought that the ordering is not guaranteed, am I wrong? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/20520#discussion_r167089783 --- Diff: python/pyspark/ml/tests.py --- @@ -1620,6 +1621,23 @@ def test_kmeans_summary(self): self.assertEqual(s.k, 2) +class KMeansTests(SparkSessionTestCase): + +def test_kmeans_cosine_distance(self): +data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 10.0]),), +(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 4.4]),), +(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 90.0]),)] +df = self.spark.createDataFrame(data, ["features"]) +kmeans = KMeans(k=3, seed=1) +kmeans.setDistanceMeasure("cosine") +model = kmeans.fit(df) +result = model.transform(df).rdd.collectAsMap() +self.assertTrue(result[Vectors.dense([1.0, 1.0])] == result[Vectors.dense([10.0, 10.0])]) --- End diff -- I meant something like this, similar to how its done in the KMeans pydoc ``` >>> transformed = model.transform(df).select("features", "prediction") >>> rows = transformed.collect() >>> rows[0].prediction == rows[1].prediction True >>> rows[2].prediction == rows[3].prediction ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/20520#discussion_r167089401 --- Diff: python/pyspark/ml/tests.py --- @@ -1620,6 +1621,23 @@ def test_kmeans_summary(self): self.assertEqual(s.k, 2) +class KMeansTests(SparkSessionTestCase): + +def test_kmeans_cosine_distance(self): +data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 10.0]),), +(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 4.4]),), +(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 90.0]),)] +df = self.spark.createDataFrame(data, ["features"]) +kmeans = KMeans(k=3, seed=1) +kmeans.setDistanceMeasure("cosine") --- End diff -- It probably a little more common to put the param in the constructor, which should be tested also. you put `setDistanceMeasure` in `test_kmeans_param` which seems better, not a big deal though --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20520#discussion_r167082120 --- Diff: python/pyspark/ml/tests.py --- @@ -1620,6 +1621,23 @@ def test_kmeans_summary(self): self.assertEqual(s.k, 2) +class KMeansTests(SparkSessionTestCase): + +def test_kmeans_cosine_distance(self): +data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 10.0]),), +(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 4.4]),), +(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 90.0]),)] +df = self.spark.createDataFrame(data, ["features"]) +kmeans = KMeans(k=3, seed=1) +kmeans.setDistanceMeasure("cosine") --- End diff -- it was just to test that this method is working. Do you think it is better to switch to what you suggested? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20520#discussion_r167081534 --- Diff: python/pyspark/ml/tests.py --- @@ -1620,6 +1621,23 @@ def test_kmeans_summary(self): self.assertEqual(s.k, 2) +class KMeansTests(SparkSessionTestCase): + +def test_kmeans_cosine_distance(self): +data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 10.0]),), +(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 4.4]),), +(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 90.0]),)] +df = self.spark.createDataFrame(data, ["features"]) +kmeans = KMeans(k=3, seed=1) +kmeans.setDistanceMeasure("cosine") +model = kmeans.fit(df) +result = model.transform(df).rdd.collectAsMap() +self.assertTrue(result[Vectors.dense([1.0, 1.0])] == result[Vectors.dense([10.0, 10.0])]) --- End diff -- sorry but I can't really understand what you are thinking about: here I compare that the prediction for two points contains the same value. Since the value is not known in advance I cannot just check if the dataframe is equal to something predefined. Am I missing something? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/20520#discussion_r167048301 --- Diff: python/pyspark/ml/tests.py --- @@ -1620,6 +1621,23 @@ def test_kmeans_summary(self): self.assertEqual(s.k, 2) +class KMeansTests(SparkSessionTestCase): + +def test_kmeans_cosine_distance(self): +data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 10.0]),), +(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 4.4]),), +(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 90.0]),)] +df = self.spark.createDataFrame(data, ["features"]) +kmeans = KMeans(k=3, seed=1) +kmeans.setDistanceMeasure("cosine") --- End diff -- minor, but why not just `kmeans = KMeans(k=3, distanceMeasure="cosine", seed=1)`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/20520#discussion_r167048075 --- Diff: python/pyspark/ml/tests.py --- @@ -1620,6 +1621,23 @@ def test_kmeans_summary(self): self.assertEqual(s.k, 2) +class KMeansTests(SparkSessionTestCase): + +def test_kmeans_cosine_distance(self): +data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 10.0]),), +(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 4.4]),), +(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 90.0]),)] +df = self.spark.createDataFrame(data, ["features"]) +kmeans = KMeans(k=3, seed=1) +kmeans.setDistanceMeasure("cosine") +model = kmeans.fit(df) +result = model.transform(df).rdd.collectAsMap() +self.assertTrue(result[Vectors.dense([1.0, 1.0])] == result[Vectors.dense([10.0, 10.0])]) --- End diff -- It's a little awkward to collectAsMap and compare like this, why not just regular collect and compare with `data` above? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/20520 [SPARK-23344][PYTHON][ML] Add distanceMeasure param to KMeans ## What changes were proposed in this pull request? SPARK-22119 introduced a new parameter for KMeans, ie. `distanceMeasure`. The PR adds it also to the Python interface. ## How was this patch tested? added UTs You can merge this pull request into a Git repository by running: $ git pull https://github.com/mgaido91/spark SPARK-23344 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20520.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20520 commit 65da587cb32bbf002fbdd54353e49399a8b9de3f Author: Marco Gaido Date: 2018-02-06T15:14:27Z [SPARK-23344][PYTHON][ML] Add distanceMeasure param to KMeans --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org