[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...

2018-02-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20520


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...

2018-02-09 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20520#discussion_r167308210
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1620,6 +1621,23 @@ def test_kmeans_summary(self):
 self.assertEqual(s.k, 2)
 
 
+class KMeansTests(SparkSessionTestCase):
+
+def test_kmeans_cosine_distance(self):
+data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 
10.0]),),
+(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 
4.4]),),
+(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 
90.0]),)]
+df = self.spark.createDataFrame(data, ["features"])
+kmeans = KMeans(k=3, seed=1)
+kmeans.setDistanceMeasure("cosine")
+model = kmeans.fit(df)
+result = model.transform(df).rdd.collectAsMap()
+self.assertTrue(result[Vectors.dense([1.0, 1.0])] == 
result[Vectors.dense([10.0, 10.0])])
--- End diff --

yes, the order is preserved so you can do the same as doctest.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...

2018-02-09 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20520#discussion_r167165450
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1620,6 +1621,23 @@ def test_kmeans_summary(self):
 self.assertEqual(s.k, 2)
 
 
+class KMeansTests(SparkSessionTestCase):
+
+def test_kmeans_cosine_distance(self):
+data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 
10.0]),),
+(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 
4.4]),),
+(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 
90.0]),)]
+df = self.spark.createDataFrame(data, ["features"])
+kmeans = KMeans(k=3, seed=1)
+kmeans.setDistanceMeasure("cosine")
+model = kmeans.fit(df)
+result = model.transform(df).rdd.collectAsMap()
+self.assertTrue(result[Vectors.dense([1.0, 1.0])] == 
result[Vectors.dense([10.0, 10.0])])
--- End diff --

@BryanCutler is it guaranteed that the ordering is not changed? And is the 
ordering deterministic? I did it like this and now how you suggest because I 
thought that the ordering is not guaranteed, am I wrong?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...

2018-02-08 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20520#discussion_r167089783
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1620,6 +1621,23 @@ def test_kmeans_summary(self):
 self.assertEqual(s.k, 2)
 
 
+class KMeansTests(SparkSessionTestCase):
+
+def test_kmeans_cosine_distance(self):
+data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 
10.0]),),
+(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 
4.4]),),
+(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 
90.0]),)]
+df = self.spark.createDataFrame(data, ["features"])
+kmeans = KMeans(k=3, seed=1)
+kmeans.setDistanceMeasure("cosine")
+model = kmeans.fit(df)
+result = model.transform(df).rdd.collectAsMap()
+self.assertTrue(result[Vectors.dense([1.0, 1.0])] == 
result[Vectors.dense([10.0, 10.0])])
--- End diff --

I meant something like this, similar to how its done in the KMeans pydoc
```
>>> transformed = model.transform(df).select("features", "prediction")
>>> rows = transformed.collect()
>>> rows[0].prediction == rows[1].prediction
True
>>> rows[2].prediction == rows[3].prediction
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...

2018-02-08 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20520#discussion_r167089401
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1620,6 +1621,23 @@ def test_kmeans_summary(self):
 self.assertEqual(s.k, 2)
 
 
+class KMeansTests(SparkSessionTestCase):
+
+def test_kmeans_cosine_distance(self):
+data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 
10.0]),),
+(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 
4.4]),),
+(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 
90.0]),)]
+df = self.spark.createDataFrame(data, ["features"])
+kmeans = KMeans(k=3, seed=1)
+kmeans.setDistanceMeasure("cosine")
--- End diff --

It probably a little more common to put the param in the constructor, which 
should be tested also.  you put `setDistanceMeasure` in `test_kmeans_param` 
which seems better, not a big deal though


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...

2018-02-08 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20520#discussion_r167082120
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1620,6 +1621,23 @@ def test_kmeans_summary(self):
 self.assertEqual(s.k, 2)
 
 
+class KMeansTests(SparkSessionTestCase):
+
+def test_kmeans_cosine_distance(self):
+data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 
10.0]),),
+(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 
4.4]),),
+(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 
90.0]),)]
+df = self.spark.createDataFrame(data, ["features"])
+kmeans = KMeans(k=3, seed=1)
+kmeans.setDistanceMeasure("cosine")
--- End diff --

it was just to test that this method is working. Do you think it is better 
to switch to what you suggested?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...

2018-02-08 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20520#discussion_r167081534
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1620,6 +1621,23 @@ def test_kmeans_summary(self):
 self.assertEqual(s.k, 2)
 
 
+class KMeansTests(SparkSessionTestCase):
+
+def test_kmeans_cosine_distance(self):
+data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 
10.0]),),
+(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 
4.4]),),
+(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 
90.0]),)]
+df = self.spark.createDataFrame(data, ["features"])
+kmeans = KMeans(k=3, seed=1)
+kmeans.setDistanceMeasure("cosine")
+model = kmeans.fit(df)
+result = model.transform(df).rdd.collectAsMap()
+self.assertTrue(result[Vectors.dense([1.0, 1.0])] == 
result[Vectors.dense([10.0, 10.0])])
--- End diff --

sorry but I can't really understand what you are thinking about: here I 
compare that the prediction for two points contains the same value. Since the 
value is not known in advance I cannot just check if the dataframe is equal to 
something predefined. Am I missing something?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...

2018-02-08 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20520#discussion_r167048301
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1620,6 +1621,23 @@ def test_kmeans_summary(self):
 self.assertEqual(s.k, 2)
 
 
+class KMeansTests(SparkSessionTestCase):
+
+def test_kmeans_cosine_distance(self):
+data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 
10.0]),),
+(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 
4.4]),),
+(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 
90.0]),)]
+df = self.spark.createDataFrame(data, ["features"])
+kmeans = KMeans(k=3, seed=1)
+kmeans.setDistanceMeasure("cosine")
--- End diff --

minor, but why not just `kmeans = KMeans(k=3, distanceMeasure="cosine", 
seed=1)`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...

2018-02-08 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20520#discussion_r167048075
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1620,6 +1621,23 @@ def test_kmeans_summary(self):
 self.assertEqual(s.k, 2)
 
 
+class KMeansTests(SparkSessionTestCase):
+
+def test_kmeans_cosine_distance(self):
+data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 
10.0]),),
+(Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 
4.4]),),
+(Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 
90.0]),)]
+df = self.spark.createDataFrame(data, ["features"])
+kmeans = KMeans(k=3, seed=1)
+kmeans.setDistanceMeasure("cosine")
+model = kmeans.fit(df)
+result = model.transform(df).rdd.collectAsMap()
+self.assertTrue(result[Vectors.dense([1.0, 1.0])] == 
result[Vectors.dense([10.0, 10.0])])
--- End diff --

It's a little awkward to collectAsMap and compare like this, why not just 
regular collect and compare with `data` above?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20520: [SPARK-23344][PYTHON][ML] Add distanceMeasure par...

2018-02-06 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/20520

[SPARK-23344][PYTHON][ML] Add distanceMeasure param to KMeans

## What changes were proposed in this pull request?

SPARK-22119 introduced a new parameter for KMeans, ie. `distanceMeasure`. 
The PR adds it also to the Python interface.

## How was this patch tested?

added UTs

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-23344

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20520.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20520


commit 65da587cb32bbf002fbdd54353e49399a8b9de3f
Author: Marco Gaido 
Date:   2018-02-06T15:14:27Z

[SPARK-23344][PYTHON][ML] Add distanceMeasure param to KMeans




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org