[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...

2018-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/22620#discussion_r222698014
  
--- Diff: python/pyspark/sql/udf.py ---
@@ -310,9 +319,11 @@ def register(self, name, f, returnType=None):
 "Invalid returnType: data type can not be specified when f is"
 "a user-defined function, but got %s." % returnType)
 if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF,
-  PythonEvalType.SQL_SCALAR_PANDAS_UDF]:
+  PythonEvalType.SQL_SCALAR_PANDAS_UDF,
+  PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF]:
--- End diff --

I opened https://issues.apache.org/jira/browse/SPARK-25640 to track this.

To be clear, this is transparent to end users, but I agree it can be 
confusing to developers.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...

2018-10-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22620





[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...

2018-10-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22620#discussion_r222487117
  
--- Diff: python/pyspark/sql/udf.py ---
@@ -310,9 +319,11 @@ def register(self, name, f, returnType=None):
 "Invalid returnType: data type can not be specified when f is"
 "a user-defined function, but got %s." % returnType)
 if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF,
-  PythonEvalType.SQL_SCALAR_PANDAS_UDF]:
+  PythonEvalType.SQL_SCALAR_PANDAS_UDF,
+  PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF]:
--- End diff --

These need to be clearly defined in the Apache Spark 3.0 release; otherwise, it 
might be confusing to both developers and end users. :-)





[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...

2018-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/22620#discussion_r222456993
  
--- Diff: python/pyspark/sql/udf.py ---
@@ -310,9 +319,11 @@ def register(self, name, f, returnType=None):
 "Invalid returnType: data type can not be specified when f is"
 "a user-defined function, but got %s." % returnType)
 if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF,
-  PythonEvalType.SQL_SCALAR_PANDAS_UDF]:
+  PythonEvalType.SQL_SCALAR_PANDAS_UDF,
+  PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF]:
--- End diff --

We don't need it here:

Users specify GROUPED_AGG only. GROUPED_AGG is turned into the WINDOW_AGG eval 
type in WindowInPandasExec.

Admittedly, there is a bit of confusion here that we can improve. Before WINDOW_AGG, 
we simply never had a user-specified UDF type that maps to multiple eval types.
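
To make the point above concrete, here is a toy model in plain Python. It is only a sketch and not Spark's code: the `EvalType` enum, its string values, and both helper functions are hypothetical stand-ins, and the `over_window` flag merely imitates the decision that the comment attributes to WindowInPandasExec.

```python
from enum import Enum


class EvalType(Enum):
    """Illustrative stand-ins for the eval types named in this thread."""
    SQL_BATCHED_UDF = "batched"
    SQL_SCALAR_PANDAS_UDF = "scalar_pandas"
    SQL_GROUPED_AGG_PANDAS_UDF = "grouped_agg_pandas"
    SQL_WINDOW_AGG_PANDAS_UDF = "window_agg_pandas"


# Eval types a user can attach to a function and then pass to register().
REGISTRABLE = {
    EvalType.SQL_BATCHED_UDF,
    EvalType.SQL_SCALAR_PANDAS_UDF,
    EvalType.SQL_GROUPED_AGG_PANDAS_UDF,
}


def check_registrable(eval_type):
    """Hypothetical helper imitating the membership check in the diff."""
    if eval_type not in REGISTRABLE:
        raise ValueError("Invalid eval type for register(): %s" % eval_type.name)
    return eval_type


def eval_type_for_execution(eval_type, over_window):
    """Hypothetical helper: the engine, not the user, picks the window variant."""
    if eval_type is EvalType.SQL_GROUPED_AGG_PANDAS_UDF and over_window:
        return EvalType.SQL_WINDOW_AGG_PANDAS_UDF
    return eval_type
```

In this model the window variant never reaches the registration check, because it only appears after the user-facing type has been rewritten for execution.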







[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...

2018-10-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22620#discussion_r222452544
  
--- Diff: python/pyspark/sql/udf.py ---
@@ -310,9 +319,11 @@ def register(self, name, f, returnType=None):
 "Invalid returnType: data type can not be specified when f is"
 "a user-defined function, but got %s." % returnType)
 if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF,
-  PythonEvalType.SQL_SCALAR_PANDAS_UDF]:
+  PythonEvalType.SQL_SCALAR_PANDAS_UDF,
+  PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF]:
--- End diff --

How about SQL_WINDOW_AGG_PANDAS_UDF?





[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...

2018-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/22620#discussion_r222421940
  
--- Diff: python/pyspark/sql/udf.py ---
@@ -298,6 +298,15 @@ def register(self, name, f, returnType=None):
 >>> spark.sql("SELECT add_one(id) FROM range(3)").collect()  # doctest: +SKIP
 [Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)]
 
+>>> @pandas_udf("integer", PandasUDFType.GROUPED_AGG)  # doctest: +SKIP
+... def sum_udf(v):
+...     return v.sum()
+...
+>>> _ = spark.udf.register("sum_udf", sum_udf)  # doctest: +SKIP
--- End diff --

Ha. I see..





[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...

2018-10-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22620#discussion_r222417513
  
--- Diff: python/pyspark/sql/udf.py ---
@@ -298,6 +298,15 @@ def register(self, name, f, returnType=None):
 >>> spark.sql("SELECT add_one(id) FROM range(3)").collect()  # doctest: +SKIP
 [Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)]
 
+>>> @pandas_udf("integer", PandasUDFType.GROUPED_AGG)  # doctest: +SKIP
+... def sum_udf(v):
+...     return v.sum()
+...
+>>> _ = spark.udf.register("sum_udf", sum_udf)  # doctest: +SKIP
--- End diff --

It hides output like ...

```
>>> spark.udf.register("sum_udf", sum_udf)

```

in the doctest.
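
For anyone else wondering, the effect can be reproduced with the standard `doctest` module alone. This is a minimal sketch with no Spark involved: `register` below is a hypothetical stand-in that simply returns the function it is given, and the two snippet constants are made up for the illustration.

```python
import doctest


def register(name, f):
    """Hypothetical stand-in for spark.udf.register: it returns the function."""
    return f


# Same call twice: once discarding the return value, once leaving it visible.
WITH_ASSIGNMENT = '>>> _ = register("sum_udf", sum)\n'
WITHOUT_ASSIGNMENT = '>>> register("sum_udf", sum)\n'


def run_snippet(source):
    """Run one doctest snippet and return its (failed, attempted) results."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(source, {"register": register}, "demo", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Swallow the failure report so only the counts are observed.
    return runner.run(test, out=lambda text: None)


hidden = run_snippet(WITH_ASSIGNMENT)
shown = run_snippet(WITHOUT_ASSIGNMENT)
```

The snippet with the assignment produces no output, so it matches the empty expectation. The bare call echoes the returned object's repr, which the doctest does not expect, so that example is counted as a failure.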






[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...

2018-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/22620#discussion_r222411585
  
--- Diff: python/pyspark/sql/udf.py ---
@@ -298,6 +298,15 @@ def register(self, name, f, returnType=None):
 >>> spark.sql("SELECT add_one(id) FROM range(3)").collect()  # doctest: +SKIP
 [Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)]
 
+>>> @pandas_udf("integer", PandasUDFType.GROUPED_AGG)  # doctest: +SKIP
+... def sum_udf(v):
+...     return v.sum()
+...
+>>> _ = spark.udf.register("sum_udf", sum_udf)  # doctest: +SKIP
--- End diff --

What is the "_ =" thing here?





[GitHub] spark pull request #22620: [SPARK-25601][PYTHON] Register Grouped aggregate ...

2018-10-03 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/22620

[SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for 
SQL Statement

## What changes were proposed in this pull request?

This PR proposes allowing grouped aggregate vectorized UDFs (pandas UDFs) to be 
registered for use in SQL statements, for instance:

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("integer", PandasUDFType.GROUPED_AGG)  # doctest: +SKIP
def sum_udf(v):
    return v.sum()

spark.udf.register("sum_udf", sum_udf)  # doctest: +SKIP
q = "SELECT sum_udf(v1) FROM VALUES (3, 0), (2, 0), (1, 1) tbl(v1, v2) GROUP BY v2"
spark.sql(q).show()

# +-----------+
# |sum_udf(v1)|
# +-----------+
# |          1|
# |          5|
# +-----------+
```
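
For readers without a Spark session at hand, the grouping semantics of the query above can be approximated in plain Python. This is only a sketch of what the SQL computes, not of how Spark executes it: `group_and_aggregate` is a hypothetical helper, and this `sum_udf` takes a plain list where the real UDF receives a pandas Series.

```python
from collections import defaultdict

# Rows of tbl(v1, v2) from the query above.
rows = [(3, 0), (2, 0), (1, 1)]


def sum_udf(values):
    """Plain-Python stand-in for the pandas UDF: many values in, one scalar out."""
    return sum(values)


def group_and_aggregate(rows, agg):
    """Collect the v1 values per v2 key, then call `agg` once per group."""
    groups = defaultdict(list)
    for v1, v2 in rows:
        groups[v2].append(v1)
    return {key: agg(values) for key, values in groups.items()}


result = group_and_aggregate(rows, sum_udf)
```

Group `v2 = 0` sums to 5 and group `v2 = 1` sums to 1, the same two values that appear in the table above.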

## How was this patch tested?

Manual test and unit test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-25601

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22620.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22620


commit 06a7bd0c7daed0f3af5b42c1ea8a9b4b5e2e6216
Author: hyukjinkwon 
Date:   2018-10-03T11:13:32Z

Register Grouped aggregate UDF Vectorized UDFs for SQL Statement



