Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r143506845
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,69 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18664
Bryan, I haven't created. Go ahead!
On Fri, Oct 6, 2017 at 5:45 PM Bryan Cutler
wrote:
> Thanks all for the discussion. I think there are a lot of subtleties at
>
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r143263694
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,69 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18664
If we all agree on the necessity of a design doc first, I can create a Jira
and we can make progress there.
What do you all think? @BryanCutler @gatorsmile @HyukjinKwon
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18664
I agree. I think some high level document describing these differences so
we can discuss it. I think we should be more careful about Arrow-version
behavior before releasing support for timestamp
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18664
> The baseline should be (as said above): Internal optimisation should not
introduce any behaviour change, and we are discouraged to change the previous
behaviour unless it has bugs in gene
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18732
Hi All, I think all comments should be addressed at this point, except for
the naming comment from @rxin.
If I missed something or if there is anything else you want me to address
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r143213000
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,69 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r143198047
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,69 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18664
cc @ueshin
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18664
Thanks @gatorsmile for the constructive feedback!
I don't want to make this more complicated but I also want to make sure we
are aware that there is also difference between Arro
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18664
I agree with Bryan. I think we might want to rethink the assumption that
toPandas result with arrow / without arrow should be 100% the same.
For instance, non-Arrow doesn't re
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r143083397
--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,65 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r143081782
--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,65 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r143081592
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,69 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r143032320
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r143033289
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):
def wrap_pandas_udf(f, return_type
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142961120
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):
def wrap_pandas_udf(f, return_type
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18732
@HyukjinKwon Thanks for the summarry!
* https://github.com/apache/spark/pull/18732#discussion_r142735696
`ArrowPandasSerialzer`I will spend some time address this today.
* https
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142957552
--- Diff: python/pyspark/sql/functions.py ---
@@ -2058,7 +2058,7 @@ def __init__(self, func, returnType, name=None,
vectorized=False):
self
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142956597
--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,65 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142949557
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):
def wrap_pandas_udf(f, return_type
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142948551
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142945465
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142944123
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142845456
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142840490
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
---
@@ -519,3 +519,18 @@ case class CoGroup
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142839010
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
---
@@ -26,6 +26,25 @@ import
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18732
I pushed a new commit addressing the comments. Let me scan through the
comments again. I think there are some comments around worker.py not being
addressed yet
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142801623
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
---
@@ -26,6 +26,25 @@ import
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142796899
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
---
@@ -519,3 +519,18 @@ case class CoGroup
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142770337
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala
---
@@ -0,0 +1,89 @@
+/*
+ * Licensed
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142740947
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -435,6 +435,33 @@ class RelationalGroupedDataset protected
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142704642
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,66 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142704126
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -435,6 +435,33 @@ class RelationalGroupedDataset protected
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142703829
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -435,6 +435,33 @@ class RelationalGroupedDataset protected
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142703487
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -435,6 +435,33 @@ class RelationalGroupedDataset protected
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142697418
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):
def wrap_pandas_udf(f, return_type
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142695929
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala
---
@@ -0,0 +1,95 @@
+/*
+ * Licensed
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142695843
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala
---
@@ -111,6 +111,9 @@ object ExtractPythonUDFs
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142695501
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
---
@@ -26,6 +26,25 @@ import
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142695129
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
---
@@ -519,3 +519,18 @@ case class CoGroup
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142694835
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):
def wrap_pandas_udf(f, return_type
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142694484
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):
def wrap_pandas_udf(f, return_type
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142694381
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):
def wrap_pandas_udf(f, return_type
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142693686
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):
def wrap_pandas_udf(f, return_type
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142693843
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +74,35 @@ def wrap_udf(f, return_type):
def wrap_pandas_udf(f, return_type
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142692448
--- Diff: python/pyspark/sql/tests.py ---
@@ -3106,8 +3106,9 @@ def assertFramesEqual(self, df_with_arrow,
df_without):
self.assertTrue
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142691179
--- Diff: python/pyspark/sql/tests.py ---
@@ -3106,8 +3106,9 @@ def assertFramesEqual(self, df_with_arrow,
df_without):
self.assertTrue
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142690650
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,66 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142690602
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,66 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142689702
--- Diff: python/pyspark/sql/tests.py ---
@@ -3106,8 +3106,9 @@ def assertFramesEqual(self, df_with_arrow,
df_without):
self.assertTrue
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142678914
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala
---
@@ -0,0 +1,89 @@
+/*
+ * Licensed
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142583590
--- Diff: python/pyspark/sql/group.py ---
@@ -192,7 +193,66 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142583338
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
---
@@ -44,14 +63,17 @@ case class
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142572643
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
---
@@ -44,14 +63,22 @@ case class
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142572356
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -47,7 +47,7 @@ import org.apache.spark.sql.types.StructType
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142571653
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +75,37 @@ def wrap_udf(f, return_type):
def wrap_pandas_udf(f, return_type
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142571660
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +75,37 @@ def wrap_udf(f, return_type):
def wrap_pandas_udf(f, return_type
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142571075
--- Diff: python/pyspark/worker.py ---
@@ -32,8 +32,9 @@
from pyspark.serializers import write_with_length, write_int, read_long
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142571047
--- Diff: python/pyspark/sql/tests.py ---
@@ -3376,6 +3377,132 @@ def test_vectorized_udf_empty_partition(self):
res = df.select(f(col(
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142571038
--- Diff: python/pyspark/sql/tests.py ---
@@ -3376,6 +3377,132 @@ def test_vectorized_udf_empty_partition(self):
res = df.select(f(col(
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142571056
--- Diff: python/pyspark/sql/tests.py ---
@@ -3376,6 +3377,133 @@ def test_vectorized_udf_empty_partition(self):
res = df.select(f(col(
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142533141
--- Diff: python/pyspark/sql/functions.py ---
@@ -2058,7 +2058,7 @@ def __init__(self, func, returnType, name=None,
vectorized=False):
self
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142523354
--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,65 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142518730
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
---
@@ -26,6 +26,28 @@ import
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142500980
--- Diff: python/pyspark/sql/functions.py ---
@@ -2129,8 +2130,12 @@ def _create_udf(f, returnType, vectorized):
def _udf(f, returnType
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142499712
--- Diff: python/pyspark/sql/functions.py ---
@@ -2129,8 +2130,12 @@ def _create_udf(f, returnType, vectorized):
def _udf(f, returnType
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142499286
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
---
@@ -519,3 +519,18 @@ case class CoGroup
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142499092
--- Diff: python/pyspark/sql/functions.py ---
@@ -2120,6 +2120,7 @@ def wrapper(*args):
else self.func.__class__
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142498939
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala
---
@@ -0,0 +1,91 @@
+/*
+ * Licensed
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142498841
--- Diff: python/pyspark/worker.py ---
@@ -74,17 +75,33 @@ def wrap_udf(f, return_type):
def wrap_pandas_udf(f, return_type
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142498880
--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,37 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142497616
--- Diff: python/pyspark/sql/functions.py ---
@@ -2120,6 +2120,7 @@ def wrapper(*args):
else self.func.__class__
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142486439
--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,37 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142482842
--- Diff: python/pyspark/sql/functions.py ---
@@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()):
@since(2.3)
def pandas_udf(f
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142482577
--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,37 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142478440
--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,37 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142474570
--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,37 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142448316
--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,37 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142446103
--- Diff: python/pyspark/sql/types.py ---
@@ -1624,6 +1624,34 @@ def toArrowType(dt):
return arrow_type
+def from_pandas_type(dt
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142445946
--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,37 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142440939
--- Diff: python/pyspark/sql/functions.py ---
@@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()):
@since(2.3)
def pandas_udf(f
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142440350
--- Diff: python/pyspark/sql/functions.py ---
@@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()):
@since(2.3)
def pandas_udf(f
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142439787
--- Diff: python/pyspark/sql/functions.py ---
@@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()):
@since(2.3)
def pandas_udf(f
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142439736
--- Diff: python/pyspark/sql/functions.py ---
@@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()):
@since(2.3)
def pandas_udf(f
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142439639
--- Diff: python/pyspark/sql/functions.py ---
@@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()):
@since(2.3)
def pandas_udf(f
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142438748
--- Diff: python/pyspark/worker.py ---
@@ -32,8 +32,9 @@
from pyspark.serializers import write_with_length, write_int, read_long
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142438464
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala
---
@@ -0,0 +1,91 @@
+/*
+ * Licensed
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142438142
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
---
@@ -44,14 +66,24 @@ case class
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142438108
--- Diff: python/pyspark/sql/functions.py ---
@@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()):
@since(2.3)
def pandas_udf(f
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r142421666
--- Diff: python/pyspark/sql/functions.py ---
@@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()):
@since(2.3)
def pandas_udf(f
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18732
@HyukjinKwon Thanks for the feedback. I will address those and update
tomorrow.
---
-
To unsubscribe, e-mail: reviews
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18732
@rxin, `transform` takes a function: pd.Series -> pd.Series and apply the
function on all columns:
```
df.show()
id v1 v2 v3
a 1.0 4.0 0.0
a 2.0
Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18732
@rxin This is similar to flatMapGroups since the return value of the user
function is a list of rows (pd.DataFrame) rather than a single row
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r141976177
--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,28 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r141955128
--- Diff: python/pyspark/sql/functions.py ---
@@ -2129,7 +2129,8 @@ def _create_udf(f, returnType, vectorized):
def _udf(f, returnType
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r141953540
--- Diff: python/pyspark/sql/group.py ---
@@ -194,6 +194,28 @@ def pivot(self, pivot_col, values=None):
jgd = self._jgd.pivot(pivot_col
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r141953502
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala
---
@@ -0,0 +1,95 @@
+/*
+ * Licensed
Github user icexelloss commented on a diff in the pull request:
https://github.com/apache/spark/pull/18732#discussion_r141953474
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala
---
@@ -0,0 +1,95 @@
+/*
+ * Licensed
601 - 700 of 747 matches
Mail list logo