[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2017-01-23 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/15821 Bryan, I am working on: (1) Add more numbers to benchmark.py (2) Add support for date/timestamp/binary type (3) Fix memory leaking in the code. All these should

[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...

2016-12-01 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/15821 @BryanCutler , I have been working based on your branch here: https://github.com/BryanCutler/spark/tree/wip-toPandas_with_arrow-SPARK-13534 Is this the right one? --- If your

[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...

2017-03-24 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/15821 @BryanCutler , there appears to be some stability issue in the current code. I am tried to repeated collect a DataFrame as Arrow BatchRecord in spark-shell and discovered that executors start

[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...

2017-03-27 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/15821 @BryanCutler Thanks! The issue is fixed with your new update. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...

2017-03-27 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/15821 @BryanCutler I think this patch is in a good shape that I want to release this code internally in Two Sigma for beta users. My understanding is support for timestamp and date

[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...

2017-03-27 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/15821 Got it. Can you put up a patch to throw exception for timestamp and date type to arrow-integration branch? I would do it but I don't have my laptop with me now... --- If your project is set up

[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...

2017-03-24 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/15821 @BryanCutler from the arrow-integration branch. Where is the memory leak patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well

[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-07-29 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18664 It looks like "SESSION_LOCAL_TIMEZONE" is not respected in most of the pyspark functionality. I think `df.collect()` and `df.toPandas` can be fixed to respect SESSION_LOCA

[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-07-31 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18664 @ueshin I amd +1 for fixing `df.collect()` and `df.toPandas()`, I don't think it is much of a backward-compatibility issue because the current behavior of `df.collect()` and `df.toPandas

[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-08-01 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18664 To Wes's concern, I think we are only dealing with values in UTC here, both Spark and Arrow internally represents timestamp as microseconds since epoch. To the two issues Bryan

[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Tim...

2017-08-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r131227296 --- Diff: python/pyspark/sql/tests.py --- @@ -3036,6 +3052,9 @@ def test_toPandas_arrow_toggle(self): pdf = df.toPandas

[GitHub] spark pull request #18933: [WIP][SPARK-21722][SQL][PYTHON] Enable timezone-a...

2017-08-15 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18933#discussion_r133229705 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -912,6 +912,14 @@ object SQLConf { .intConf

[GitHub] spark pull request #18732: groupby().apply() with pandas udf

2017-07-25 Thread icexelloss
GitHub user icexelloss opened a pull request: https://github.com/apache/spark/pull/18732 groupby().apply() with pandas udf ## What changes were proposed in this pull request? This PR adds an apply() function on df.groupby(). apply() takes a pandas udf

[GitHub] spark pull request #18659: [SPARK-21404][PYSPARK][WIP] Simple Python Vectori...

2017-07-25 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18659#discussion_r129412101 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala --- @@ -132,6 +135,61 @@ private[sql] object

[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-07-24 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18664 Excited to see this being worked on. > SQLConf.SESSION_LOCAL_TIMEZONE I like this the best. This presents timestamp in local time which is compatible with the existing `toPan

[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-04-27 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r113728111 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala --- @@ -0,0 +1,396 @@ +/* +* Licensed

[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-04-27 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r113728646 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala --- @@ -0,0 +1,396 @@ +/* +* Licensed

[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-04-27 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r113729309 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala --- @@ -0,0 +1,396 @@ +/* +* Licensed

[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-04-27 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r113730387 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala --- @@ -0,0 +1,396 @@ +/* +* Licensed

[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...

2017-04-27 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/15821 > An instance of this must be used each time a ArrowRecordBatch is created and then the batch and allocator must be released/closed after they have been processed I think it wo

[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Tim...

2017-07-28 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r130201966 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowUtils.scala --- @@ -42,6 +43,9 @@ object ArrowUtils { case

[GitHub] spark pull request #18787: [SPARK-21583][SQL] Create a ColumnarBatch from Ar...

2017-08-09 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18787#discussion_r132188490 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java --- @@ -65,15 +65,35 @@ final Row row

[GitHub] spark pull request #18787: [SPARK-21583][SQL] Create a ColumnarBatch from Ar...

2017-08-09 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18787#discussion_r132187027 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnarBatch.java --- @@ -65,15 +65,35 @@ final Row row

[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...

2017-05-15 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/15821 @BryanCutler , is Timestamp and Date type supported now with Arrow 0.3? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your

[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...

2017-05-16 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/15821 >@icexelloss , yes Arrow supports it but Spark stores timestamps is a different way which caused some complication. After talking with Holden, we agreed it was better to keep this PR to sim

[GitHub] spark issue #19284: [SPARK-22067][SQL] ArrowWriter should use position when ...

2017-09-19 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/19284 LGTM. What's the Arrow bug you mentioned? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142845456 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,67 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142840490 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -519,3 +519,18 @@ case class CoGroup

[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-04 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18732 I pushed a new commit addressing the comments. Let me scan through the comments again. I think there are some comments around worker.py not being addressed yet

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142839010 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala --- @@ -26,6 +26,25 @@ import

[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-06 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18664 If we all agree on the necessity of a design doc first, I can create a Jira and we can make progress there. What do you all think? @BryanCutler @gatorsmile @HyukjinKwon

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-06 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143263694 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,69 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-06 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18664 > The baseline should be (as said above): Internal optimisation should not introduce any behaviour change, and we are discouraged to change the previous behaviour unless it has bugs in gene

[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-06 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18664 I agree. I think some high level document describing these differences so we can discuss it. I think we should be more careful about Arrow-version behavior before releasing support for timestamp

[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-06 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18732 Hi All, I think all comments should be addressed at this point, except for the naming comment from @rxin. If I missed something or if there is anything else you want me to address

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-06 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143198047 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,69 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-06 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143213000 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,69 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL][WIP] Add Date and Timestamp ...

2017-10-06 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18664 Bryan, I haven't created. Go ahead! On Fri, Oct 6, 2017 at 5:45 PM Bryan Cutler <notificati...@github.com> wrote: > Thanks all for the discussion. I think there

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142583590 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,66 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142583338 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala --- @@ -44,14 +63,17 @@ case class

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142678914 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142689702 --- Diff: python/pyspark/sql/tests.py --- @@ -3106,8 +3106,9 @@ def assertFramesEqual(self, df_with_arrow, df_without): self.assertTrue

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142690650 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,66 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142690602 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,66 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142694484 --- Diff: python/pyspark/worker.py --- @@ -74,17 +74,35 @@ def wrap_udf(f, return_type): def wrap_pandas_udf(f, return_type

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142694381 --- Diff: python/pyspark/worker.py --- @@ -74,17 +74,35 @@ def wrap_udf(f, return_type): def wrap_pandas_udf(f, return_type

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142697418 --- Diff: python/pyspark/worker.py --- @@ -74,17 +74,35 @@ def wrap_udf(f, return_type): def wrap_pandas_udf(f, return_type

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142692448 --- Diff: python/pyspark/sql/tests.py --- @@ -3106,8 +3106,9 @@ def assertFramesEqual(self, df_with_arrow, df_without): self.assertTrue

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142695501 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala --- @@ -26,6 +26,25 @@ import

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142695129 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -519,3 +519,18 @@ case class CoGroup

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142694835 --- Diff: python/pyspark/worker.py --- @@ -74,17 +74,35 @@ def wrap_udf(f, return_type): def wrap_pandas_udf(f, return_type

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142691179 --- Diff: python/pyspark/sql/tests.py --- @@ -3106,8 +3106,9 @@ def assertFramesEqual(self, df_with_arrow, df_without): self.assertTrue

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142704126 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -435,6 +435,33 @@ class RelationalGroupedDataset protected

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142703487 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -435,6 +435,33 @@ class RelationalGroupedDataset protected

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142693843 --- Diff: python/pyspark/worker.py --- @@ -74,17 +74,35 @@ def wrap_udf(f, return_type): def wrap_pandas_udf(f, return_type

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142703829 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -435,6 +435,33 @@ class RelationalGroupedDataset protected

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142498880 --- Diff: python/pyspark/sql/group.py --- @@ -194,6 +194,37 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142498841 --- Diff: python/pyspark/worker.py --- @@ -74,17 +75,33 @@ def wrap_udf(f, return_type): def wrap_pandas_udf(f, return_type

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142498939 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala --- @@ -0,0 +1,91 @@ +/* + * Licensed

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142482842 --- Diff: python/pyspark/sql/functions.py --- @@ -2181,31 +2186,69 @@ def udf(f=None, returnType=StringType()): @since(2.3) def pandas_udf(f

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142486439 --- Diff: python/pyspark/sql/group.py --- @@ -194,6 +194,37 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142497616 --- Diff: python/pyspark/sql/functions.py --- @@ -2120,6 +2120,7 @@ def wrapper(*args): else self.func.__class__

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142500980 --- Diff: python/pyspark/sql/functions.py --- @@ -2129,8 +2130,12 @@ def _create_udf(f, returnType, vectorized): def _udf(f, returnType

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142474570 --- Diff: python/pyspark/sql/group.py --- @@ -194,6 +194,37 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142499286 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -519,3 +519,18 @@ case class CoGroup

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142499092 --- Diff: python/pyspark/sql/functions.py --- @@ -2120,6 +2120,7 @@ def wrapper(*args): else self.func.__class__

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142499712 --- Diff: python/pyspark/sql/functions.py --- @@ -2129,8 +2130,12 @@ def _create_udf(f, returnType, vectorized): def _udf(f, returnType

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142478440 --- Diff: python/pyspark/sql/group.py --- @@ -194,6 +194,37 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-04 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142740947 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -435,6 +435,33 @@ class RelationalGroupedDataset protected

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142518730 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala --- @@ -26,6 +26,28 @@ import

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142533141 --- Diff: python/pyspark/sql/functions.py --- @@ -2058,7 +2058,7 @@ def __init__(self, func, returnType, name=None, vectorized=False): self

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142523354 --- Diff: python/pyspark/sql/group.py --- @@ -194,6 +194,65 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142482577 --- Diff: python/pyspark/sql/group.py --- @@ -194,6 +194,37 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142571075 --- Diff: python/pyspark/worker.py --- @@ -32,8 +32,9 @@ from pyspark.serializers import write_with_length, write_int, read_long

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142571047 --- Diff: python/pyspark/sql/tests.py --- @@ -3376,6 +3377,132 @@ def test_vectorized_udf_empty_partition(self): res = df.select(f(col('id

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142571038 --- Diff: python/pyspark/sql/tests.py --- @@ -3376,6 +3377,132 @@ def test_vectorized_udf_empty_partition(self): res = df.select(f(col('id

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142571056 --- Diff: python/pyspark/sql/tests.py --- @@ -3376,6 +3377,133 @@ def test_vectorized_udf_empty_partition(self): res = df.select(f(col('id

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142571653 --- Diff: python/pyspark/worker.py --- @@ -74,17 +75,37 @@ def wrap_udf(f, return_type): def wrap_pandas_udf(f, return_type

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142571660 --- Diff: python/pyspark/worker.py --- @@ -74,17 +75,37 @@ def wrap_udf(f, return_type): def wrap_pandas_udf(f, return_type

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142572356 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -47,7 +47,7 @@ import org.apache.spark.sql.types.StructType

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-03 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r142572643 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala --- @@ -44,14 +63,22 @@ case class

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-09 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143507748 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -519,3 +519,18 @@ case class CoGroup

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-09 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143506845 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,69 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143740636 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -519,3 +519,4 @@ case class CoGroup

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143741944 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala --- @@ -44,14 +73,18 @@ case class

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143740157 --- Diff: python/pyspark/sql/tests.py --- @@ -3376,6 +3376,151 @@ def test_vectorized_udf_empty_partition(self): res = df.select(f(col('id

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143740078 --- Diff: python/pyspark/sql/tests.py --- @@ -3376,6 +3376,151 @@ def test_vectorized_udf_empty_partition(self): res = df.select(f(col('id

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143740129 --- Diff: python/pyspark/sql/tests.py --- @@ -3376,6 +3376,151 @@ def test_vectorized_udf_empty_partition(self): res = df.select(f(col('id

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143744197 --- Diff: python/pyspark/sql/functions.py --- @@ -2181,30 +2187,66 @@ def udf(f=None, returnType=StringType()): @since(2.3) def pandas_udf(f

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143740882 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/pythonLogicalOperators.scala --- @@ -0,0 +1,43

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143740773 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/pythonLogicalOperators.scala --- @@ -0,0 +1,43

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143810355 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,84 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143810539 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,84 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143810736 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,84 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143812619 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala --- @@ -44,14 +73,18 @@ case class

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143812311 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -435,6 +435,35 @@ class RelationalGroupedDataset protected

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143813642 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/FlatMapGroupsInPandasExec.scala --- @@ -0,0 +1,103 @@ +/* + * Licensed

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143809711 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,84 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark pull request #18732: [SPARK-20396][SQL][PySpark] groupby().apply() wit...

2017-10-10 Thread icexelloss
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/18732#discussion_r143810948 --- Diff: python/pyspark/sql/group.py --- @@ -192,7 +193,84 @@ def pivot(self, pivot_col, values=None): jgd = self._jgd.pivot(pivot_col

[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-10 Thread icexelloss
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/18732 @HyukjinKwon Thanks! Thanks for everyone for reviewing this tirelessly. --- - To unsubscribe, e-mail: reviews

  1   2   3   4   5   6   7   8   >