[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20295 @HyukjinKwon Thanks for the comment. I will continue with the current approach unless objection raises. I will work on comments and refinements in the next day or two. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20295 For https://github.com/apache/spark/pull/20295#issuecomment-360297123, I am fine without new serialization protocol actually. I didn't have a strong preference there because I wasn't sure if it's worth - complexity vs actual gain vaguely and seems that's now clarified there. I am okay with the current approach. @BryanCutler, I think the intention here is to follow other few APIs and `gapply` in R. I guess you meant the length and metadata stuff by "an optional kwargs to each pandas_udf to deal with 0-param udfs" if I remember correctly. I think that's slightly different because here the motivation is to provide consistent support similarly with other APIs vs the `kwargs` thing sounds pretty few concept to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86651/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20295 **[Test build #86651 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86651/testReport)** for PR 20295 at commit [`6b8fdbc`](https://github.com/apache/spark/commit/6b8fdbc61ce8ef03a5ac54d65babc29c0697c9c4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20295 **[Test build #86651 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86651/testReport)** for PR 20295 at commit [`6b8fdbc`](https://github.com/apache/spark/commit/6b8fdbc61ce8ef03a5ac54d65babc29c0697c9c4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/241/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86604/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20295 **[Test build #86604 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86604/testReport)** for PR 20295 at commit [`c7fccde`](https://github.com/apache/spark/commit/c7fccde701723c8808b095586e987ad02fc171c7). * This patch **fails PySpark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86606/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20295 **[Test build #86606 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86606/testReport)** for PR 20295 at commit [`cda97f1`](https://github.com/apache/spark/commit/cda97f142285bc01a2da96765d3703d37b4cb802). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20295 Hi all, I did some digging and I think adding a serialization form that serialize a key object along with a Arrow record batch is quite complicated because we are using ArrowStreamReader/Writer for sending batches and send extra key data would have to use a lower level Arrow API for sending/receiving batches. I did two things to convince myself the current approach is fine: * I add logic to de duplicate grouping key they are already in data columns. i.e., if a user calls ``` df.groupby('id').apply(foo_udf) ``` We will not send extra grouping columns because those are already part of data columns. Instead, we will just use the corresponding data column to get grouping key to pass to user function. However, if user calls: ``` df.groupby(df.id % 2).apply(foo_udf) ``` then an extra column `df.id % 2` will be created and sent to python worker. But I think this is an uncommon case. * I did some benchmark to see the impact of sending extra grouping column. I used a Spark DataFrame of a single column to maximize the effect of the extra grouping column (basically sending extra grouping column will double the data to be sent to python in the benchmark, however in real use cases the effect of sending extra grouping columns should be far less). Even with the setting of the benchmark, I have not observed performance diffs when sending extra grouping columns, I think this is because the total time is dominated by other parts of the computation. [micro benchmark](https://gist.github.com/icexelloss/88f6c6fdaf04aac39d68d74cd0942c07) I'd like to leave the work for more flexible arrow serialization as future work because it doesn't seems to affect performance of this patch and proceed with the current patch based on the two points above. What do you guys think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/206/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20295 **[Test build #86606 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86606/testReport)** for PR 20295 at commit [`cda97f1`](https://github.com/apache/spark/commit/cda97f142285bc01a2da96765d3703d37b4cb802). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/205/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20295 **[Test build #86604 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86604/testReport)** for PR 20295 at commit [`c7fccde`](https://github.com/apache/spark/commit/c7fccde701723c8808b095586e987ad02fc171c7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/203/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20295 Let me experiment with new serialization approach. Will update here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20295 To me, seems roughly fine. > Alternatively, we can implement a new serialization protocol for GROUP_MAP eval type, i.e, instead of sending an arrow batch, we could send a group row and then an arrow batch. I don't have a strong preference on this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20295 Yep, that's correct. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20295 How do we turn a single group column to a series? just repeat the group column? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20295 @cloud-fan Currently I sent group columns along with the extra data column. For example, if the original DataFrame has `id, v` and group column is `id`, the current implementation in this PR will send three series `id, id, v` to the python worker and send an `argOffsets` of `[1, 2]` to specify the data columns are `id, v`. The first value of the group column is used as the group key, because values in a group column are equal. I implemented it this way because it doesn't change the existing serialization protocol. Alternatively, we can implement a new serialization protocol for GROUP_MAP eval type, i.e, instead of sending an arrow batch, we could send a group row and then an arrow batch. What do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20295 How are you going to send the group columns? For a group we have only one group row and a bunch of data rows. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86290/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20295 **[Test build #86290 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86290/testReport)** for PR 20295 at commit [`7ce3fa7`](https://github.com/apache/spark/commit/7ce3fa79f8f63d4320c386ed81fa0b57d6bb7a91). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20295 cc @ueshin @HyukjinKwon @cloud-fan @viirya This PR implements discussion here https://github.com/apache/spark/pull/20211#pullrequestreview-87657832. There are more refinement needs to be done but I'd like to get some early feedback whether this approach looks good in general. The general idea is to pass grouping columns as extra columns to the python worker and to use `argOffsets` to specify which columns are grouping columns. Finally, we convert a grouping columns to a single tuple before enter user function. This is slightly inefficient because grouping columns always have the same value for each group. But I think this is OK because grouping columns should be relatively small comparing to the entire DataFrame. I can also implement some kind of de duplicate logic in `FlatMapGroupsInPandasExec`, however that would require creating another `UnsafeProjection`, I am not sure if it's worth it performance wise. WDYT? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: [WIP][SPARK-23011] Support alternative function form wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20295 **[Test build #86290 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86290/testReport)** for PR 20295 at commit [`7ce3fa7`](https://github.com/apache/spark/commit/7ce3fa79f8f63d4320c386ed81fa0b57d6bb7a91). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: wip: [SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86286/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: wip: [SPARK-23011] Support alternative function form wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20295 **[Test build #86286 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86286/testReport)** for PR 20295 at commit [`38195ac`](https://github.com/apache/spark/commit/38195ac7f0b9e1227cfc1e407de47e276b3fc43f). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: wip: [SPARK-23011] Support alternative function form wit...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20295 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20295: wip: [SPARK-23011] Support alternative function form wit...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20295 **[Test build #86286 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86286/testReport)** for PR 20295 at commit [`38195ac`](https://github.com/apache/spark/commit/38195ac7f0b9e1227cfc1e407de47e276b3fc43f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org