[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20567 I just opened another PR for adding a configuration - https://github.com/apache/spark/pull/20678. Let me close this one. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20567 Thanks! Happy Lunar New Year! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20567 I just opened https://github.com/apache/spark/pull/20625. I believe this is the smallest and simplest change .. Will turn this PR to add a configuration later. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20567 Yup, I will. Sorry for delaying it. I was trying to make the fix small as possible as I can. Let me just open it as a simplest way. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20567 @HyukjinKwon Will you submit a fix for the binary type today? We are very close to RC4. This is kind of urgent if we still want to block it in the Spark 2.3.0 release. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20567 Sure. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20567 ^ this change LGTM. Can we make a PR for this change only and leave the fallback part for Spark 2.4? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20567 > The binary type bug sounds like a blocker, can we just fix it surgically by checking the supported data types before going to the arrow optimization path? For now we can stick with that the current behavior is, i.e. throw exception. That's basically (https://github.com/apache/spark/pull/20567#issuecomment-365064243): ```python if # 'spark.sql.execution.arrow.enabled' true? require_minimum_pyarrow_version() try: to_arrow_schema(self.schema) # return the one with Arrow except Exception as e: raise Exception("'spark.sql.execution.arrow.enabled' blah blah ...") else: # return the one without Arrow ``` because `to_arrow_schema(self.schema)` checks the supported types like other Pandas/Arrow functionalities. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20567 The binary type bug sounds like a blocker, can we just fix it surgically by checking the supported data types before going to the arrow optimization path? For now we can stick with that the current behavior is, i.e. throw exception. The inconsistent behavior between `toPandas` and `createDataFrame` is confusing but may not be a blocker. We can fix it in Spark 2.4 and add a note in the migration guide. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20567 The root cause is Arrow conversion in Python side interprets binaries as `str`, and I here avoided this by checking if the type is what we supported or not. This is the most trivial fix. I made a fix safe and small as possible as I can here. I can fix the error message only but the size of change and diff is virtually the same - https://github.com/apache/spark/pull/20567#issuecomment-365064243. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20567 What is the root cause? Do we have a trivial fix to resolve/block it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20567 There is one more thing - https://github.com/apache/spark/pull/20567#issuecomment-364639922 We didn't complete binary type support yet in Python side but there is a hole here .. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20567 The behavior inconsistency between `toPandas` and `createDataFrame` looks confusing to end users, I have to admit. In the current stage, we are unable to merge the fix for these new features to Spark 2.3 branch. Let us wait for the release of Spark 2.3.0 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20567 I mean the actual change here is small. The diff maybe looks larger here because of removed `else`. Please check out the diff. It's quite a safe change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20567 Then, let us wait for the release of Spark 2.3.0. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20567 Just FYI, except option 3., the complexity in other options and the PR size will be all similar - https://github.com/apache/spark/pull/20567#issuecomment-364806378 and https://github.com/apache/spark/pull/20567#issuecomment-365064243 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20567 We are unable to contain option 3 in Spark 2.3.0. This is too big to merge it in the current stage. We still can do it in 2.3.1. If needed, I am fine to throw a better error message if the PR size is very small; otherwise, keep it unchanged in 2.3.0. Also cc @liancheng @yhuai --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20567 @gatorsmile and @rxin, The problem here is that `toPandas` just fails on unsupported types later and allows `BinaryType` with inconsistent conversion (https://github.com/apache/spark/pull/20567#issuecomment-364639922) in Arrow whereas `createDataFrame` allows fallback in both cases. This is the last one left (for now) about PySpark/Pandas interoperability which I found during testing out and I was thinking about targeting 2.3.0. So, for clarification, would you be uncomfortable with one of: 1. matching both toPandas and createDataFrame to fallback with a warning 2. matching both toPandas and createDataFrame to throw an exception 3. adding a configuration to control the fallback for both to target 2.3.0 (or 2.3.1 if the vote fails)? FYI, the current one in this PR is 1. If so, let me have two PRs, one for the error message for now to target 2.3.0 (or 2.3.1 if the vote fails), and one for adding a configuration to control the fallback to target master (and maybe 2.3.1). Does that make sense to both of you? cc @cloud-fan too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20567 RC3 is out. This change could be in 2.3.1 f the vote passes, or in 2.3.0 If the vote fails. For the reason above, we can't backport and change anything in the main codes until the release 2.3.0. So, you are worried of delaying the release more because it has been delayed pretty much already? I understand this but I would like to ask to get this (whether it throws an exception for both `toPandas` and `createDataFrame` or fallback for both) in to make the new feature out with consistency please. @rxin, do you think we should take this out in 2.3.0 too? Was this your opinion (https://github.com/apache/spark/pull/20567#issuecomment-365099515)? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20567 RC3 is out. Just to avoid new regressions that might be introduced in the new PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20567 ^ I am not saying that we should merge it now. I can do the opposite for `createDataFrame` given https://github.com/apache/spark/pull/20567#issuecomment-365100434 . My point is why it should be exclueded in 2.3.0 specifically while this can be considered as a backport - https://github.com/apache/spark/pull/20567#issuecomment-365077913 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20567 > Is there any specific worry from this change, that might shake the 2.3.0 release speficially? In this way, we can't backport anything. I am surprised that this PR is considered to be excluded specifically in 2.3.0. Yeah. This PR is not ready to merge yet. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20567 > I thought this is another step. We need to make them consistent first. Based on the comments from @icexelloss , I do not think we should blindly switch back to the original version. At least, provide an option to the end users. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20567 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87355/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20567 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20567 **[Test build #87355 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87355/testReport)** for PR 20567 at commit [`42dec46`](https://github.com/apache/spark/commit/42dec467df4c332cedf474623243db7c929881d7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20567 The feedback is partially from @rxin Maybe he can provide more inputs later. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20567 > This issue does not cause the regression since spark.sql.execution.arrow.enabled is off by default. It doesn't block the release but we can still backport it because it fixes an actual bug fix with a minimal change whether 2.3.0 is released or not. > We need to make it configurable before merging it I thought this is another step. We need to make them consistent first. > Merging this to 2.3.0 might cause the regression and impacts the release date of SPARK 2.3 Is there any specific worry from this change, that might shake the 2.3.0 release speficially? In this way, we can't backport anything. I am surprised that this PR is considered to be excluded specifically in 2.3.0. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20567 This issue does not cause the regression since `spark.sql.execution.arrow.enabled` is off by default. We need to make it configurable before merging it. Merging this to 2.3.0 might cause the regression and impacts the release date of SPARK 2.3. Thus, we would suggest to delay merging it until 2.3.0 is out. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20567 **[Test build #87355 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87355/testReport)** for PR 20567 at commit [`42dec46`](https://github.com/apache/spark/commit/42dec467df4c332cedf474623243db7c929881d7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20567 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20567 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/827/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20567 > My proposal is to merge the fix after the 2.3 release. We can still backport it to SPARK 2.3, but it will be available in SPARK 2.3.1. Mind if I ask to elaborate why? Want to know why this one is specially excluded in 2.3.0 alone although it can be backported to branch-2.3. I thought it's good to add it into 2.3.0 because this this is kind of safe, fixes a actual bug and matches the behaviour with `createDataFrame` too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20567 My proposal is to merge the fix after the 2.3 release. We can still backport it to SPARK 2.3, but it will be available in SPARK 2.3.1. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org