[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20567
  
I just opened another PR for adding a configuration - 
https://github.com/apache/spark/pull/20678. Let me close this one.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
Thanks! Happy Lunar New Year!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20567
  
I just opened https://github.com/apache/spark/pull/20625. I believe this is 
the smallest and simplest change .. 

Will turn this PR to add a configuration later.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20567
  
Yup, I will. Sorry for delaying it. I was trying to make the fix small as 
possible as I can. Let me just open it as a simplest way.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
@HyukjinKwon Will you submit a fix for the binary type today? We are very 
close to RC4. This is kind of urgent if we still want to block it in the Spark 
2.3.0 release. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20567
  
Sure.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-15 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20567
  
^ this change LGTM. Can we make a PR for this change only and leave the 
fallback part for Spark 2.4?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20567
  
> The binary type bug sounds like a blocker, can we just fix it surgically 
by checking the supported data types before going to the arrow optimization 
path? For now we can stick with that the current behavior is, i.e. throw 
exception.

That's basically 
(https://github.com/apache/spark/pull/20567#issuecomment-365064243): 

```python
if # 'spark.sql.execution.arrow.enabled' true?
require_minimum_pyarrow_version()
try:
to_arrow_schema(self.schema)
# return the one with Arrow
except Exception as e:
raise Exception("'spark.sql.execution.arrow.enabled' blah blah ...")
else:
# return the one without Arrow
```

because `to_arrow_schema(self.schema)` checks the supported types like 
other Pandas/Arrow functionalities.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-15 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20567
  
The binary type bug sounds like a blocker, can we just fix it surgically by 
checking the supported data types before going to the arrow optimization path? 
For now we can stick with that the current behavior is, i.e. throw exception.

The inconsistent behavior between `toPandas` and `createDataFrame` is 
confusing but may not be a blocker. We can fix it in Spark 2.4 and add a note 
in the migration guide.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20567
  
The root cause is Arrow conversion in Python side interprets binaries as 
`str`, and I here avoided this by checking if the type is what we supported or 
not.

This is the most trivial fix. I made a fix safe and small as possible as I 
can here. I can fix the error message only but the size of change and diff is 
virtually the same - 
https://github.com/apache/spark/pull/20567#issuecomment-365064243.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
What is the root cause? Do we have a trivial fix to resolve/block it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20567
  
There is one more thing - 
https://github.com/apache/spark/pull/20567#issuecomment-364639922 We didn't 
complete binary type support yet in Python side but there is a hole here ..


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
The behavior inconsistency between `toPandas` and `createDataFrame` looks 
confusing to end users, I have to admit. 

In the current stage, we are unable to merge the fix for these new features 
to Spark 2.3 branch. Let us wait for the release of Spark 2.3.0  


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20567
  
I mean the actual change here is small. The diff maybe looks larger here 
because of removed `else`. Please check out the diff. It's quite a safe change.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
Then, let us wait for the release of Spark 2.3.0. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20567
  
Just FYI, except option 3., the complexity in other options and the PR size 
will be all similar - 
https://github.com/apache/spark/pull/20567#issuecomment-364806378 and 
https://github.com/apache/spark/pull/20567#issuecomment-365064243


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-14 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
We are unable to contain option 3 in Spark 2.3.0. This is too big to merge 
it in the current stage. We still can do it in 2.3.1. 

If needed, I am fine to throw a better error message if the PR size is very 
small; otherwise, keep it unchanged in 2.3.0.

Also cc @liancheng @yhuai 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20567
  
@gatorsmile and @rxin,

The problem here is that `toPandas` just fails on unsupported types later 
and allows `BinaryType` with inconsistent conversion 
(https://github.com/apache/spark/pull/20567#issuecomment-364639922) in Arrow 
whereas `createDataFrame` allows fallback in both cases.

This is the last one left (for now) about PySpark/Pandas interoperability 
which I found during testing out and I was thinking about targeting 2.3.0.

So, for clarification, would you be uncomfortable with one of:

1. matching both toPandas and createDataFrame to fallback with a warning
2. matching both toPandas and createDataFrame to throw an exception
3. adding a configuration to control the fallback for both

to target 2.3.0 (or 2.3.1 if the vote fails)? FYI, the current one in this 
PR is 1.

If so, let me have two PRs, one for the error message for now to target 
2.3.0 (or 2.3.1 if the vote fails), and one for adding a configuration to 
control the fallback to target master (and maybe 2.3.1).

Does that make sense to both of you?

cc @cloud-fan too.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20567
  
RC3 is out. This change could be in 2.3.1 f the vote passes, or in 2.3.0 If 
the vote fails. For the reason above, we can't backport and change anything in 
the main codes until the release 2.3.0.

So, you are worried of delaying the release more because it has been 
delayed pretty much already? I understand this but I would like to ask to get 
this (whether it throws an exception for both `toPandas` and `createDataFrame` 
or fallback for both) in to make the new feature out with consistency please.

@rxin, do you think we should take this out in 2.3.0 too? Was this your 
opinion (https://github.com/apache/spark/pull/20567#issuecomment-365099515)?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
RC3 is out. Just to avoid new regressions that might be introduced in the 
new PR. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20567
  
^ I am not saying that we should merge it now. I can do the opposite for 
`createDataFrame` given 
https://github.com/apache/spark/pull/20567#issuecomment-365100434 . My point is 
why it should be exclueded in 2.3.0 specifically while this can be considered 
as a backport - 
https://github.com/apache/spark/pull/20567#issuecomment-365077913



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
> Is there any specific worry from this change, that might shake the 2.3.0 
release speficially? In this way, we can't backport anything. I am surprised 
that this PR is considered to be excluded specifically in 2.3.0.

Yeah. This PR is not ready to merge yet.  


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
> I thought this is another step. We need to make them consistent first.

Based on the comments from @icexelloss , I do not think we should blindly 
switch back to the original version. At least, provide an option to the end 
users. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20567
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87355/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20567
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20567
  
**[Test build #87355 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87355/testReport)**
 for PR 20567 at commit 
[`42dec46`](https://github.com/apache/spark/commit/42dec467df4c332cedf474623243db7c929881d7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
The feedback is partially from @rxin Maybe he can provide more inputs 
later. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20567
  
> This issue does not cause the regression since 
spark.sql.execution.arrow.enabled is off by default. 

It doesn't block the release but we can still backport it because it fixes 
an actual bug fix with a minimal change whether 2.3.0 is released or not.

> We need to make it configurable before merging it

I thought this is another step. We need to make them consistent first.

>  Merging this to 2.3.0 might cause the regression and impacts the release 
date of SPARK 2.3

Is there any specific worry from this change, that might shake the 2.3.0 
release speficially? In this way, we can't backport anything. I am surprised 
that this PR is considered to be excluded specifically in 2.3.0.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
This issue does not cause the regression since 
`spark.sql.execution.arrow.enabled` is off by default. We need to make it 
configurable before merging it. Merging this to 2.3.0 might cause the 
regression and impacts the release date of SPARK 2.3. Thus, we would suggest to 
delay merging it until 2.3.0 is out. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20567
  
**[Test build #87355 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87355/testReport)**
 for PR 20567 at commit 
[`42dec46`](https://github.com/apache/spark/commit/42dec467df4c332cedf474623243db7c929881d7).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20567
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20567
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/827/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20567
  
> My proposal is to merge the fix after the 2.3 release. We can still 
backport it to SPARK 2.3, but it will be available in SPARK 2.3.1.

Mind if I ask to elaborate why? Want to know why this one is specially 
excluded in 2.3.0 alone although it can be backported to branch-2.3.

I thought it's good to add it into 2.3.0 because this this is kind of safe, 
fixes a actual bug and matches the behaviour with `createDataFrame` too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20567: [SPARK-23380][PYTHON] Make toPandas fallback to non-Arro...

2018-02-12 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20567
  
My proposal is to merge the fix after the 2.3 release. We can still 
backport it to SPARK 2.3, but it will be available in SPARK 2.3.1.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org