[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...

2018-02-12 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167685604
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1941,12 +1941,24 @@ def toPandas(self):
             timezone = None
 
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
+            should_fall_back = False
             try:
-                from pyspark.sql.types import _check_dataframe_convert_date, \
-                    _check_dataframe_localize_timestamps
+                from pyspark.sql.types import to_arrow_schema
                 from pyspark.sql.utils import require_minimum_pyarrow_version
-                import pyarrow
                 require_minimum_pyarrow_version()
+                # Check if its schema is convertible in Arrow format.
+                to_arrow_schema(self.schema)
+            except Exception as e:
+                # Fallback to convert to Pandas DataFrame without arrow if raise some exception
+                should_fall_back = True
+                warnings.warn(
+                    "Arrow will not be used in toPandas: %s" % _exception_message(e))
+
+            if not should_fall_back:
+                import pyarrow
+                from pyspark.sql.types import _check_dataframe_convert_date, \
+                    _check_dataframe_localize_timestamps
+
                 tables = self._collectAsArrow()
--- End diff --

Please see https://github.com/apache/spark/pull/20567#issuecomment-364846363. @ueshin raised a similar concern.


---




[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...

2018-02-12 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167644725
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1941,12 +1941,24 @@ def toPandas(self):
             timezone = None
 
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
+            should_fall_back = False
             try:
-                from pyspark.sql.types import _check_dataframe_convert_date, \
-                    _check_dataframe_localize_timestamps
+                from pyspark.sql.types import to_arrow_schema
                 from pyspark.sql.utils import require_minimum_pyarrow_version
-                import pyarrow
                 require_minimum_pyarrow_version()
+                # Check if its schema is convertible in Arrow format.
+                to_arrow_schema(self.schema)
+            except Exception as e:
+                # Fallback to convert to Pandas DataFrame without arrow if raise some exception
+                should_fall_back = True
+                warnings.warn(
+                    "Arrow will not be used in toPandas: %s" % _exception_message(e))
+
+            if not should_fall_back:
+                import pyarrow
+                from pyspark.sql.types import _check_dataframe_convert_date, \
+                    _check_dataframe_localize_timestamps
+
                 tables = self._collectAsArrow()
--- End diff --

for example, what if an executor doesn't have pyarrow installed?


---




[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...

2018-02-12 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167635216
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1941,12 +1941,24 @@ def toPandas(self):
             timezone = None
 
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
+            should_fall_back = False
             try:
-                from pyspark.sql.types import _check_dataframe_convert_date, \
-                    _check_dataframe_localize_timestamps
+                from pyspark.sql.types import to_arrow_schema
                 from pyspark.sql.utils import require_minimum_pyarrow_version
-                import pyarrow
                 require_minimum_pyarrow_version()
+                # Check if its schema is convertible in Arrow format.
+                to_arrow_schema(self.schema)
+            except Exception as e:
+                # Fallback to convert to Pandas DataFrame without arrow if raise some exception
+                should_fall_back = True
+                warnings.warn(
+                    "Arrow will not be used in toPandas: %s" % _exception_message(e))
+
+            if not should_fall_back:
+                import pyarrow
+                from pyspark.sql.types import _check_dataframe_convert_date, \
+                    _check_dataframe_localize_timestamps
+
                 tables = self._collectAsArrow()
--- End diff --

shouldn't this be in the `try` block?
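
(For illustration, a minimal standalone sketch of the alternative this question points at — the names below are stand-ins, not the actual patch: if the Arrow-side imports sit inside the `try`, a missing pyarrow on the driver triggers the same fallback as an unsupported schema instead of escaping to the caller.)

```python
import warnings

def collect_with_optional_arrow(collect_plain, collect_arrow):
    # Illustrative only: every Arrow-related step, including the import,
    # lives inside the try, so an ImportError falls back like any other
    # failure rather than propagating.
    try:
        import pyarrow  # noqa: F401
        return collect_arrow()
    except Exception as e:
        warnings.warn("Arrow will not be used: %s" % e)
        return collect_plain()

def failing_arrow_path():
    raise RuntimeError("simulated Arrow failure")

print(collect_with_optional_arrow(lambda: [{"a": 1}], failing_arrow_path))
# UserWarning: Arrow will not be used: simulated Arrow failure
# [{'a': 1}]
```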


---




[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...

2018-02-11 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167469258
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -48,12 +49,12 @@
 else:
     import unittest
 
+from pyspark.util import _exception_message
--- End diff --

nit: add an empty line between this import and the `_pandas_requirement_message` line.


---




[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...

2018-02-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167423077
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1941,12 +1941,24 @@ def toPandas(self):
             timezone = None
 
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
+            should_fall_back = False
             try:
-                from pyspark.sql.types import _check_dataframe_convert_date, \
-                    _check_dataframe_localize_timestamps
+                from pyspark.sql.types import to_arrow_schema
                 from pyspark.sql.utils import require_minimum_pyarrow_version
-                import pyarrow
                 require_minimum_pyarrow_version()
+                # Check if its schema is convertible in Arrow format.
+                to_arrow_schema(self.schema)
+            except Exception as e:
+                # Fallback to convert to Pandas DataFrame without arrow if raise some exception
--- End diff --

Yup. It does fall back for an unsupported schema, a PyArrow version mismatch, and PyArrow missing. Will add a note in the PR description.
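
(A minimal standalone sketch of the three fallback triggers just listed — simplified stand-ins, not the pyspark source:)

```python
import warnings

def arrow_checks_pass(schema_is_supported):
    # Simplified stand-ins for the three driver-side checks:
    # pyarrow installed, minimum version satisfied, schema convertible.
    try:
        import pyarrow  # noqa: F401  # 1) missing pyarrow -> ImportError
        # 2) a version check would go here, as require_minimum_pyarrow_version()
        #    does in pyspark.sql.utils
        if not schema_is_supported:  # 3) stand-in for to_arrow_schema(self.schema)
            raise TypeError("Unsupported type in conversion to Arrow")
        return True
    except Exception as e:
        warnings.warn("Arrow will not be used in toPandas: %s" % e)
        return False

print(arrow_checks_pass(True))   # True when pyarrow is installed
print(arrow_checks_pass(False))  # False, with a UserWarning
```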


---




[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...

2018-02-10 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167415761
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1941,12 +1941,24 @@ def toPandas(self):
             timezone = None
 
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
+            should_fall_back = False
             try:
-                from pyspark.sql.types import _check_dataframe_convert_date, \
-                    _check_dataframe_localize_timestamps
+                from pyspark.sql.types import to_arrow_schema
                 from pyspark.sql.utils import require_minimum_pyarrow_version
-                import pyarrow
                 require_minimum_pyarrow_version()
+                # Check if its schema is convertible in Arrow format.
+                to_arrow_schema(self.schema)
+            except Exception as e:
+                # Fallback to convert to Pandas DataFrame without arrow if raise some exception
--- End diff --

Does this PR fall back to the original path if any exception occurs, e.g. when an `ImportError` happens, whereas the current code throws an exception with an explanatory message?
Would it be good to note this change, too?


---




[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...

2018-02-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167394432
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1941,12 +1941,24 @@ def toPandas(self):
             timezone = None
 
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
+            should_fall_back = False
--- End diff --

Here is the main change.


---




[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...

2018-02-10 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167394424
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1955,38 +1967,34 @@ def toPandas(self):
                     return _check_dataframe_localize_timestamps(pdf, timezone)
                 else:
                     return pd.DataFrame.from_records([], columns=self.columns)
-            except ImportError as e:
-                msg = "note: pyarrow must be installed and available on calling Python process " \
-                      "if using spark.sql.execution.arrow.enabled=true"
-                raise ImportError("%s\n%s" % (_exception_message(e), msg))
-        else:
-            pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
 
-            dtype = {}
+        pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
--- End diff --

The actual diff here is just the `else:` that was removed; the rest is the indentation fix that removal requires.
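
(A self-contained sketch of the control-flow change being described — illustrative names only, not the actual `toPandas` code: because the Arrow branch returns early or falls back, the plain-collect code no longer needs an `else:` and can be dedented to the top level.)

```python
def to_pandas_sketch(rows, arrow_enabled, arrow_works):
    if arrow_enabled:
        try:
            if not arrow_works:
                raise ValueError("unsupported schema")
            return ("arrow", rows)  # Arrow path returns early on success
        except ValueError:
            pass  # fall through to the plain path below
    # Previously under an `else:`; dedented so that both "Arrow disabled"
    # and "Arrow fell back" reach the same plain-collect code.
    return ("plain", rows)

print(to_pandas_sketch([1, 2], arrow_enabled=True, arrow_works=True))   # ('arrow', [1, 2])
print(to_pandas_sketch([1, 2], arrow_enabled=True, arrow_works=False))  # ('plain', [1, 2])
```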


---




[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...

2018-02-10 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/20567

[SPARK-23380][PYTHON] Make toPandas fall back to Arrow optimization 
disabled when schema is not supported in Arrow optimization

## What changes were proposed in this pull request?

This PR proposes to fall back to the non-Arrow conversion path when the schema is not supported by the Arrow optimisation.

```python
df = spark.createDataFrame([[{'a': 1}]])

spark.conf.set("spark.sql.execution.arrow.enabled", "false")
df.toPandas()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df.toPandas()
```

**Before**

```
...
py4j.protocol.Py4JJavaError: An error occurred while calling o42.collectAsArrowToPython.
...
java.lang.UnsupportedOperationException: Unsupported data type: map
```

**After**

```
...
  _1
0  {u'a': 1}

... UserWarning: Arrow will not be used in toPandas: Unsupported type in conversion to Arrow: MapType(StringType,LongType,true)
...
  _1
0  {u'a': 1}
```

Note that, in the case of `createDataFrame`, we already fall back so that this at least works, even though the optimisation ends up disabled:

```python
df = spark.createDataFrame([[{'a': 1}]])
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
pdf = df.toPandas()
spark.createDataFrame(pdf).show()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.createDataFrame(pdf).show()
```

```
...
... UserWarning: Arrow will not be used in createDataFrame: Error inferring Arrow type ...
+--------+
|      _1|
+--------+
|[a -> 1]|
+--------+
```


## How was this patch tested?

Manually tested and unit tests were added in `python/pyspark/sql/tests.py`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark pandas_conversion_cleanup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20567.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20567


commit d87547c05c0ab874dfce8e6ddca4ee454926b664
Author: hyukjinkwon 
Date:   2018-02-09T03:40:41Z

toPandas conversion cleanup




---
