[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167685604

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1941,12 +1941,24 @@ def toPandas(self):
             timezone = None
 
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
+            should_fall_back = False
             try:
-                from pyspark.sql.types import _check_dataframe_convert_date, \
-                    _check_dataframe_localize_timestamps
+                from pyspark.sql.types import to_arrow_schema
                 from pyspark.sql.utils import require_minimum_pyarrow_version
-                import pyarrow
                 require_minimum_pyarrow_version()
+                # Check if its schema is convertible in Arrow format.
+                to_arrow_schema(self.schema)
+            except Exception as e:
+                # Fallback to convert to Pandas DataFrame without arrow if raise some exception
+                should_fall_back = True
+                warnings.warn(
+                    "Arrow will not be used in toPandas: %s" % _exception_message(e))
+
+            if not should_fall_back:
+                import pyarrow
+                from pyspark.sql.types import _check_dataframe_convert_date, \
+                    _check_dataframe_localize_timestamps
+
                 tables = self._collectAsArrow()
--- End diff --

Please see https://github.com/apache/spark/pull/20567#issuecomment-364846363. @ueshin raised a similar concern.

---
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167644725

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1941,12 +1941,24 @@ def toPandas(self):
             timezone = None
 
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
+            should_fall_back = False
             try:
-                from pyspark.sql.types import _check_dataframe_convert_date, \
-                    _check_dataframe_localize_timestamps
+                from pyspark.sql.types import to_arrow_schema
                 from pyspark.sql.utils import require_minimum_pyarrow_version
-                import pyarrow
                 require_minimum_pyarrow_version()
+                # Check if its schema is convertible in Arrow format.
+                to_arrow_schema(self.schema)
+            except Exception as e:
+                # Fallback to convert to Pandas DataFrame without arrow if raise some exception
+                should_fall_back = True
+                warnings.warn(
+                    "Arrow will not be used in toPandas: %s" % _exception_message(e))
+
+            if not should_fall_back:
+                import pyarrow
+                from pyspark.sql.types import _check_dataframe_convert_date, \
+                    _check_dataframe_localize_timestamps
+
                 tables = self._collectAsArrow()
--- End diff --

For example, what if an executor doesn't have pyarrow installed?
[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167635216

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1941,12 +1941,24 @@ def toPandas(self):
             timezone = None
 
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
+            should_fall_back = False
             try:
-                from pyspark.sql.types import _check_dataframe_convert_date, \
-                    _check_dataframe_localize_timestamps
+                from pyspark.sql.types import to_arrow_schema
                 from pyspark.sql.utils import require_minimum_pyarrow_version
-                import pyarrow
                 require_minimum_pyarrow_version()
+                # Check if its schema is convertible in Arrow format.
+                to_arrow_schema(self.schema)
+            except Exception as e:
+                # Fallback to convert to Pandas DataFrame without arrow if raise some exception
+                should_fall_back = True
+                warnings.warn(
+                    "Arrow will not be used in toPandas: %s" % _exception_message(e))
+
+            if not should_fall_back:
+                import pyarrow
+                from pyspark.sql.types import _check_dataframe_convert_date, \
+                    _check_dataframe_localize_timestamps
+
                 tables = self._collectAsArrow()
--- End diff --

Shouldn't this be in the `try` block?
[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167469258

--- Diff: python/pyspark/sql/tests.py ---
@@ -48,12 +49,12 @@
 else:
     import unittest
 
+from pyspark.util import _exception_message
--- End diff --

nit: add an empty line between this import and the `_pandas_requirement_message` line.
[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167423077

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1941,12 +1941,24 @@ def toPandas(self):
             timezone = None
 
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
+            should_fall_back = False
             try:
-                from pyspark.sql.types import _check_dataframe_convert_date, \
-                    _check_dataframe_localize_timestamps
+                from pyspark.sql.types import to_arrow_schema
                 from pyspark.sql.utils import require_minimum_pyarrow_version
-                import pyarrow
                 require_minimum_pyarrow_version()
+                # Check if its schema is convertible in Arrow format.
+                to_arrow_schema(self.schema)
+            except Exception as e:
+                # Fallback to convert to Pandas DataFrame without arrow if raise some exception
--- End diff --

Yup. It does fall back for an unsupported schema, a PyArrow version mismatch, and PyArrow missing. Will add a note in the PR description.
[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167415761

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1941,12 +1941,24 @@ def toPandas(self):
             timezone = None
 
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
+            should_fall_back = False
             try:
-                from pyspark.sql.types import _check_dataframe_convert_date, \
-                    _check_dataframe_localize_timestamps
+                from pyspark.sql.types import to_arrow_schema
                 from pyspark.sql.utils import require_minimum_pyarrow_version
-                import pyarrow
                 require_minimum_pyarrow_version()
+                # Check if its schema is convertible in Arrow format.
+                to_arrow_schema(self.schema)
+            except Exception as e:
+                # Fallback to convert to Pandas DataFrame without arrow if raise some exception
--- End diff --

Does this PR fall back to the original path if any exception occurs, e.g. when an `ImportError` happens, whereas the current code throws an exception with the message? Would it be good to note this change, too?
[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167394432

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1941,12 +1941,24 @@ def toPandas(self):
             timezone = None
 
         if self.sql_ctx.getConf("spark.sql.execution.arrow.enabled", "false").lower() == "true":
+            should_fall_back = False
--- End diff --

Here is the main change.
[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20567#discussion_r167394424

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -1955,38 +1967,34 @@ def toPandas(self):
                     return _check_dataframe_localize_timestamps(pdf, timezone)
                 else:
                     return pd.DataFrame.from_records([], columns=self.columns)
-            except ImportError as e:
-                msg = "note: pyarrow must be installed and available on calling Python process " \
-                      "if using spark.sql.execution.arrow.enabled=true"
-                raise ImportError("%s\n%s" % (_exception_message(e), msg))
-        else:
-            pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
-            dtype = {}
+        pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
--- End diff --

The actual diff here is just `else:`. It was removed, and that fixes the indentation.
[GitHub] spark pull request #20567: [SPARK-23380][PYTHON] Make toPandas fall back to ...
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/20567

[SPARK-23380][PYTHON] Make toPandas fall back to Arrow optimization disabled when schema is not supported in Arrow optimization

## What changes were proposed in this pull request?

This PR proposes to fall back to one without Arrow when schema is not supported in Arrow optimisation.

```python
df = spark.createDataFrame([[{'a': 1}]])

spark.conf.set("spark.sql.execution.arrow.enabled", "false")
df.toPandas()

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df.toPandas()
```

**Before**

```
...
py4j.protocol.Py4JJavaError: An error occurred while calling o42.collectAsArrowToPython.
...
java.lang.UnsupportedOperationException: Unsupported data type: map
```

**After**

```
...
          _1
0  {u'a': 1}
...
UserWarning: Arrow will not be used in toPandas: Unsupported type in conversion to Arrow: MapType(StringType,LongType,true)
...
          _1
0  {u'a': 1}
```

Note that, in case of `createDataFrame`, we already fall back to make this at least working even though the optimisation is disabled:

```python
df = spark.createDataFrame([[{'a': 1}]])

spark.conf.set("spark.sql.execution.arrow.enabled", "false")
pdf = df.toPandas()
spark.createDataFrame(pdf).show()

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.createDataFrame(pdf).show()
```

```
...
UserWarning: Arrow will not be used in createDataFrame: Error inferring Arrow type
...
+--------+
|      _1|
+--------+
|[a -> 1]|
+--------+
```

## How was this patch tested?

Manually tested and unit tests were added in `python/pyspark/sql/tests.py`.
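The fallback structure discussed in this thread can be sketched without Spark at all. Everything below (the `FakeDataFrame` class and its method names) is illustrative stand-in code, not the actual pyspark API; it only mirrors the shape of the patched `toPandas`: probe schema convertibility up front, warn and flip a flag on failure, and note that the Arrow collect itself sits outside the `try`.

```python
import warnings


class FakeDataFrame:
    """Hypothetical stand-in for a Spark DataFrame, for illustration only."""

    def __init__(self, arrow_convertible):
        self.arrow_convertible = arrow_convertible

    def _check_arrow_schema(self):
        # Plays the role of to_arrow_schema(self.schema): raises when the
        # schema contains a type Arrow cannot represent (e.g. MapType).
        if not self.arrow_convertible:
            raise TypeError(
                "Unsupported type in conversion to Arrow: "
                "MapType(StringType,LongType,true)")

    def _collect_as_arrow(self):
        return "arrow-path"

    def _collect_plain(self):
        return "plain-path"


def to_pandas_with_fallback(df, arrow_enabled=True):
    # Same shape as the patch: driver-side checks (imports, version,
    # schema) run inside the try; any failure sets the flag and emits a
    # warning instead of raising. The Arrow collect runs outside the try,
    # so failures during collection are NOT caught -- the point the
    # reviewers raise about executors missing pyarrow.
    if arrow_enabled:
        should_fall_back = False
        try:
            df._check_arrow_schema()
        except Exception as e:
            should_fall_back = True
            warnings.warn("Arrow will not be used in toPandas: %s" % e)
        if not should_fall_back:
            return df._collect_as_arrow()
    return df._collect_plain()
```

Under these assumptions, a convertible schema takes the Arrow path, while an unconvertible one warns once and silently takes the plain path instead of raising.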
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark pandas_conversion_cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20567.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20567

commit d87547c05c0ab874dfce8e6ddca4ee454926b664
Author: hyukjinkwon
Date: 2018-02-09T03:40:41Z

    toPandas conversion cleanup