[jira] [Commented] (ARROW-8731) Error when using toPandas with PyArrow
[ https://issues.apache.org/jira/browse/ARROW-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102012#comment-17102012 ]

Andrew Redd commented on ARROW-8731:
------------------------------------

Perfect, thank you for the help!

> Error when using toPandas with PyArrow
> --------------------------------------
>
>                 Key: ARROW-8731
>                 URL: https://issues.apache.org/jira/browse/ARROW-8731
>             Project: Apache Arrow
>          Issue Type: Bug
>         Environment: Python Environment on the worker and driver
>                      - jupyter==1.0.0
>                      - pandas==1.0.3
>                      - pyarrow==0.14.0
>                      - pyspark==2.4.0
>                      - py4j==0.10.7
>            Reporter: Andrew Redd
>            Priority: Blocker
>
> I'm getting the following error when calling toPandas on a Spark dataframe. I imagine my pyspark and pyarrow versions are clashing somehow, but I haven't found this same issue reported by anyone else online.
> * This is a blocker to our use of pyarrow on a project
>
> {code:java}
> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> in <module>
> ----> 1 df.limit(100).toPandas()
>
> /venv/lib/python3.6/site-packages/pyspark/sql/dataframe.py in toPandas(self)
>    2119             _check_dataframe_localize_timestamps
>    2120             import pyarrow
> -> 2121             batches = self._collectAsArrow()
>    2122             if len(batches) > 0:
>    2123                 table = pyarrow.Table.from_batches(batches)
>
> /venv/lib/python3.6/site-packages/pyspark/sql/dataframe.py in _collectAsArrow(self)
>    2177         with SCCallSiteSync(self._sc) as css:
>    2178             sock_info = self._jdf.collectAsArrowToPython()
> -> 2179         return list(_load_from_socket(sock_info, ArrowStreamSerializer()))
>    2180
>    2181
>
> /venv/lib/python3.6/site-packages/pyspark/rdd.py in _load_from_socket(sock_info, serializer)
>     142
>     143 def _load_from_socket(sock_info, serializer):
> --> 144     (sockfile, sock) = local_connect_and_auth(*sock_info)
>     145     # The RDD materialization time is unpredicable, if we set a timeout for socket reading
>     146     # operation, it will very possibly fail. See SPARK-18281.
>
> TypeError: local_connect_and_auth() takes 2 positional arguments but 3 were given
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (ARROW-8731) Error when using toPandas with PyArrow
[ https://issues.apache.org/jira/browse/ARROW-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101844#comment-17101844 ]

Bryan Cutler commented on ARROW-8731:
-------------------------------------

[~are...@wayfair.com] you should be able to use a newer version of pyarrow with pyspark 2.4.4 by following the instructions here: https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#compatibiliy-setting-for-pyarrow--0150-and-spark-23x-24x
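The "compatibility setting" behind that link is an environment variable that makes newer pyarrow emit the legacy Arrow IPC stream format that Spark 2.3.x/2.4.x expect. A minimal sketch, assuming pyarrow >= 0.15.0 installed alongside Spark 2.4.x (the variable must be visible to both the driver and the executors, e.g. via `conf/spark-env.sh`):

```shell
# conf/spark-env.sh (driver and workers) -- tell pyarrow >= 0.15.0 to
# write the pre-0.15 Arrow IPC format so Spark 2.3.x/2.4.x can read it.
export ARROW_PRE_0_15_IPC_FORMAT=1
```

Setting it only on the driver is not enough: `toPandas` and pandas UDFs exchange Arrow streams with the workers, so each side must agree on the format.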
[jira] [Commented] (ARROW-8731) Error when using toPandas with PyArrow
[ https://issues.apache.org/jira/browse/ARROW-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101823#comment-17101823 ]

Wes McKinney commented on ARROW-8731:
-------------------------------------

cc [~bryanc] -- I think you need to set an environment variable
[jira] [Commented] (ARROW-8731) Error when using toPandas with PyArrow
[ https://issues.apache.org/jira/browse/ARROW-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101807#comment-17101807 ]

Andrew Redd commented on ARROW-8731:
------------------------------------

This was resolved when I moved to pyspark 2.4.4 and pyarrow 0.8.0. I needed to make sure I had pandas < 1.0.
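For anyone else hitting this `TypeError`, the downgrade path reported above can be reproduced with a pin like the following. The exact versions are the ones stated in this thread, not a general recommendation; pinning pandas below 1.0 matters because pyspark 2.4.x's `toPandas` path relies on APIs removed in pandas 1.0:

```shell
# Install a mutually compatible set, as reported by the issue author.
pip install "pyspark==2.4.4" "pyarrow==0.8.0" "pandas<1.0"
```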