[jira] [Commented] (ARROW-8731) Error when using toPandas with PyArrow

2020-05-07 Thread Andrew Redd (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102012#comment-17102012
 ] 

Andrew Redd commented on ARROW-8731:


Perfect, thank you for the help!

> Error when using toPandas with PyArrow
> --------------------------------------
>
> Key: ARROW-8731
> URL: https://issues.apache.org/jira/browse/ARROW-8731
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: Python Environment on the worker and driver
> - jupyter==1.0.0
> - pandas==1.0.3
> - pyarrow==0.14.0
> - pyspark==2.4.0
> - py4j==0.10.7
>Reporter: Andrew Redd
>Priority: Blocker
>
> I'm getting the following error when calling toPandas on a Spark DataFrame. I
> imagine my pyspark and pyarrow versions are clashing somehow, but I haven't
> found anyone else reporting this same issue online.
>  * This is a blocker to our use of pyarrow on a project.
>  
> {code:java}
> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> <ipython-input-…> in <module>
> ----> 1 df.limit(100).toPandas()
>
> /venv/lib/python3.6/site-packages/pyspark/sql/dataframe.py in toPandas(self)
>    2119                 _check_dataframe_localize_timestamps
>    2120                 import pyarrow
> -> 2121                 batches = self._collectAsArrow()
>    2122                 if len(batches) > 0:
>    2123                     table = pyarrow.Table.from_batches(batches)
>
> /venv/lib/python3.6/site-packages/pyspark/sql/dataframe.py in _collectAsArrow(self)
>    2177         with SCCallSiteSync(self._sc) as css:
>    2178             sock_info = self._jdf.collectAsArrowToPython()
> -> 2179         return list(_load_from_socket(sock_info, ArrowStreamSerializer()))
>    2180
>    2181
>
> /venv/lib/python3.6/site-packages/pyspark/rdd.py in _load_from_socket(sock_info, serializer)
>     142
>     143 def _load_from_socket(sock_info, serializer):
> --> 144     (sockfile, sock) = local_connect_and_auth(*sock_info)
>     145     # The RDD materialization time is unpredicable, if we set a timeout for socket reading
>     146     # operation, it will very possibly fail. See SPARK-18281.
>
> TypeError: local_connect_and_auth() takes 2 positional arguments but 3 were given
> {code}
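For readers skimming the traceback: the final failure is a plain Python arity mismatch, not an Arrow serialization error. A minimal sketch (the values and the stand-in helper below are hypothetical; the real `sock_info` comes back from the JVM via `collectAsArrowToPython`) that reproduces the same TypeError:

```python
# Sketch of the arity mismatch behind the traceback above: the JVM side
# hands back a 3-element sock_info tuple, but this stand-in helper
# (mirroring the 2-argument signature in the installed pyspark's rdd.py)
# only accepts two positional arguments, so *-unpacking fails identically.
def local_connect_and_auth(port, auth_secret):
    # Stand-in only; the real pyspark helper opens an authenticated socket.
    return (port, auth_secret)

sock_info = (55555, "auth-secret", "server-handle")  # hypothetical 3-tuple

try:
    local_connect_and_auth(*sock_info)
except TypeError as exc:
    print(exc)  # ...takes 2 positional arguments but 3 were given
```

This is why mismatched pyspark/pyarrow (and Spark JVM) versions surface as a TypeError deep inside `_load_from_socket`.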



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8731) Error when using toPandas with PyArrow

2020-05-07 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101844#comment-17101844
 ] 

Bryan Cutler commented on ARROW-8731:
-------------------------------------

[~are...@wayfair.com] you should be able to use a newer version of pyarrow with 
pyspark 2.4.4 by following the instructions here: 
https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#compatibiliy-setting-for-pyarrow--0150-and-spark-23x-24x
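For reference, the workaround those instructions describe is a single environment variable. A minimal sketch, assuming Spark 2.3.x/2.4.x with pyarrow >= 0.15.0; it must be visible to both the driver and the executors (e.g. set in conf/spark-env.sh):

```shell
# Tell pyarrow >= 0.15.0 to fall back to the pre-0.15 IPC stream format
# that Spark 2.3.x/2.4.x expects (per the linked Spark docs).
export ARROW_PRE_0_15_IPC_FORMAT=1
```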



[jira] [Commented] (ARROW-8731) Error when using toPandas with PyArrow

2020-05-07 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101823#comment-17101823
 ] 

Wes McKinney commented on ARROW-8731:
-------------------------------------

cc [~bryanc] -- I think you need to set an environment variable



[jira] [Commented] (ARROW-8731) Error when using toPandas with PyArrow

2020-05-07 Thread Andrew Redd (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101807#comment-17101807
 ] 

Andrew Redd commented on ARROW-8731:


This was resolved when I moved to pyspark 2.4.4 and pyarrow 0.8.0.

I also needed to make sure I had pandas < 1.0.
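The pandas constraint above can be checked up front rather than discovered at collect time. A minimal sketch (the `pandas_is_compatible` helper is hypothetical, not part of pyspark) of a fail-fast guard for the pandas < 1.0 requirement:

```python
# Hypothetical guard for the reporter's working combination: that pyspark
# setup needed pandas < 1.0, so reject anything 1.0 or newer up front.
def pandas_is_compatible(version: str) -> bool:
    major = int(version.split(".")[0])
    return major < 1

print(pandas_is_compatible("0.25.3"))  # True  (pre-1.0, as required)
print(pandas_is_compatible("1.0.3"))   # False (the pin in the failing environment)
```

In a real session the version string would come from `pandas.__version__`.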
