Hi Josh,

The workaround we figured out for the network latency and out-of-memory
problems with the `toPandas` method was to create pandas DataFrames or NumPy
arrays with `mapPartitions` for each partition. Maybe a standard solution along
this line of thought could be built. The integration is quite tedious ;)
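To illustrate, a minimal sketch of that per-partition conversion (the function name is illustrative, not anything from Spark itself): each partition's iterator of rows is turned into a single pandas DataFrame on the executor, so the driver never materialises one huge list of Row objects.

```python
import pandas as pd

def partition_to_pandas(iterator):
    # `iterator` yields row-like mappings; with real pyspark Rows you
    # would call row.asDict() first -- plain dicts stand in for them here.
    rows = list(iterator)
    if not rows:
        return iter([])          # empty partition -> no frame
    # Emit exactly one DataFrame per partition.
    return iter([pd.DataFrame(rows)])

# Against a real Spark DataFrame `df` this would be applied roughly as:
#   frames = df.rdd.mapPartitions(partition_to_pandas).collect()
#   local = pd.concat(frames, ignore_index=True)
```

Collecting one DataFrame per partition instead of individual Rows is what avoids the slow Python-side iteration, at the cost of the manual wiring above.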

I hope this helps.

Regards,
Mark

> On 22 Mar 2016, at 13:40, Josh Levy-Kramer <j...@starcount.com> wrote:
> 
> Hi,
> 
> A common pattern in my work is querying large tables in Spark DataFrames and
> then needing to do more detailed analysis locally when the data can fit into
> memory. However, I've hit a few blockers. In Scala no well-developed
> DataFrame library exists, and in Python the `toPandas` function is very slow.
> As Pandas is one of the best DataFrame libraries out there, it may be worth
> spending some time making the `toPandas` method more efficient.
> 
> Having a quick look at the code, it looks like a lot of iteration is occurring
> on the Python side. Python is really slow at iterating over large loops and
> this should be avoided. If iteration does have to occur, it's best done in
> Cython. Has anyone looked at Cythonising the process? Or, even better,
> serialising directly to NumPy arrays instead of the intermediate lists of
> Rows.
> 
> Here are some links to the current code:
> 
> topandas: 
> https://github.com/apache/spark/blob/8e0b030606927741f91317660cd14a8a5ed6e5f9/python/pyspark/sql/dataframe.py#L1342
> 
> collect: 
> https://github.com/apache/spark/blob/8e0b030606927741f91317660cd14a8a5ed6e5f9/python/pyspark/sql/dataframe.py#L233
> 
> _load_from_socket: 
> https://github.com/apache/spark/blob/a60f91284ceee64de13f04559ec19c13a820a133/python/pyspark/rdd.py#L123
> 
> Josh Levy-Kramer
> Data Scientist @ Starcount
