Bryan Cutler created SPARK-23030:
------------------------------------

             Summary: Decrease memory consumption with toPandas() collection using Arrow
                 Key: SPARK-23030
                 URL: https://issues.apache.org/jira/browse/SPARK-23030
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark, SQL
    Affects Versions: 2.3.0
            Reporter: Bryan Cutler

Currently, with Arrow enabled, calling {{toPandas()}} collects all partitions in the JVM as batches in the Arrow file format. Once collected in the JVM, the batches are served to the Python driver process.

I believe using the Arrow stream format could optimize this and reduce memory consumption in the JVM by loading only one record batch at a time before sending it to Python. This might also reduce the latency between making the initial call in Python and receiving the first batch of records.
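
To illustrate the memory argument above, here is a minimal toy model in plain Python. It is not Spark's or Arrow's actual code: `make_batches`, `serve_collected`, and `serve_streamed` are hypothetical names, and each "batch" is just a byte string standing in for one serialized record batch. The sketch only contrasts buffering every batch before serving with serving one batch at a time.

```python
from typing import Callable, Iterable, Iterator, List

Batch = bytes  # stand-in for one serialized record batch (assumption)


def make_batches(num_batches: int, batch_size: int) -> Iterator[Batch]:
    """Lazily produce fake batches; a hypothetical stand-in for partition data."""
    for i in range(num_batches):
        yield bytes([i % 256]) * batch_size


def serve_collected(batches: Iterable[Batch], send: Callable[[Batch], None]) -> int:
    """Buffer every batch first, then serve them (models the current behavior).

    Returns the peak number of bytes buffered on the serving side in this model.
    """
    buffered: List[Batch] = list(batches)  # everything is materialized here
    peak = sum(len(b) for b in buffered)
    for batch in buffered:
        send(batch)
    return peak


def serve_streamed(batches: Iterable[Batch], send: Callable[[Batch], None]) -> int:
    """Serve each batch as soon as it is available (models the proposal).

    Returns the peak number of bytes buffered on the serving side in this model.
    """
    peak = 0
    for batch in batches:  # only the current batch is held by the server
        peak = max(peak, len(batch))
        send(batch)
    return peak


# The receiving side (the Python driver in the issue) still ends up with all
# of the data; only the serving side's buffering differs between the two.
received_collected: List[Batch] = []
received_streamed: List[Batch] = []

peak_collected = serve_collected(make_batches(4, 1024), received_collected.append)
peak_streamed = serve_streamed(make_batches(4, 1024), received_streamed.append)

print("peak bytes buffered (collect all):", peak_collected)
print("peak bytes buffered (one at a time):", peak_streamed)
```

In this model the receiver gets identical data either way, while the serving side's peak buffer shrinks from the sum of all batches to the size of the largest single batch. The streamed variant also calls `send` for the first batch before later batches are produced, which is the latency benefit suggested above.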