[ https://issues.apache.org/jira/browse/SPARK-25274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bryan Cutler resolved SPARK-25274. ---------------------------------- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22275 https://github.com/apache/spark/pull/22275 > Improve toPandas with Arrow by sending out-of-order record batches > ------------------------------------------------------------------ > > Key: SPARK-25274 > URL: https://issues.apache.org/jira/browse/SPARK-25274 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL > Affects Versions: 2.4.0 > Reporter: Bryan Cutler > Assignee: Bryan Cutler > Priority: Major > Fix For: 3.0.0 > > > When executing {{toPandas}} with Arrow enabled, partitions that arrive in the > JVM out-of-order must be buffered before they can be send to Python. This > causes an excess of memory to be used in the driver JVM and increases the > time it takes to complete because data must sit in the JVM waiting for > preceding partitions to come in. > This can be improved by sending out-of-order partitions to Python as soon as > they arrive in the JVM, followed by a list of indices so that Python can > assemble the data in the correct order. This way, data is not buffered at the > JVM and there is no waiting on particular partitions so performance will be > increased. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org