[ https://issues.apache.org/jira/browse/SPARK-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-3095:
-----------------------------
    Target Version/s:   (was: 1.2.0)

> [PySpark] Speed up RDD.count()
> ------------------------------
>
>                 Key: SPARK-3095
>                 URL: https://issues.apache.org/jira/browse/SPARK-3095
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>            Priority: Minor
>
> RDD.count() can fall back to RDD._jrdd.count() when the RDD is not a
> PipelinedRDD.
> If the JavaRDD is serialized in batch mode, it is possible to skip
> deserializing every chunk except the last one, because all the other
> chunks contain the same number of elements. In some special cases the
> chunks are re-ordered, and then this shortcut will not work.
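The batch-mode shortcut described in the issue can be illustrated with a minimal sketch in plain Python. This is not the PySpark implementation: `serialize_in_batches` and `count_batched` are hypothetical helpers, `pickle` stands in for whatever batched serializer is in use, and the sketch assumes the chunks arrive in their original order so that only the last one can be short.

```python
import pickle
from itertools import islice


def serialize_in_batches(items, batch_size):
    """Pickle items in fixed-size batches; only the last batch may be short.

    Hypothetical stand-in for a batched serializer.
    """
    it = iter(items)
    chunks = []
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        chunks.append(pickle.dumps(batch))
    return chunks


def count_batched(chunks, batch_size):
    """Count elements while deserializing only the final chunk.

    Assumes chunks are in their original order, so every chunk before the
    last one holds exactly batch_size elements.
    """
    if not chunks:
        return 0
    last_size = len(pickle.loads(chunks[-1]))
    return (len(chunks) - 1) * batch_size + last_size


chunks = serialize_in_batches(range(10), batch_size=4)
fast_count = count_batched(chunks, batch_size=4)
```

If the chunks are re-ordered so that the short chunk is no longer last, `count_batched` overcounts, which is the special case the issue says the shortcut cannot handle.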