R created SPARK-19503:
-------------------------

             Summary: Dumb Execution Plan
                 Key: SPARK-19503
                 URL: https://issues.apache.org/jira/browse/SPARK-19503
             Project: Spark
          Issue Type: Bug
          Components: Optimizer
    Affects Versions: 2.1.0
         Environment: Perhaps only a pyspark or databricks AWS issue
            Reporter: R
            Priority: Minor
df.sort(...).count() performs a shuffle and sort and then the count. This is wasteful, since the sort is not required for a count, and it makes me wonder how smart the algebraic optimiser really is. The data may be partitioned with a known row count (such as Parquet files), and we should not shuffle just to perform a count. This may look trivial, but if the optimiser fails to recognise it, I wonder what else it is missing, especially in more complex operations.
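As a rough illustration of the rewrite being asked for (this is a toy sketch, not Spark's actual Catalyst rule or API; the plan node classes and rule name are invented for the example), an optimiser could drop a Sort node whose only consumer is a global Count, because row order cannot affect the count:

```python
# Toy logical-plan sketch: Count(Sort(x)) can be rewritten to Count(x),
# because a global count is order-independent. Node classes are invented
# for illustration and do not correspond to Spark's internal classes.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Scan:
    source: str


@dataclass(frozen=True)
class Sort:
    child: object
    keys: tuple = field(default_factory=tuple)


@dataclass(frozen=True)
class Count:
    child: object


def eliminate_sort_under_count(plan):
    """Rewrite Count(Sort(x)) -> Count(x); recurse into children otherwise."""
    if isinstance(plan, Count) and isinstance(plan.child, Sort):
        # The sort's output order is unobservable through a global count,
        # so the Sort (and its shuffle) can be removed entirely.
        return Count(eliminate_sort_under_count(plan.child.child))
    if isinstance(plan, Count):
        return Count(eliminate_sort_under_count(plan.child))
    if isinstance(plan, Sort):
        return Sort(eliminate_sort_under_count(plan.child), plan.keys)
    return plan


plan = Count(Sort(Scan("events.parquet"), keys=("ts",)))
optimized = eliminate_sort_under_count(plan)
# optimized is Count(Scan("events.parquet")) -- the Sort is gone
```

The same idea generalises: any order-destroying operator sitting directly under an order-insensitive aggregate is a candidate for elimination.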