R created SPARK-19503:
-------------------------

             Summary: Dumb Execution Plan
                 Key: SPARK-19503
                 URL: https://issues.apache.org/jira/browse/SPARK-19503
             Project: Spark
          Issue Type: Bug
          Components: Optimizer
    Affects Versions: 2.1.0
         Environment: Perhaps only a pyspark or databricks AWS issue
            Reporter: R
            Priority: Minor


df.sort(...).count()
performs a shuffle and a sort before counting. This is wasteful, since the sort
cannot change the row count and is not required here, and it makes me wonder how
smart the algebraic optimiser really is. The data may already be partitioned
with a known row count (for example, parquet files), in which case we should
not shuffle at all just to compute a count.
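The requested rewrite is just the algebraic identity count(sort(X)) = count(X). A minimal sketch of such a rule, using toy plan nodes (Scan, Sort, Count are hypothetical names for illustration, not Spark's actual Catalyst API):

```python
# Toy sketch of an optimizer rewrite rule: Count(Sort(x)) -> Count(x).
# Sorting cannot change the number of rows, so the Sort node can be dropped
# when its only consumer is a count-style aggregate.
from dataclasses import dataclass

@dataclass
class Scan:
    source: str

@dataclass
class Sort:
    child: object

@dataclass
class Count:
    child: object

def eliminate_sort_under_count(plan):
    """Rewrite Count(Sort(x)) into Count(x); leave other plans unchanged."""
    if isinstance(plan, Count) and isinstance(plan.child, Sort):
        return Count(plan.child.child)
    return plan

plan = Count(Sort(Scan("parquet_table")))
optimized = eliminate_sort_under_count(plan)
assert optimized == Count(Scan("parquet_table"))
```

In PySpark one can inspect whether the sort survives optimisation with `df.sort(...).explain()`, which prints the physical plan.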

This may look trivial, but if the optimiser fails to recognise this case, I
wonder what else it is missing, especially in more complex operations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
