[ 
https://issues.apache.org/jira/browse/SPARK-24650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24650.
----------------------------------
    Resolution: Incomplete

> GroupingSet
> -----------
>
>                 Key: SPARK-24650
>                 URL: https://issues.apache.org/jira/browse/SPARK-24650
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>         Environment: CDH 5.X, Spark 2.3
>            Reporter: Mihir Sahu
>            Priority: Major
>              Labels: Grouping, Sets, bulk-closed
>
> If grouping sets are used in Spark SQL, the resulting plan does not perform 
> optimally.
> If the input to a grouping sets query has x rows and there are y grouping 
> sets, the number of rows processed is currently x*y.
> Example: let a DataFrame have columns col1, col2, col3 and col4, and let the 
> number of rows be rowNo.
> Let the grouping sets be: (1) col1, col2, col3 (2) col2, col4 (3) col1, col2
> The volume processed in this case is 3 * (rowNo * size of each row).
> But is this the optimal way of processing the data?
> If the grouping sets are derivable from each other, can we reduce the 
> volume processed by removing columns as we progress to the lower dimensions 
> of processing?
> Currently, when computing percentiles, a lot of data seems to be 
> processed, causing performance issues.
> We need to look into whether this can be optimised.
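
The x*y growth described above can be illustrated with a small sketch. This is plain Python, not Spark code: the `expand` function is a hypothetical stand-in for the row replication that a grouping sets plan performs, and the sample rows are made up. It assumes each input row is emitted once per grouping set, with the columns outside that set nulled out. The `grouping_id` here is simply the index of the set, which is a simplification.

```python
def expand(rows, columns, grouping_sets):
    """Emit one copy of each row per grouping set (hypothetical helper).

    Columns that are not part of the grouping set are replaced with None,
    and the index of the grouping set is attached as "grouping_id".
    """
    out = []
    for row in rows:
        for gid, gset in enumerate(grouping_sets):
            projected = {c: (row[c] if c in gset else None) for c in columns}
            projected["grouping_id"] = gid
            out.append(projected)
    return out


columns = ["col1", "col2", "col3", "col4"]
# The three grouping sets from the example in the description.
grouping_sets = [("col1", "col2", "col3"), ("col2", "col4"), ("col1", "col2")]
# Made-up sample data, for illustration only.
rows = [
    {"col1": "a", "col2": "b", "col3": "c", "col4": 1},
    {"col1": "a", "col2": "b", "col3": "d", "col4": 2},
]

expanded = expand(rows, columns, grouping_sets)
print(len(rows), "input rows ->", len(expanded), "expanded rows")
# prints: 2 input rows -> 6 expanded rows
```

With x input rows and y grouping sets, the aggregation that follows has to consume x*y rows, which matches the 3 * rowNo figure in the example.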



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
