[ https://issues.apache.org/jira/browse/SPARK-24650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523351#comment-16523351 ]
Hyukjin Kwon commented on SPARK-24650:
--------------------------------------

Please avoid setting the priority to Blocker, which is usually reserved for committers.

> GroupingSet
> -----------
>
>                 Key: SPARK-24650
>                 URL: https://issues.apache.org/jira/browse/SPARK-24650
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>         Environment: CDH 5.X, Spark 2.3
>            Reporter: Mihir Sahu
>            Priority: Major
>              Labels: Grouping, Sets
>
> If a grouping set is used in Spark SQL, the plan does not perform
> optimally.
> If the input to a grouping set is x rows and the grouping set has y groups,
> then the number of rows currently processed is x*y.
> Example: let a DataFrame have columns col1, col2, col3 and col4, and let the
> number of rows be rowNo,
> and let the grouping sets consist of: (1) col1, col2, col3 (2) col2, col4
> (3) col1, col2
> The volume processed in such a case is 3 * (rowNo * size of each row).
> However, is this the optimal way of processing the data?
> If the y groups are derivable from each other, can we reduce the volume of
> data processed by removing columns as we progress to the lower dimensions
> of processing?
> Currently, while computing percentiles, a lot of data seems to be
> processed, causing performance issues.
> We need to check whether this can be optimised.
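
To make the x*y figure in the report concrete, here is a minimal sketch in plain Python (not Spark code) of the row expansion it describes. It assumes each input row is replicated once per grouping set, with the grouping columns outside that set nulled out. The `expand` helper, the `grouping_id` field and the sample values are hypothetical and only serve the illustration; the column names and the three grouping sets are taken from the example in the issue.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[int]]


def expand(rows: List[Row],
           grouping_sets: Sequence[Sequence[str]],
           group_cols: Sequence[str]) -> List[Row]:
    """Replicate every row once per grouping set (simplified model).

    Columns that are not part of the current grouping set are set to None,
    and a hypothetical 'grouping_id' records which set produced the copy.
    """
    out: List[Row] = []
    for row in rows:
        for gid, gset in enumerate(grouping_sets):
            copy: Row = {c: (row[c] if c in gset else None) for c in group_cols}
            copy["grouping_id"] = gid
            out.append(copy)
    return out


# Sample data (made up) with the four columns named in the issue.
rows = [
    {"col1": 1, "col2": 10, "col3": 100, "col4": 1000},
    {"col1": 2, "col2": 20, "col3": 200, "col4": 2000},
]
# The three grouping sets from the issue description.
grouping_sets = [("col1", "col2", "col3"), ("col2", "col4"), ("col1", "col2")]
group_cols = ["col1", "col2", "col3", "col4"]

expanded = expand(rows, grouping_sets, group_cols)
print(len(rows), "input rows ->", len(expanded), "rows after expansion")
```

In this model the expanded row count is always `len(rows) * len(grouping_sets)`, which is the 3 * rowNo figure from the report. Grouping set (3) is a subset of set (1), which is the kind of derivability the reporter suggests exploiting.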