[ https://issues.apache.org/jira/browse/HIVE-20141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sahil Takiar reassigned HIVE-20141:
-----------------------------------

> Turn hive.spark.use.groupby.shuffle off by default
> --------------------------------------------------
>
>                 Key: HIVE-20141
>                 URL: https://issues.apache.org/jira/browse/HIVE-20141
>             Project: Hive
>          Issue Type: Task
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>
> [~xuefuz] any thoughts on this? I think it would provide better out of the
> box behavior for Hive-on-Spark users, especially for users who are migrating
> from Hive-on-MR to HoS. Wondering what your experience with this config has
> been?
>
> I've done a bunch of performance profiling with this config turned on vs.
> off, and for TPC-DS queries it doesn't make a significant difference. The
> main difference I can see is that when a Spark stage has to spill to disk,
> {{repartitionAndSortWithinPartitions}} spills more data to disk than
> {{groupByKey}} - my guess is that this happens because {{groupByKey}} stores
> everything in Spark's {{ExternalAppendOnlyMap}} (which only stores a single
> copy of the key for potentially multiple values) whereas
> {{repartitionAndSortWithinPartitions}} uses Spark's {{ExternalSorter}} which
> sorts all the K, V pairs (and thus doesn't de-duplicate keys, which results
> in more data being spilled to disk).
>
> My understanding is that using {{repartitionAndSortWithinPartitions}} for
> Hive GROUP BYs is similar to what Hive-on-MR does. So disabling this config
> would provide a similar experience to HoMR. Furthermore, last I checked,
> {{groupByKey}} still can't spill within a row group.
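
To make the spill-size guess above concrete, here is a toy model in plain Python. It is only a sketch and does not use Spark: it counts stored objects rather than bytes, and the function names (grouped_footprint, sorted_footprint) and the sample input are hypothetical, chosen for illustration. It assumes a hash-grouped buffer keeps each distinct key once, while a sort-based buffer keeps the key alongside every value, which is the behavior the description attributes to ExternalAppendOnlyMap and ExternalSorter respectively.

```python
from collections import defaultdict


def grouped_footprint(pairs):
    """Hash-grouping model: each distinct key is held once, with a list of its values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    keys_stored = len(groups)
    values_stored = sum(len(values) for values in groups.values())
    return keys_stored + values_stored


def sorted_footprint(pairs):
    """Sort-based model: every (key, value) record is kept, so the key repeats per record."""
    records = sorted(pairs, key=lambda kv: kv[0])
    return 2 * len(records)  # one key object plus one value object per record


# Hypothetical input: 3 distinct keys with 4 values each.
pairs = [(k, v) for k in ("a", "b", "c") for v in range(4)]

print("grouped:", grouped_footprint(pairs))
print("sorted: ", sorted_footprint(pairs))
```

Under this model the gap grows with the number of values per key, which would fit the observation that the difference mostly shows up when a stage has to spill.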