[ 
https://issues.apache.org/jira/browse/HIVE-20141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar reassigned HIVE-20141:
-----------------------------------


> Turn hive.spark.use.groupby.shuffle off by default
> --------------------------------------------------
>
>                 Key: HIVE-20141
>                 URL: https://issues.apache.org/jira/browse/HIVE-20141
>             Project: Hive
>          Issue Type: Task
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>
> [~xuefuz] any thoughts on this? I think turning it off would provide better 
> out-of-the-box behavior for Hive-on-Spark (HoS) users, especially those 
> migrating from Hive-on-MR. What has your experience with this config been?
> I've profiled performance with this config turned on vs. off, and for 
> TPC-DS queries it doesn't make a significant difference. The main 
> difference I can see is that when a Spark stage has to spill to disk, 
> {{repartitionAndSortWithinPartitions}} spills more data than 
> {{groupByKey}}. My guess is that this happens because {{groupByKey}} stores 
> everything in Spark's {{ExternalAppendOnlyMap}}, which keeps only a single 
> copy of each key for potentially multiple values, whereas 
> {{repartitionAndSortWithinPartitions}} uses Spark's {{ExternalSorter}}, 
> which sorts all the (K, V) pairs and thus doesn't de-duplicate keys, so 
> more data is spilled to disk.
> My understanding is that using {{repartitionAndSortWithinPartitions}} for 
> Hive GROUP BYs is similar to what Hive-on-MR does, so disabling this config 
> would provide a similar experience to Hive-on-MR. Furthermore, last I 
> checked, {{groupByKey}} still can't spill within a row group.
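
The guess above about spill size can be illustrated with a toy model. This is only a sketch in plain Python, not Spark's actual ExternalAppendOnlyMap or ExternalSorter; the function names are hypothetical, and it counts stored items rather than serialized bytes. It assumes the only difference between the two layouts is whether the key is stored once per group or once per record.

```python
from collections import defaultdict


def grouped_footprint(pairs):
    """Items stored when values are grouped under one copy of each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # One key per group, plus every value.
    return len(groups) + sum(len(values) for values in groups.values())


def sorted_footprint(pairs):
    """Items stored when every (key, value) record is kept and sorted by key."""
    records = sorted(pairs, key=lambda kv: kv[0])
    # Each record carries its own copy of the key.
    return 2 * len(records)


# Hypothetical input: two distinct keys spread over five records.
pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]

print(grouped_footprint(pairs))
print(sorted_footprint(pairs))
```

In this model the gap grows with the number of records per key, which would be consistent with the sort-based path spilling more data when keys repeat often.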



