Andrew Sherman created HIVE-17677: ------------------------------------- Summary: Investigate using hive statistics information to optimize HoS parallel order by Key: HIVE-17677 URL: https://issues.apache.org/jira/browse/HIVE-17677 Project: Hive Issue Type: Improvement Affects Versions: 3.0.0 Reporter: Andrew Sherman Assignee: Andrew Sherman
I think Spark's native parallel order by works in a similar way to what we do for Hive-on-MR. That is, it scans the RDD once and sample the data to determine what ranges the data should be partitioned into, and then scans the RDD again to do the actual order by (with multiple reducers). One optimization suggested by [~stakiar] is that if we have column stats about the col we are ordering by, then the first scan on the RDD is not necessary. If we have histogram data about the RDD, we already know what the ranges of the order by should be. This should work when running parallel order by on simple tables, will be harder when we run it on derived datasets (although not impossible). To do his we would have to understand more about the internals of JavaPairRDD. -- This message was sent by Atlassian JIRA (v6.4.14#64029)