Andrew Sherman created HIVE-17677:
-------------------------------------

             Summary: Investigate using hive statistics information to optimize 
HoS parallel order by
                 Key: HIVE-17677
                 URL: https://issues.apache.org/jira/browse/HIVE-17677
             Project: Hive
          Issue Type: Improvement
    Affects Versions: 3.0.0
            Reporter: Andrew Sherman
            Assignee: Andrew Sherman


I think Spark's native parallel order by works in a similar way to what we do 
for Hive-on-MR.  That is, it scans the RDD once and sample the data to 
determine what ranges the data should be partitioned into, and then scans the 
RDD again to do the actual order by (with multiple reducers). 

One optimization suggested by [~stakiar] is that if we have column stats about 
the col we are ordering by, then the first scan on the RDD is not necessary. If 
we have histogram data about the RDD, we already know what the ranges of the 
order by should be. This should work when running parallel order by on simple 
tables, will be harder when we run it on derived datasets (although not 
impossible). 

To do his we would have to understand more about the internals of JavaPairRDD. 




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to