Any response or pointers to understand how Cluster By in sub queries can affect the performance/speed of outer queries is helpful.
Thanks Kay On Mon, Oct 28, 2013 at 1:17 PM, KayVajj <vajjalak...@gmail.com> wrote: > Hi, > > I have a question if I could use the cluster by clause in a sub query to > improve the performance of a group by query in hive > > Lets I have a Table A with columns (all strings) col1..col5 and the table > is not "Clustered" > > now I 'm trying to run the below query > > select >> col1, >> col2, >> col3, >> col4, >> concat_ws(',', collect_set(col5)) >> from A >> group by >> col1, >> col2, >> col3, >> col4 > > > > Would the below query optimize the above query and if not what is the best > practice to optimize this query. Assuming only col1 & col2 are the uniquely > identifying columns > > > > > select >> ct.col1, >> ct.col2, >> ct.col3, >> ct.col4, >> concat_ws(',', collect_set(ct.col5)) >> from >> ( >> select >> col1, >> col2, >> col3, >> col4, >> col5 >> from A >> cluster by col1, col2 >> ) ct >> group by >> ct.col1, >> ct.col2, >> ct.col3, >> ct.col4. > > > Thanks for your responses. > >