Any response or pointers to understand how Cluster By in sub queries can
affect the performance/speed of outer queries is helpful.

Thanks
Kay


On Mon, Oct 28, 2013 at 1:17 PM, KayVajj <vajjalak...@gmail.com> wrote:

> Hi,
>
> I have a question if I could use the cluster by clause in a sub query to
> improve the performance of a group by query in hive
>
> Lets I have a Table A with columns (all strings) col1..col5 and the table
> is not "Clustered"
>
> now I 'm trying to run the below query
>
> select
>> col1,
>> col2,
>> col3,
>> col4,
>> concat_ws(',', collect_set(col5))
>> from A
>> group by
>> col1,
>> col2,
>> col3,
>> col4
>
>
>
> Would the below query optimize the above query and if not what is the best
> practice to optimize this query. Assuming only col1 & col2 are the uniquely
> identifying columns
>
>
>
>
> select
>> ct.col1,
>> ct.col2,
>> ct.col3,
>> ct.col4,
>> concat_ws(',', collect_set(ct.col5))
>> from
>> (
>> select
>> col1,
>> col2,
>> col3,
>> col4,
>> col5
>> from A
>> cluster by col1, col2
>> ) ct
>> group by
>> ct.col1,
>> ct.col2,
>> ct.col3,
>> ct.col4.
>
>
> Thanks for your responses.
>
>

Reply via email to