Hi Team, TL;DR; I am wondering if there is a way to instruct Impala to use shuffle by default for all join queries as my research didn't end anywhere so far.
We have a multi PiB cluster with hundreds of thousand of partitions. We are using Impala 1.7 with HDFS. Due to our cluster size, compute_stats, and compute_incremental_stats are not feasible for us as compute_stats seems a heavy operation on a lot of our large tables and destabilizes the cluster, and with compute_incremental_stats we hit IMPALA-2648 <https://issues.apache.org/jira/browse/IMPALA-2648>. Therefore, to optimize our queries we need to add [shuffle] hint to the queries with joins, and we have seen that this improves performance 3x on simple tests because the system doesn't have to stream too much data and dump it for broadcast join. We have a large team of analysts who are pushing tons of queries to the system. It is hard to enforce policy at the moment for them to remember to use shuffle hint so it doesn't take our system down. -- Cheers, -Arya
