TL;DR; I am wondering if there is a way to instruct Impala to use shuffle
by default for all join queries as my research didn't end anywhere so far.
We have a multi PiB cluster with hundreds of thousand of partitions. We are
using Impala 1.7 with HDFS. Due to our cluster size, compute_stats, and
compute_incremental_stats are not feasible for us as compute_stats seems a
heavy operation on a lot of our large tables and destabilizes the cluster,
and with compute_incremental_stats we hit IMPALA-2648
Therefore, to optimize our queries we need to add [shuffle] hint to the
queries with joins, and we have seen that this improves performance 3x on
simple tests because the system doesn't have to stream too much data and
dump it for broadcast join.
We have a large team of analysts who are pushing tons of queries to the
system. It is hard to enforce policy at the moment for them to remember to
use shuffle hint so it doesn't take our system down.