Maybe this improvement could help. It has been available since Impala 2.9: https://issues.apache.org/jira/browse/IMPALA-5381
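
For reference, a minimal sketch of how that might look in an impala-shell session. It assumes IMPALA-5381 is the change that introduced the DEFAULT_JOIN_DISTRIBUTION_MODE query option, which is worth confirming against the docs for your release. As I understand it, the option only affects joins where table or column stats are missing, which matches the situation described below. The table and column names (events, users, user_id, country) are hypothetical.

```sql
-- Assumption: DEFAULT_JOIN_DISTRIBUTION_MODE is the option added by IMPALA-5381 (Impala 2.9+).
-- SET applies to the current session only; the option is consulted when stats are missing.
SET DEFAULT_JOIN_DISTRIBUTION_MODE=SHUFFLE;

-- Hypothetical tables. With the option set, this unhinted join should be planned
-- as a partitioned (shuffle) join when stats are unavailable.
SELECT u.country, COUNT(*) AS event_count
FROM events e
JOIN users u ON e.user_id = u.user_id
GROUP BY u.country;

-- The per-query hint discussed in the thread, which also works on older releases such as 2.7.
SELECT u.country, COUNT(*) AS event_count
FROM events e
JOIN [SHUFFLE] users u ON e.user_id = u.user_id
GROUP BY u.country;
```

Running EXPLAIN on the first query should show whether the join is planned as PARTITIONED or BROADCAST. To make this the default for every analyst rather than per session, one possibility is the impalad --default_query_options startup flag, though I have not verified that setup on a cluster of this size.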
On Fri, Feb 23, 2018 at 6:40 PM, Arya Goudarzi <gouda...@gmail.com> wrote:

> Thank you Mostafa. My bad on mentioning the wrong version. We are using
> 2.7 and not 1.7. We have an upgrade in our plans and are actually waiting
> for Impala 2.12 as it has IMPALA-5058 fixes.
>
> On Fri, Feb 23, 2018 at 6:18 PM, Mostafa Mokhtar <mmokh...@cloudera.com>
> wrote:
>
>> AFAIK there is no such flag.
>> You are more likely to get much higher gains if you upgrade to a more
>> recent version of Impala.
>>
>> https://www.slideshare.net/cloudera/performance-of-apache-impala
>>
>> Thanks
>> Mostafa
>>
>> On Feb 23, 2018, at 6:12 PM, Arya Goudarzi <gouda...@gmail.com> wrote:
>>
>> Hi Team,
>>
>> TL;DR: I am wondering if there is a way to instruct Impala to use
>> shuffle by default for all join queries, as my research hasn't led
>> anywhere so far.
>>
>> We have a multi-PiB cluster with hundreds of thousands of partitions. We
>> are using Impala 1.7 with HDFS. Due to our cluster size, compute_stats and
>> compute_incremental_stats are not feasible for us: compute_stats is a
>> heavy operation on a lot of our large tables and destabilizes the cluster,
>> and with compute_incremental_stats we hit IMPALA-2648
>> <https://issues.apache.org/jira/browse/IMPALA-2648>.
>>
>> Therefore, to optimize our queries we need to add the [shuffle] hint to
>> queries with joins. We have seen that this improves performance 3x on
>> simple tests because the system doesn't have to stream too much data and
>> dump it for a broadcast join.
>>
>> We have a large team of analysts who are pushing tons of queries to the
>> system. It is hard at the moment to enforce a policy that they remember
>> to use the shuffle hint so it doesn't take our system down.
>>
>> --
>> Cheers,
>> -Arya
>
> --
> Cheers,
> -Arya