Btw, you should also know that the following improvements in the upcoming
2.12 release might make "compute stats" more palatable on your huge tables.
We'd love your feedback on COMPUTE STATS with TABLESAMPLE, in particular.
COMPUTE STATS with TABLESAMPLE
COMPUTE STATS on a subset of columns
The following improvement should allow you to COMPUTE STATS less frequently
by extrapolating the row count of partitions that were added or modified
since the last COMPUTE STATS.
In general, would be great to get your feedback/ideas on how to make
computing stats more practical for you.
On Fri, Feb 23, 2018 at 8:23 PM, Alexander Behm <alex.b...@cloudera.com>
> Maybe this improvement could help. It's available since Impala 2.9.
> On Fri, Feb 23, 2018 at 6:40 PM, Arya Goudarzi <gouda...@gmail.com> wrote:
>> Thank you Mostafa. My bad on mentioning the wrong version. We are using
>> 2.7 and not 1.7. We have upgrade in our plans and actually waiting for
>> Impala 2.12 as it has IMPALA-5058 fixes.
>> On Fri, Feb 23, 2018 at 6:18 PM, Mostafa Mokhtar <mmokh...@cloudera.com>
>>> AFAIK there is no such flag.
>>> You are more likely to get much higher gains if you upgrade to a more
>>> recent version of Impala.
>>> On Feb 23, 2018, at 6:12 PM, Arya Goudarzi <gouda...@gmail.com> wrote:
>>> Hi Team,
>>> TL;DR; I am wondering if there is a way to instruct Impala to use
>>> shuffle by default for all join queries as my research didn't end anywhere
>>> so far.
>>> We have a multi PiB cluster with hundreds of thousand of partitions. We
>>> are using Impala 1.7 with HDFS. Due to our cluster size, compute_stats, and
>>> compute_incremental_stats are not feasible for us as compute_stats seems a
>>> heavy operation on a lot of our large tables and destabilizes the cluster,
>>> and with compute_incremental_stats we hit IMPALA-2648
>>> Therefore, to optimize our queries we need to add [shuffle] hint to the
>>> queries with joins, and we have seen that this improves performance 3x on
>>> simple tests because the system doesn't have to stream too much data and
>>> dump it for broadcast join.
>>> We have a large team of analysts who are pushing tons of queries to the
>>> system. It is hard to enforce policy at the moment for them to remember to
>>> use shuffle hint so it doesn't take our system down.