Hi Piyush,

If you have rough understanding of NDV and row count you can set them
manually via alter table command below:
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_stats.html#perf_column_stats_manual

There are efforts to add sampled statistics support to Impala which should
help in your case.

Thanks
Mostafa


On Thu, Jan 4, 2018 at 10:26 AM, Piyush Narang <[email protected]> wrote:

> Seems like my attempt to compute full stats for this table failed as well.
> Like Mostafa pointed out the bulk of the overhead was indeed in the select
> ndv(c1), … query. The query ends up spending over 5 hours there.
> Unfortunately, it seems to fail on the Hive metastore update. Digging into
> that with some folks on our end.
>
>
>
> -- Piyush
>
>
>
>
>
> *From: *Piyush Narang <[email protected]>
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Wednesday, January 3, 2018 at 2:49 PM
>
> *To: *"[email protected]" <[email protected]>
> *Subject: *Re: Issues running compute incremental stats in Impala - alter
> not possible
>
>
>
> Thanks for pointing this out. Kicked off a run of this. Shall get back
> with breakdowns and how it goes.
>
>
>
> -- Piyush
>
>
>
>
>
> *From: *Mostafa Mokhtar <[email protected]>
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Wednesday, January 3, 2018 at 2:13 PM
> *To: *"[email protected]" <[email protected]>
> *Subject: *Re: Issues running compute incremental stats in Impala - alter
> not possible
>
>
>
> When you looked at the Web-UI while compute stats was running what did you
> find?
>
>
>
> In general compute stats has 3 phases
>
>    1. Get row count per partition using something like select count(*),
>    partition_id from foo group by partition_id
>    2. SELECT NDV(C1), Max size (C1), Avg size (C1), ... NDV(CN), Max size
>    (CN), Avg size (CN) from foo
>    3. Persist the captured stats in Hive Meta store
>
>
>
> For steps #1 and #2 you should be able to get a good read on progress
> using the "Scan Progress" column.
>
> Step 3 if you tail /var/log/catalogd/catalogd.INFO you should see the
> progress there.
>
>
>
> And if the table is large in terms of on disk size I expect steps #2 to
> dominate the time.
>
>
>
> If the operation is CPU not IO bound increasing mt_dop should give you
> good speedup, yet I don't recommend a value greater than 16.
>
>
>
>
>
> On Wed, Jan 3, 2018 at 11:03 AM, Piyush Narang <[email protected]>
> wrote:
>
> Thanks Alex and Mostafa. I tried running compute stats full a couple of
> weeks back on this table and it was still going 4 hours later (and I didn’t
> see any progress indication on the Impala web UI). I’ll try and hunt
> through the Hive metastore log files and see if I can find anything.
>
>
>
> Is there something specific you’re looking for in the show create table
> output? I can dump that here (rather than the full table definition and
> details which is pretty verbose and I might need to check if it’s ok to
> share externally).
>
>
>
> -- Piyush
>
>
>
>
>
> *From: *Mostafa Mokhtar <[email protected]>
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Wednesday, January 3, 2018 at 1:48 PM
> *To: *"[email protected]" <[email protected]>
> *Subject: *Re: Issues running compute incremental stats in Impala - alter
> not possible
>
>
>
> Also check the Hive Metastore log files.
>
>
>
> In general if the table has a large number of partitions incremental stats
> will have very large overhead in terms of metadata.
>
>
>
> I would recommend running "compute stats bi_ full" then manually set the
> row count for newly added partitions whenever possible.
>
>
>
> On Wed, Jan 3, 2018 at 10:36 AM, Alexander Behm <[email protected]>
> wrote:
>
> Thanks for the report. I have not seen this issue. Looks like the alter
> RPC is rejected by the Hive Metastore. Maybe looking into the
> Hive/Metastore logs would help.
>
>
>
> The SHOW CREATE TABLE output might also help us debug.
>
>
>
> On Wed, Jan 3, 2018 at 10:28 AM, Piyush Narang <[email protected]>
> wrote:
>
> Hi folks,
>
>
>
> I’m running into some issues when I try to compute incremental stats in
> Impala that I was hoping someone would be able to help with. I’m able to
> ‘compute stats’ in Impala on my smaller tables just fine. When I try
> computing stats incrementally for one of my larger tables, I seem to be
> running into this error:
>
> > compute incremental stats bi_ full partition (param1=0,day='2017-10-04',
> hour=00,host_platform='EU');
>
> Query: compute incremental stats bi_full partition
> (param1=0,day='2017-10-04',hour=00,host_platform='EU')
>
> WARNINGS: ImpalaRuntimeException: Error making 'alter_partitions' RPC to
> Hive Metastore:
>
> CAUSED BY: InvalidOperationException: alter is not possible
>
>
>
> Looking at impalad.INFO and catalogd.INFO I don’t see any additional
> details. I verified that I’m the owner of the tables in HDFS.
>
>
>
> Has anyone run into this issue in the past? Any workarounds?
>
>
>
> Thanks,
>
>
>
> -- Piyush
>
>
>
>
>
>
>
>
>

Reply via email to