Hi Piyush, If you have rough understanding of NDV and row count you can set them manually via alter table command below: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_perf_stats.html#perf_column_stats_manual
There are efforts to add sampled statistics support to Impala which should help in your case. Thanks Mostafa On Thu, Jan 4, 2018 at 10:26 AM, Piyush Narang <[email protected]> wrote: > Seems like my attempt to compute full stats for this table failed as well. > Like Mostafa pointed out the bulk of the overhead was indeed in the select > ndv(c1), … query. The query ends up spending over 5 hours there. > Unfortunately, it seems to fail on the Hive metastore update. Digging into > that with some folks on our end. > > > > -- Piyush > > > > > > *From: *Piyush Narang <[email protected]> > *Reply-To: *"[email protected]" <[email protected]> > *Date: *Wednesday, January 3, 2018 at 2:49 PM > > *To: *"[email protected]" <[email protected]> > *Subject: *Re: Issues running compute incremental stats in Impala - alter > not possible > > > > Thanks for pointing this out. Kicked off a run of this. Shall get back > with breakdowns and how it goes. > > > > -- Piyush > > > > > > *From: *Mostafa Mokhtar <[email protected]> > *Reply-To: *"[email protected]" <[email protected]> > *Date: *Wednesday, January 3, 2018 at 2:13 PM > *To: *"[email protected]" <[email protected]> > *Subject: *Re: Issues running compute incremental stats in Impala - alter > not possible > > > > When you looked at the Web-UI while compute stats was running what did you > find? > > > > In general compute stats has 3 phases > > 1. Get row count per partition using something like select count(*), > partition_id from foo group by partition_id > 2. SELECT NDV(C1), Max size (C1), Avg size (C1), ... NDV(CN), Max size > (CN), Avg size (CN) from foo > 3. Persist the captured stats in Hive Meta store > > > > For steps #1 and #2 you should be able to get a good read on progress > using the "Scan Progress" column. > > Step 3 if you tail /var/log/catalogd/catalogd.INFO you should see the > progress there. > > > > And if the table is large in terms of on disk size I expect steps #2 to > dominate the time. > > > > If the operation is CPU not IO bound increasing mt_dop should give you > good speedup, yet I don't recommend a value greater than 16. > > > > > > On Wed, Jan 3, 2018 at 11:03 AM, Piyush Narang <[email protected]> > wrote: > > Thanks Alex and Mostafa. I tried running compute stats full a couple of > weeks back on this table and it was still going 4 hours later (and I didn’t > see any progress indication on the Impala web UI). I’ll try and hunt > through the Hive metastore log files and see if I can find anything. > > > > Is there something specific you’re looking for in the show create table > output? I can dump that here (rather than the full table definition and > details which is pretty verbose and I might need to check if it’s ok to > share externally). > > > > -- Piyush > > > > > > *From: *Mostafa Mokhtar <[email protected]> > *Reply-To: *"[email protected]" <[email protected]> > *Date: *Wednesday, January 3, 2018 at 1:48 PM > *To: *"[email protected]" <[email protected]> > *Subject: *Re: Issues running compute incremental stats in Impala - alter > not possible > > > > Also check the Hive Metastore log files. > > > > In general if the table has a large number of partitions incremental stats > will have very large overhead in terms of metadata. > > > > I would recommend running "compute stats bi_ full" then manually set the > row count for newly added partitions whenever possible. > > > > On Wed, Jan 3, 2018 at 10:36 AM, Alexander Behm <[email protected]> > wrote: > > Thanks for the report. I have not seen this issue. Looks like the alter > RPC is rejected by the Hive Metastore. Maybe looking into the > Hive/Metastore logs would help. > > > > The SHOW CREATE TABLE output might also help us debug. > > > > On Wed, Jan 3, 2018 at 10:28 AM, Piyush Narang <[email protected]> > wrote: > > Hi folks, > > > > I’m running into some issues when I try to compute incremental stats in > Impala that I was hoping someone would be able to help with. I’m able to > ‘compute stats’ in Impala on my smaller tables just fine. When I try > computing stats incrementally for one of my larger tables, I seem to be > running into this error: > > > compute incremental stats bi_ full partition (param1=0,day='2017-10-04', > hour=00,host_platform='EU'); > > Query: compute incremental stats bi_full partition > (param1=0,day='2017-10-04',hour=00,host_platform='EU') > > WARNINGS: ImpalaRuntimeException: Error making 'alter_partitions' RPC to > Hive Metastore: > > CAUSED BY: InvalidOperationException: alter is not possible > > > > Looking at impalad.INFO and catalogd.INFO I don’t see any additional > details. I verified that I’m the owner of the tables in HDFS. > > > > Has anyone run into this issue in the past? Any workarounds? > > > > Thanks, > > > > -- Piyush > > > > > > > > >
