[ https://issues.apache.org/jira/browse/IMPALA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768967#comment-17768967 ]
ASF subversion and git services commented on IMPALA-2201:
---------------------------------------------------------

Commit 45d6815821a29b83c7a3daa3d380a40e0e4f3836 in impala's branch refs/heads/master from Csaba Ringhofer
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=45d681582 ]

IMPALA-12462: Update only changed partitions after COMPUTE STATS

This is mainly a revert of https://gerrit.cloudera.org/#/c/640/
but some parts had to be updated due to changes in Impala.
See IMPALA-2201 for details about why this optimization was removed.

The patch can massively speed up the COMPUTE STATS statement when the
majority of partitions have no changes.
COMPUTE STATS tpcds_parquet.store_sales;
before: 12s
after:   1s

Besides the DDL speed-up, the number of HMS events generated is also
reduced.

Testing:
- added test to verify COMPUTE STATS output
- correctness of cases when something is modified should be covered
  by existing tests
- core tests passed

Change-Id: If2703e0790d5c25db98ed26f26f6d96281c366a3
Reviewed-on: http://gerrit.cloudera.org:8080/20505
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Reviewed-by: Wenzhe Zhou <wz...@cloudera.com>


> Compute [incremental] stats may not persist the stats if the data was loaded
> from Hive with hive.stats.autogather=true.
> ----------------------------------------------------------------------------
>
>                 Key: IMPALA-2201
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2201
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 2.2
>            Reporter: Alexander Behm
>            Assignee: Alexander Behm
>            Priority: Blocker
>              Labels: correctness, supportability, usability
>             Fix For: Impala 2.2.7, Impala 2.3.0
>
>
> *Symptoms of This Bug*
> - Stats have been computed, but the row count reverts back to -1 after an
>   INVALIDATE METADATA
> - A compute [incremental] stats appears to not set the row count
>
> Example scenario where this bug may happen:
> 1. A new partition with new data is loaded into a table via Hive
> 2. Hive has hive.stats.autogather=true
> 3. Stats on the new partition are computed in Impala with COMPUTE INCREMENTAL
>    STATS <partition>
> 4. At this point, SHOW TABLE STATS shows the correct row count
> 5. INVALIDATE METADATA is run on the table in Impala
> 6. The row count reverts back to -1 because the stats have not been persisted
>
> *Explanation for This Bug*
> Here is why the stats are reset to -1. When Hive's hive.stats.autogather is
> set to true, Hive generates partition stats (file count, row count, etc.)
> after creating the partition. If you then run "compute incremental stats" in
> Impala, you will get the same row count, so the following check will not be
> satisfied and StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK will not be set
> in Impala's CatalogOpExecutor.java:
> {code}
> ...
> // Update table stats
> if (existingRowCount == null || !existingRowCount.equals(newRowCount)) {
>   // The existing row count value wasn't set or has changed.
>   msPartition.putToParameters(StatsSetupConst.ROW_COUNT, newRowCount);
>   msPartition.putToParameters(StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK,
>       StatsSetupConst.TRUE);
>   updatedPartition = true;
> }
> ...
> {code}
> When executing the corresponding alterPartition() RPC in the Hive Metastore,
> the row count will be reset because the STATS_GENERATED_VIA_STATS_TASK
> parameter was not set.
> Snippet from Hive's MetaStoreUtils.java:
> {code}
> ...
> public static boolean updatePartitionStatsFast(
>     PartitionSpecProxy.PartitionIterator part, Warehouse wh,
>     boolean madeDir, boolean forceRecompute) throws MetaException {
>   ...
>   if(!params.containsKey(StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK)) {
>     // invalidate stats requiring scan since this is a regular ddl alter case
>     for (String stat : StatsSetupConst.statsRequireCompute) {
>       params.put(stat, "-1");
>     }
>     params.put(StatsSetupConst.COLUMN_STATS_ACCURATE, StatsSetupConst.FALSE);
>   }
>   ...
> {code}
> So if partition stats already exist but were not computed by Impala, COMPUTE
> INCREMENTAL STATS will cause the stats to be reset to -1.
> Note that in Hive versions after CDH 5.3 this bug no longer happens, because
> the updatePartitionStatsFast() function is no longer called in the Hive
> Metastore in the above workflow.
>
> *Workarounds*
> 1. Disable stats autogathering in Hive when loading the data
> {code}
> SET hive.stats.autogather=false;
> {code}
> 2. Manually alter the numRows to -1 before doing COMPUTE [INCREMENTAL] STATS
>    in Impala
> {code}
> ALTER TABLE <table_name> PARTITION <partition_spec> SET TBLPROPERTIES
> ('numRows'='-1');
> {code}
> 3. When already in the broken "-1" state, re-computing the stats for the
>    affected partition fixes the problem
>
> *Proposed Solution*
> While this is arguably a Hive bug, I'd recommend that Impala should just
> unconditionally update the stats when running a COMPUTE STATS. Making the
> behavior dependent on the existing metadata state is brittle and hard to
> reason about and debug, especially with Impala's metadata caching, where
> issues in stats persistence will only be observable after an INVALIDATE
> METADATA.
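
The interaction described in the explanation above can be reproduced with a small standalone model. This is only a sketch under stated assumptions, not the Impala or Hive code: the class StatsResetSketch, its methods, and the "unconditional" flag are hypothetical, partition parameters are modelled as a plain Map, the key strings other than 'numRows' are illustrative stand-ins for the StatsSetupConst constants, and alterPartition() is assumed to be called on every COMPUTE STATS.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatsResetSketch {
    // Illustrative stand-ins for Hive's StatsSetupConst constants (assumed values).
    static final String ROW_COUNT = "numRows";
    static final String STATS_TASK = "STATS_GENERATED_VIA_STATS_TASK";
    static final String COLUMN_STATS_ACCURATE = "COLUMN_STATS_ACCURATE";
    static final List<String> STATS_REQUIRE_COMPUTE = List.of(ROW_COUNT);

    // Impala side (hypothetical model): the row count and the marker are only
    // written when the count is missing or changed. Passing unconditional=true
    // models the proposed solution of always writing both.
    static Map<String, String> computeStats(
            Map<String, String> stored, String newRowCount, boolean unconditional) {
        Map<String, String> params = new HashMap<>(stored);
        String existing = params.get(ROW_COUNT);
        if (unconditional || existing == null || !existing.equals(newRowCount)) {
            params.put(ROW_COUNT, newRowCount);
            params.put(STATS_TASK, "true");
        }
        return params;
    }

    // Metastore side (hypothetical model): without the marker, the alter is
    // treated as a regular DDL change and scan-based stats are invalidated.
    static Map<String, String> alterPartition(Map<String, String> params) {
        Map<String, String> persisted = new HashMap<>(params);
        if (!persisted.containsKey(STATS_TASK)) {
            for (String stat : STATS_REQUIRE_COMPUTE) {
                persisted.put(stat, "-1");
            }
            persisted.put(COLUMN_STATS_ACCURATE, "false");
        }
        return persisted;
    }

    // Row count that ends up persisted after one COMPUTE STATS round trip.
    static String persistedRowCount(
            Map<String, String> stored, String computed, boolean unconditional) {
        return alterPartition(computeStats(stored, computed, unconditional)).get(ROW_COUNT);
    }

    public static void main(String[] args) {
        // Partition loaded by Hive with autogather: numRows is already 100.
        Map<String, String> hiveLoaded = Map.of(ROW_COUNT, "100");
        System.out.println(persistedRowCount(hiveLoaded, "100", false)); // -1 (the bug)
        System.out.println(persistedRowCount(hiveLoaded, "100", true));  // 100 (proposed fix)

        // Workaround 2: numRows manually set to -1 before computing stats.
        Map<String, String> manuallyReset = Map.of(ROW_COUNT, "-1");
        System.out.println(persistedRowCount(manuallyReset, "100", false)); // 100
    }
}
```

In this model the bug only appears when the stored and freshly computed row counts are equal, which matches the scenario above. The third call also covers workaround 3, since a partition already in the "-1" state takes the same path as one that was manually reset.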