[ 
https://issues.apache.org/jira/browse/IMPALA-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768967#comment-17768967
 ] 

ASF subversion and git services commented on IMPALA-2201:
---------------------------------------------------------

Commit 45d6815821a29b83c7a3daa3d380a40e0e4f3836 in impala's branch 
refs/heads/master from Csaba Ringhofer
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=45d681582 ]

IMPALA-12462: Update only changed partitions after COMPUTE STATS

This is mainly a revert of https://gerrit.cloudera.org/#/c/640/ but
some parts had to be updated due to changes in Impala.
See IMPALA-2201 for details about why this optimization was removed.

The patch can massively speed up the COMPUTE STATS statement when the
majority of partitions have no changes.
COMPUTE STATS tpcds_parquet.store_sales;
before: 12s
after:   1s

Besides the DDL speedup, the number of HMS events generated is also
reduced.

Testing:
- added test to verify COMPUTE STATS output
- correctness of cases when something is modified should be covered
  by existing tests
- core tests passed

Change-Id: If2703e0790d5c25db98ed26f26f6d96281c366a3
Reviewed-on: http://gerrit.cloudera.org:8080/20505
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Reviewed-by: Wenzhe Zhou <wz...@cloudera.com>
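
The idea of updating only changed partitions can be illustrated with a minimal sketch. This is not the code from the patch: the class, the method, and the partition names below are hypothetical, and row count is assumed to be the only compared field, whereas the actual change may compare more of the stats.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class ChangedPartitionsSketch {
    /**
     * Returns the names of partitions whose newly computed row count differs
     * from the stored one. Hypothetical helper, not Impala's implementation.
     */
    static List<String> changedPartitions(Map<String, Long> stored,
                                          Map<String, Long> computed) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, Long> e : computed.entrySet()) {
            // A partition without a stored row count (null) counts as changed.
            if (!Objects.equals(stored.get(e.getKey()), e.getValue())) {
                changed.add(e.getKey());
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, Long> stored = new LinkedHashMap<>();
        stored.put("p=1", 100L);
        stored.put("p=2", 200L);

        Map<String, Long> computed = new LinkedHashMap<>();
        computed.put("p=1", 100L); // unchanged, would be skipped
        computed.put("p=2", 250L); // row count changed
        computed.put("p=3", 50L);  // no stored stats yet

        System.out.println(changedPartitions(stored, computed)); // prints [p=2, p=3]
    }
}
```

Only the returned partitions would need an update call against the metastore, which is where the saving in DDL time and HMS events would come from when most partitions are unchanged.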


> Compute [incremental] stats may not persist the stats if the data was loaded 
> from Hive with hive.stats.autogather=true.
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-2201
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2201
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 2.2
>            Reporter: Alexander Behm
>            Assignee: Alexander Behm
>            Priority: Blocker
>              Labels: correctness, supportability, usability
>             Fix For: Impala 2.2.7, Impala 2.3.0
>
>
> *Symptoms of This Bug*
> - Stats have been computed, but the row count reverts back to -1 after an 
> INVALIDATE METADATA
> - A compute [incremental] stats appears to not set the row count
> Example scenario where this bug may happen:
> 1. A new partition with new data is loaded into a table via Hive
> 2. Hive has hive.stats.autogather=true
> 3. Stats on the new partition are computed in Impala with COMPUTE INCREMENTAL 
> STATS <partition>
> 4. At this point, SHOW TABLE STATS shows the correct row count
> 5. INVALIDATE METADATA is run on the table in Impala
> 6. The row count reverts back to -1 because the stats have not been persisted
> *Explanation for This Bug*
> Here is why the stats are reset to -1. When hive.stats.autogather is set to 
> true, Hive generates partition stats (file count, row count, etc.) after 
> creating the partition. If you then run "compute incremental stats" in Impala, 
> you will get the same row count, so the following check will not be satisfied 
> and StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK will not be set in Impala's 
> CatalogOpExecutor.java:
> {code}
> ...
>       // Update table stats
>       if (existingRowCount == null || !existingRowCount.equals(newRowCount)) {
>         // The existing row count value wasn't set or has changed.
>         msPartition.putToParameters(StatsSetupConst.ROW_COUNT, newRowCount);
>         msPartition.putToParameters(StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK,
>             StatsSetupConst.TRUE);
>         updatedPartition = true;
>       }
> ...
> {code}
> When executing the corresponding alterPartition() RPC in the Hive Metastore, 
> the row count will be reset because the STATS_GENERATED_VIA_STATS_TASK 
> parameter was not set.
> Snippet from Hive's MetaStoreUtils.java:
> {code}
> ...
> public static boolean updatePartitionStatsFast(
>       PartitionSpecProxy.PartitionIterator part, Warehouse wh,
>       boolean madeDir, boolean forceRecompute) throws MetaException {
> ...
>         if(!params.containsKey(StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK)) {
>           // invalidate stats requiring scan since this is a regular ddl alter case
>           for (String stat : StatsSetupConst.statsRequireCompute) {
>             params.put(stat, "-1");
>           }
>           params.put(StatsSetupConst.COLUMN_STATS_ACCURATE, StatsSetupConst.FALSE);
>         }
> ...
> {code}
> So if partition stats already exist but were not computed by Impala, compute 
> incremental stats will cause the stats to be reset back to -1.
> Note that in Hive versions after CDH 5.3 this bug no longer happens because 
> the updatePartitionStatsFast() function is no longer called in the Hive 
> Metastore in the above workflow.
> *Workarounds*
> 1. Disable stats autogathering in Hive when loading the data
> {code}
> SET hive.stats.autogather=false;
> {code}
> 2. Manually alter the numRows to -1 before doing COMPUTE [INCREMENTAL] STATS 
> in Impala
> {code}
> ALTER TABLE <table_name> PARTITION <partition_spec> SET TBLPROPERTIES ('numRows'='-1');
> {code}
> 3. When already in the broken "-1" state, re-computing the stats for the 
> affected partition fixes the problem
> *Proposed Solution*
> While this is arguably a Hive bug, I'd recommend that Impala should just 
> unconditionally update the stats when running a COMPUTE STATS. Making the 
> behavior dependent on the existing metadata state is brittle and hard to 
> reason about and debug, esp. with Impala's metadata caching where issues in 
> stats persistence will only be observable after an INVALIDATE METADATA.
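
The interaction described in the quoted issue can be reproduced with a small self-contained simulation. This is a sketch under stated assumptions, not code from Impala or Hive: partition parameters are modelled as a plain map, the two key strings are assumed to match the values of the corresponding StatsSetupConst constants, and all class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class StatsResetSketch {
    // Assumed to match the values of StatsSetupConst.ROW_COUNT and
    // StatsSetupConst.STATS_GENERATED_VIA_STATS_TASK.
    static final String ROW_COUNT = "numRows";
    static final String VIA_STATS_TASK = "STATS_GENERATED_VIA_STATS_TASK";

    /** Impala side: conditional update, modelled on the quoted check. */
    static Map<String, String> conditionalUpdate(Map<String, String> params,
                                                 String newRowCount) {
        Map<String, String> out = new HashMap<>(params);
        String existing = out.get(ROW_COUNT);
        if (existing == null || !existing.equals(newRowCount)) {
            out.put(ROW_COUNT, newRowCount);
            out.put(VIA_STATS_TASK, "true");
        }
        return out;
    }

    /** Impala side: the proposed fix, which always sets both parameters. */
    static Map<String, String> unconditionalUpdate(Map<String, String> params,
                                                   String newRowCount) {
        Map<String, String> out = new HashMap<>(params);
        out.put(ROW_COUNT, newRowCount);
        out.put(VIA_STATS_TASK, "true");
        return out;
    }

    /** Metastore side: simplified stand-in for the quoted invalidation branch. */
    static Map<String, String> metastoreAlter(Map<String, String> params) {
        Map<String, String> out = new HashMap<>(params);
        if (!out.containsKey(VIA_STATS_TASK)) {
            out.put(ROW_COUNT, "-1"); // treated as a regular DDL alter
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> fromHive = new HashMap<>();
        fromHive.put(ROW_COUNT, "1000"); // written by Hive autogather

        // Same row count, marker never set: the metastore resets the stats.
        System.out.println(
            metastoreAlter(conditionalUpdate(fromHive, "1000")).get(ROW_COUNT)); // prints -1

        // Unconditional update always sets the marker: the row count survives.
        System.out.println(
            metastoreAlter(unconditionalUpdate(fromHive, "1000")).get(ROW_COUNT)); // prints 1000

        // Workaround 2: numRows forced to -1 first, so the check is satisfied.
        Map<String, String> reset = new HashMap<>();
        reset.put(ROW_COUNT, "-1");
        System.out.println(
            metastoreAlter(conditionalUpdate(reset, "1000")).get(ROW_COUNT)); // prints 1000
    }
}
```

The first case is the reported bug: the row count computed by Impala equals the one Hive already stored, so the marker is never set and the simulated metastore invalidates the stats.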



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

