[jira] [Resolved] (HIVE-24764) insert overwrite on a partition resets row count stats in other partitions

Rajesh Balamohan (Jira) Sun, 21 Feb 2021 22:31:12 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-24764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rajesh Balamohan resolved HIVE-24764.
-------------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Thanks [~ashutoshc] for the review. Merged the PR.

> insert overwrite on a partition resets row count stats in other partitions
> --------------------------------------------------------------------------
>
>                 Key: HIVE-24764
>                 URL: https://issues.apache.org/jira/browse/HIVE-24764
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> After insert overwrite on a partition, stats on other partitions are messed 
> up. Subsequent queries end up with plans with PARTIAL stats. In certain 
> cases, this leads to suboptimal query plans.
> {noformat}
> drop table if exists test_stats;
> drop table if exists test_stats_2;
> create table test_stats(i int, j bigint);
> create table test_stats_2(i int) partitioned by (j bigint);
> insert into test_stats values (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL);
> -- select * from test_stats;
> 1   1
> 2   2
> 3   3
> 4   4
> 5   <null>
> insert overwrite table test_stats_2 partition(j)  select i, j from test_stats 
> where j is not null;
> -- After executing this statement, stat gets messed up.
> insert overwrite table test_stats_2 partition(j)  select i, j from test_stats 
> where j is null;
> -- select * from test_stats_2;
> 1   1
> 2   2
> 3   3
> 4   4
> 5   <null>
> -- This would return "PARTIAL" stats instead of "COMPLETE"
> explain select i, count(*) as c from test_stats_2 group by i order by c desc 
> limit 10;
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
>       DagName: hive_20210208093110_62ced99e-f068-42d4-9ba8-d45fccd6c0a2:68
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: test_stats_2
>                   Statistics: Num rows: 125 Data size: 500 Basic stats: 
> PARTIAL Column stats: COMPLETE
>                   Select Operator
>                     expressions: i (type: int)
>                     outputColumnNames: i
>                     Statistics: Num rows: 125 Data size: 500 Basic stats: 
> PARTIAL Column stats: COMPLETE
>                     Group By Operator
>                       aggregations: count()
>                       keys: i (type: int)
>                       minReductionHashAggr: 0.99
>                       mode: hash
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 125 Data size: 1500 Basic stats: 
> PARTIAL Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: int)
>                         null sort order: a
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: int)
>                         Statistics: Num rows: 125 Data size: 1500 Basic 
> stats: PARTIAL Column stats: COMPLETE
>                         value expressions: _col1 (type: bigint)
>             Execution mode: vectorized, llap
>             LLAP IO: may be used (ACID table)
>         Reducer 2 
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 keys: KEY._col0 (type: int)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL 
> Column stats: COMPLETE
>                 Top N Key Operator
>                   sort order: -
>                   keys: _col1 (type: bigint)
>                   null sort order: a
>                   Statistics: Num rows: 62 Data size: 744 Basic stats: 
> PARTIAL Column stats: COMPLETE
>                   top n: 10
>                   Reduce Output Operator
>                     key expressions: _col1 (type: bigint)
>                     null sort order: a
>                     sort order: -
>                     Statistics: Num rows: 62 Data size: 744 Basic stats: 
> PARTIAL Column stats: COMPLETE
>                     TopN Hash Memory Usage: 0.04
>                     value expressions: _col0 (type: int)
>         Reducer 3 
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Select Operator
>                 expressions: VALUE._col0 (type: int), KEY.reducesinkkey0 
> (type: bigint)
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 62 Data size: 744 Basic stats: PARTIAL 
> Column stats: COMPLETE
>                 Limit
>                   Number of rows: 10
>                   Statistics: Num rows: 10 Data size: 120 Basic stats: 
> PARTIAL Column stats: COMPLETE
>                   File Output Operator
>                     compressed: false
>                     Statistics: Num rows: 10 Data size: 120 Basic stats: 
> PARTIAL Column stats: COMPLETE
>                     table:
>                         input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                         output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                         serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: 10
>       Processor Tree:
>         ListSink
> {noformat}
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/stats/fs/FSStatsAggregator.java#L138
> FSStats should return "null" or "" if partitions are not present in its list.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Resolved] (HIVE-24764) insert overwrite on a partition resets row count stats in other partitions

Reply via email to