Stamatis Zampetakis created HIVE-23485: ------------------------------------------
Summary: Bound GroupByOperator stats using largest NDV among columns Key: HIVE-23485 URL: https://issues.apache.org/jira/browse/HIVE-23485 Project: Hive Issue Type: Improvement Reporter: Stamatis Zampetakis Assignee: Stamatis Zampetakis Consider the following SQL query: {code:sql} select id, name from person group by id, name; {code} and assume that the person table contains the following tuples: {code:sql} insert into person values (0, 'A') ; insert into person values (1, 'A') ; insert into person values (2, 'B') ; insert into person values (3, 'B') ; insert into person values (4, 'B') ; insert into person values (5, 'C') ; {code} If we know the number of distinct values (NDV) for all columns in the group by clause then we can infer a lower bound for the total number of rows by taking the maximun NDV of the involved columns. Currently the query in the scenario above has the following plan: {noformat} Vertex dependency in root stage Reducer 2 <- Map 1 (SIMPLE_EDGE) Stage-0 Fetch Operator limit:-1 Stage-1 Reducer 2 vectorized File Output Operator [FS_11] Group By Operator [GBY_10] (rows=3 width=92) Output:["_col0","_col1"],keys:KEY._col0, KEY._col1 <-Map 1 [SIMPLE_EDGE] vectorized SHUFFLE [RS_9] PartitionCols:_col0, _col1 Group By Operator [GBY_8] (rows=3 width=92) Output:["_col0","_col1"],keys:id, name Select Operator [SEL_7] (rows=6 width=92) Output:["id","name"] TableScan [TS_0] (rows=6 width=92) default@person,person,Tbl:COMPLETE,Col:COMPLETE,Output:["id","name"]{noformat} Observe that the stats for group by report 3 rows but given that the ID attribute is part of the aggregation the rows cannot be less than 6. -- This message was sent by Atlassian Jira (v8.3.4#803005)