[jira] [Comment Edited] (HIVE-24579) Incorrect Result For Groupby With Limit
[ https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413481#comment-17413481 ] Nemon Lou edited comment on HIVE-24579 at 9/11/21, 7:11 AM: Thanks [~kkasa] for your attention. This issue only happens on a customer's cluster, and i could not get the data. This simplified reproduce step seems not match the customer's issue. Here is the original issue(with table name changed): The query result is different for the same store_id when change limit 10 to limit 100 {code:sql} SELECT store_id store_id_hive , count(1) device_cnt_bound_30day FROM db_name.table_name WHERE i_rep_date <= 20201226 AND i_rep_date >= cast(from_unixtime(unix_timestamp('20201226','MMdd')-86400*29,'MMdd') as int) AND nvl(is_curr_bound,1) = 1 group by store_id limit 10; {code} query plan : {code:sql} | Explain | ++ | Plan optimized by CBO. | || | Vertex dependency in root stage| | Reducer 2 <- Map 1 (SIMPLE_EDGE) | || | Stage-0| | Fetch Operator | | limit:10 | | Stage-1| | Reducer 2| | File Output Operator [FS_8] | | Limit [LIM_7] (rows=10 width=39) | | Number of rows:10| | Group By Operator [GBY_5] (rows=5618832 width=39) | | Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 | | <-Map 1 [SIMPLE_EDGE]| | SHUFFLE [RS_4] | | PartitionCols:_col0 | | Group By Operator [GBY_3] (rows=11237665 width=39) | | Output:["_col0","_col1"],aggregations:["count()"],keys:store_id | | Select Operator [SEL_2] (rows=11237665 width=39) | | Output:["store_id"] | | Filter Operator [FIL_9] (rows=11237665 width=39) | | predicate:(NVL(is_curr_bound,1) = 1) | | TableScan [TS_0] (rows=22475330 width=39) | | db_name@table_name,table_name,Tbl:COMPLETE,Col:NONE,Output:["store_id","is_curr_bound"] | {code} part of the extended plan: {code:sql} ++ | Explain | ++ | STAGE DEPENDENCIES:| | Stage-1 is a root stage | | Stage-0 depends on stages: Stage-1 | || | STAGE PLANS: | | Stage: Stage-1 | | Tez| | DagId: omm_20201228025339_1ef293cf-c508-431a-bf00-6df95178c6e8:3229 | | Edges: | | Reducer 2 <- Map 1 (SIMPLE_EDGE) | | DagName: omm_20201228025339_1ef293cf-c508-431a-bf00-6df95178c6e8:3229 | | Vertices:| | Map 1 | | Map Operator Tree: | | TableScan | | alias: table_name | | Statistics: Num rows: 22475330 Data size: 876537870 Basic stats: COMPLETE Column stats: NONE | | GatherStats: false | | Filter Operator | | isSamplingPred: false | | predicate: (NVL(is_curr_bound,1) = 1) (type: boolean) | | Statistics: Num rows: 11237665 Data size: 438268935 Basic stats: COMPLETE Column stats: NONE | | Select Operator| | expressions: store_id (type: string) | | outputColumnNames: store_id | | Statistics: Num rows: 11237665 Data size: 438268935 Basic stats: COMPLETE Column stats: NONE | | Group By Operator| | aggregations: count() | | keys: store_id (type: string) | | mode: hash | | outputColumnNames: _col0, _col1 | | Statistics: Num rows: 11237665 Data size: 438268935 Basic stats: COMPLETE Column stats: NONE | |
[jira] [Comment Edited] (HIVE-24579) Incorrect Result For Groupby With Limit
[ https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413481#comment-17413481 ] Nemon Lou edited comment on HIVE-24579 at 9/11/21, 7:02 AM: Thanks [~kkasa] for your attention. This issue only happens on a customer's cluster, and i could not get the data. This simplified reproduce step seems not match the customer's issue. Here is the original issue(with table name changed): The query result is different for the same store_id when change limit 10 to limit 100 {code:sql} SELECT store_id store_id_hive , count(1) device_cnt_bound_30day FROM db_name.table_name WHERE i_rep_date <= 20201226 AND i_rep_date >= cast(from_unixtime(unix_timestamp('20201226','MMdd')-86400*29,'MMdd') as int) AND nvl(is_curr_bound,1) = 1 group by store_id limit 10; {code} query plan : {code:sql} | Explain | ++ | Plan optimized by CBO. | || | Vertex dependency in root stage| | Reducer 2 <- Map 1 (SIMPLE_EDGE) | || | Stage-0| | Fetch Operator | | limit:10 | | Stage-1| | Reducer 2| | File Output Operator [FS_8] | | Limit [LIM_7] (rows=10 width=39) | | Number of rows:10| | Group By Operator [GBY_5] (rows=5618832 width=39) | | Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 | | <-Map 1 [SIMPLE_EDGE]| | SHUFFLE [RS_4] | | PartitionCols:_col0 | | Group By Operator [GBY_3] (rows=11237665 width=39) | | Output:["_col0","_col1"],aggregations:["count()"],keys:store_id | | Select Operator [SEL_2] (rows=11237665 width=39) | | Output:["store_id"] | | Filter Operator [FIL_9] (rows=11237665 width=39) | | predicate:(NVL(is_curr_bound,1) = 1) | | TableScan [TS_0] (rows=22475330 width=39) | | db_name@table_name,table_name,Tbl:COMPLETE,Col:NONE,Output:["store_id","is_curr_bound"] | {code} part of the extended plan: {code:sql} | Reduce Output Operator | | key expressions: _col0 (type: string) | | null sort order: a | | sort order: +| | Map-reduce partition columns: _col0 (type: string) | | Statistics: Num rows: 11237665 Data size: 438268935 Basic stats: COMPLETE Column stats: NONE | | tag: -1 | | TopN: 10 | | TopN Hash Memory Usage: 0.1 | | value expressions: _col1 (type: bigint) | | auto parallelism: true | {code} was (Author: nemon): Thanks [~kkasa] for your attention. This issue only happens on a customer's cluster, and i could not get the data. This simplified reproduce step seems not match the customer's issue. Here is the original sql (with table name changed): {code:sql} SELECT store_id store_id_hive , count(1) device_cnt_bound_30day FROM db_name.table_name WHERE i_rep_date <= 20201226 AND i_rep_date >= cast(from_unixtime(unix_timestamp('20201226','MMdd')-86400*29,'MMdd') as int) AND nvl(is_curr_bound,1) = 1 group by store_id limit 10; {code} query plan : {code:sql} | Explain | ++ | Plan optimized by CBO. | || | Vertex dependency in root stage| | Reducer 2 <- Map 1 (SIMPLE_EDGE) | || | Stage-0| | Fetch Operator | | limit:10 | | Stage-1| | Reducer 2| | File Output Operator [FS_8] | | Limit [LIM_7] (rows=10 width=39) | | Number of rows:10| | Group By Operator [GBY_5] (rows=5618832 width=39) | |