[jira] [Comment Edited] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-11 Thread Nemon Lou (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413481#comment-17413481
 ] 

Nemon Lou edited comment on HIVE-24579 at 9/11/21, 7:11 AM:


Thanks [~kkasa] for your attention.

This issue only happens on a customer's cluster, and i could not get the data.

This simplified reproduce step seems not match the customer's issue.

Here is the original issue(with table name changed):

 The query result is different for the same store_id when change limit 10 to 
limit 100
{code:sql}
SELECT store_id store_id_hive
, count(1) device_cnt_bound_30day
FROM db_name.table_name
WHERE i_rep_date <= 20201226
AND i_rep_date >= 
cast(from_unixtime(unix_timestamp('20201226','MMdd')-86400*29,'MMdd') 
as int)
AND nvl(is_curr_bound,1) = 1
group by store_id limit 10;
{code}
query plan :
  
{code:sql}
|  Explain   |
++
| Plan optimized by CBO. |
||
| Vertex dependency in root stage|
| Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
||
| Stage-0|
|   Fetch Operator   |
| limit:10   |
| Stage-1|
|   Reducer 2|
|   File Output Operator [FS_8]  |
| Limit [LIM_7] (rows=10 width=39)   |
|   Number of rows:10|
|   Group By Operator [GBY_5] (rows=5618832 width=39) |
| 
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
|   <-Map 1 [SIMPLE_EDGE]|
| SHUFFLE [RS_4] |
|   PartitionCols:_col0  |
|   Group By Operator [GBY_3] (rows=11237665 width=39) |
| 
Output:["_col0","_col1"],aggregations:["count()"],keys:store_id |
| Select Operator [SEL_2] (rows=11237665 width=39) |
|   Output:["store_id"]  |
|   Filter Operator [FIL_9] (rows=11237665 width=39) |
| predicate:(NVL(is_curr_bound,1) = 1) |
| TableScan [TS_0] (rows=22475330 width=39) |
|   
db_name@table_name,table_name,Tbl:COMPLETE,Col:NONE,Output:["store_id","is_curr_bound"]
 |
{code}
 part of the extended plan:
{code:sql}
++
|  Explain   |
++
| STAGE DEPENDENCIES:|
|   Stage-1 is a root stage  |
|   Stage-0 depends on stages: Stage-1   |
||
| STAGE PLANS:   |
|   Stage: Stage-1   |
| Tez|
|   DagId: omm_20201228025339_1ef293cf-c508-431a-bf00-6df95178c6e8:3229 |
|   Edges:   |
| Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
|   DagName: omm_20201228025339_1ef293cf-c508-431a-bf00-6df95178c6e8:3229 |
|   Vertices:|
| Map 1  |
| Map Operator Tree: |
| TableScan  |
|   alias: table_name  |
|   Statistics: Num rows: 22475330 Data size: 876537870 Basic 
stats: COMPLETE Column stats: NONE |
|   GatherStats: false   |
|   Filter Operator  |
| isSamplingPred: false  |
| predicate: (NVL(is_curr_bound,1) = 1) (type: boolean) |
| Statistics: Num rows: 11237665 Data size: 438268935 Basic 
stats: COMPLETE Column stats: NONE |
| Select Operator|
|   expressions: store_id (type: string) |
|   outputColumnNames: store_id  |
|   Statistics: Num rows: 11237665 Data size: 438268935 
Basic stats: COMPLETE Column stats: NONE |
|   Group By Operator|
| aggregations: count()  |
| keys: store_id (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 11237665 Data size: 438268935 
Basic stats: COMPLETE Column stats: NONE |
|

[jira] [Comment Edited] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-11 Thread Nemon Lou (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413481#comment-17413481
 ] 

Nemon Lou edited comment on HIVE-24579 at 9/11/21, 7:02 AM:


Thanks [~kkasa] for your attention.

This issue only happens on a customer's cluster, and i could not get the data.

This simplified reproduce step seems not match the customer's issue.

Here is the original issue(with table name changed):

 The query result is different for the same store_id when change limit 10 to 
limit 100
{code:sql}
SELECT store_id store_id_hive
, count(1) device_cnt_bound_30day
FROM db_name.table_name
WHERE i_rep_date <= 20201226
AND i_rep_date >= 
cast(from_unixtime(unix_timestamp('20201226','MMdd')-86400*29,'MMdd') 
as int)
AND nvl(is_curr_bound,1) = 1
group by store_id limit 10;
{code}
query plan :
  
{code:sql}
|  Explain   |
++
| Plan optimized by CBO. |
||
| Vertex dependency in root stage|
| Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
||
| Stage-0|
|   Fetch Operator   |
| limit:10   |
| Stage-1|
|   Reducer 2|
|   File Output Operator [FS_8]  |
| Limit [LIM_7] (rows=10 width=39)   |
|   Number of rows:10|
|   Group By Operator [GBY_5] (rows=5618832 width=39) |
| 
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
|   <-Map 1 [SIMPLE_EDGE]|
| SHUFFLE [RS_4] |
|   PartitionCols:_col0  |
|   Group By Operator [GBY_3] (rows=11237665 width=39) |
| 
Output:["_col0","_col1"],aggregations:["count()"],keys:store_id |
| Select Operator [SEL_2] (rows=11237665 width=39) |
|   Output:["store_id"]  |
|   Filter Operator [FIL_9] (rows=11237665 width=39) |
| predicate:(NVL(is_curr_bound,1) = 1) |
| TableScan [TS_0] (rows=22475330 width=39) |
|   
db_name@table_name,table_name,Tbl:COMPLETE,Col:NONE,Output:["store_id","is_curr_bound"]
 |
{code}
 part of the extended plan:
{code:sql}
 | Reduce Output Operator |
|   key expressions: _col0 (type: string) |
|   null sort order: a   |
|   sort order: +|
|   Map-reduce partition columns: _col0 (type: string) |
|   Statistics: Num rows: 11237665 Data size: 438268935 
Basic stats: COMPLETE Column stats: NONE |
|   tag: -1  |
|   TopN: 10 |
|   TopN Hash Memory Usage: 0.1 |
|   value expressions: _col1 (type: bigint) |
|   auto parallelism: true   |
{code}


was (Author: nemon):
Thanks [~kkasa] for your attention.

This issue only happens on a customer's cluster, and i could not get the data.

This simplified reproduce step seems not match the customer's issue.

Here is the original sql (with table name changed):

 {code:sql}
SELECT store_id store_id_hive
, count(1) device_cnt_bound_30day
FROM db_name.table_name
WHERE i_rep_date <= 20201226
AND i_rep_date >= 
cast(from_unixtime(unix_timestamp('20201226','MMdd')-86400*29,'MMdd') 
as int)
AND nvl(is_curr_bound,1) = 1
group by store_id limit 10;
{code}

query plan :
 {code:sql}
|  Explain   |
++
| Plan optimized by CBO. |
||
| Vertex dependency in root stage|
| Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
||
| Stage-0|
|   Fetch Operator   |
| limit:10   |
| Stage-1|
|   Reducer 2|
|   File Output Operator [FS_8]  |
| Limit [LIM_7] (rows=10 width=39)   |
|   Number of rows:10|
|   Group By Operator [GBY_5] (rows=5618832 width=39) |
|