[jira] [Work logged] (HIVE-25202) Support decimal64 operations for PTF operators

2021-09-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25202?focusedWorklogId=649647&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-649647
 ]

ASF GitHub Bot logged work on HIVE-25202:
-

Author: ASF GitHub Bot
Created on: 12/Sep/21 00:11
Start Date: 12/Sep/21 00:11
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] commented on pull request #2416:
URL: https://github.com/apache/hive/pull/2416#issuecomment-917508141


   This pull request has been automatically marked as stale because it has not 
had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the d...@hive.apache.org list if the patch is in 
need of reviews.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 649647)
Time Spent: 1h 10m  (was: 1h)

> Support decimal64 operations for PTF operators
> --
>
> Key: HIVE-25202
> URL: https://issues.apache.org/jira/browse/HIVE-25202
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> After adding decimal64 vectorization support for multiple operators, PTF 
> operators were found to break the decimal64 chain when they occur between two 
> decimal64-capable operators, introducing an unnecessary cast to decimal. To 
> prevent this, PTF operators should also handle the decimal64 data type.
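> A minimal sketch of a query shape that exercises this (table and column names 
> are illustrative, not from the original report): a windowing function over a 
> short-precision decimal column, with vectorization enabled, is where the extra 
> cast around the PTF operator would previously show up in the vectorized plan.
> {code:sql}
> set hive.vectorized.execution.enabled=true;
> create table sales (store_id int, amount decimal(10,2)) stored as orc;
> explain vectorization detail
> select store_id,
>        amount,
>        sum(amount) over (partition by store_id order by amount) as running_total
> from sales
> where amount > 0;
> {code}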



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-25296) Replace parquet-hadoop-bundle dependency with the actual parquet modules

2021-09-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25296?focusedWorklogId=649648&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-649648
 ]

ASF GitHub Bot logged work on HIVE-25296:
-

Author: ASF GitHub Bot
Created on: 12/Sep/21 00:11
Start Date: 12/Sep/21 00:11
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] closed pull request #2288:
URL: https://github.com/apache/hive/pull/2288


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 649648)
Time Spent: 40m  (was: 0.5h)

> Replace parquet-hadoop-bundle dependency with the actual parquet modules
> 
>
> Key: HIVE-25296
> URL: https://issues.apache.org/jira/browse/HIVE-25296
> Project: Hive
>  Issue Type: Improvement
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The parquet-hadoop-bundle is not a real dependency but a mere repackaging
> of three Parquet modules into an uber jar. The Parquet community
> created this artificial module on request of HIVE-5783, but the
> benefits, if any, are unclear.
> On the contrary, using the uber dependency has several drawbacks:
> * Parquet source code cannot be attached easily in IDEs, which makes 
> debugging sessions cumbersome.
> * The concrete Parquet dependencies cannot be determined just by 
> inspecting the pom files.
> * Extra maintenance cost for the Parquet community, which must add 
> additional verification steps during a release.
> The goal of this JIRA is to replace the uber dependency with concrete 
> dependencies on the respective modules:
> * parquet-common
> * parquet-column
> * parquet-hadoop



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-11 Thread Nemon Lou (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413502#comment-17413502
 ] 

Nemon Lou commented on HIVE-24579:
--

I have reproduced this issue, but the data is too big to upload (more than 
30 MB). Any suggestions? [~kkasa]

> Incorrect Result For Groupby With Limit
> ---
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: Nemon Lou
>Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an unexpected TopN in the map phase, which causes incorrect results.
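> A consistency check along these lines (illustrative, assuming the test table 
> is populated) surfaces the problem: every group returned by the limited query 
> should report the same count as the full aggregation.
> {code:sql}
> select l.id, l.cnt_limited, f.cnt_full
> from (select id, count(*) cnt_limited from test group by id limit 10) l
> join (select id, count(*) cnt_full from test group by id) f on l.id = f.id
> where l.cnt_limited <> f.cnt_full;
> {code}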
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
>   DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   GatherStats: false
>   Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: count()
>   keys: id (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> tag: -1
> TopN: 10
> TopN Hash Memory Usage: 0.1
> value expressions: _col1 (type: bigint)
> auto parallelism: true
> Execution mode: vectorized
> Path -> Alias:
>   file:/user/hive/warehouse/test [test]
> Path -> Partition:
>   file:/user/hive/warehouse/test 
> Partition
>   base file name: test
>   input format: org.apache.hadoop.mapred.TextInputFormat
>   output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   properties:
> COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments 
> columns.types int
> file.inputformat org.apache.hadoop.mapred.TextInputFormat
> file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
>   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> 
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> properties:
>   COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>   bucket_count -1
>   bucketing_version 2
>   column.name.delimiter ,
>   columns id
>   columns.comments 
>   columns.types int
>   file.inputformat 
> org.apache.hadoop.mapred.TextInputFormat
>   file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   location 

[jira] [Updated] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-11 Thread Nemon Lou (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nemon Lou updated HIVE-24579:
-
Attachment: testdata.tar.7z.007

> Incorrect Result For Groupby With Limit
> ---
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: Nemon Lou
>Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an unexpected TopN in the map phase, which causes incorrect results.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
>   DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   GatherStats: false
>   Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: count()
>   keys: id (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> tag: -1
> TopN: 10
> TopN Hash Memory Usage: 0.1
> value expressions: _col1 (type: bigint)
> auto parallelism: true
> Execution mode: vectorized
> Path -> Alias:
>   file:/user/hive/warehouse/test [test]
> Path -> Partition:
>   file:/user/hive/warehouse/test 
> Partition
>   base file name: test
>   input format: org.apache.hadoop.mapred.TextInputFormat
>   output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   properties:
> COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments 
> columns.types int
> file.inputformat org.apache.hadoop.mapred.TextInputFormat
> file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
>   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> 
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> properties:
>   COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>   bucket_count -1
>   bucketing_version 2
>   column.name.delimiter ,
>   columns id
>   columns.comments 
>   columns.types int
>   file.inputformat 
> org.apache.hadoop.mapred.TextInputFormat
>   file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   location file:/user/hive/warehouse/test
>   name default.test
>   numFiles 0
>

[jira] [Updated] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-11 Thread Nemon Lou (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nemon Lou updated HIVE-24579:
-
Attachment: (was: testdata.tar.7z.007)

> Incorrect Result For Groupby With Limit
> ---
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: Nemon Lou
>Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an unexpected TopN in the map phase, which causes incorrect results.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
>   DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   GatherStats: false
>   Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: count()
>   keys: id (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> tag: -1
> TopN: 10
> TopN Hash Memory Usage: 0.1
> value expressions: _col1 (type: bigint)
> auto parallelism: true
> Execution mode: vectorized
> Path -> Alias:
>   file:/user/hive/warehouse/test [test]
> Path -> Partition:
>   file:/user/hive/warehouse/test 
> Partition
>   base file name: test
>   input format: org.apache.hadoop.mapred.TextInputFormat
>   output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   properties:
> COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments 
> columns.types int
> file.inputformat org.apache.hadoop.mapred.TextInputFormat
> file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
>   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> 
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> properties:
>   COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>   bucket_count -1
>   bucketing_version 2
>   column.name.delimiter ,
>   columns id
>   columns.comments 
>   columns.types int
>   file.inputformat 
> org.apache.hadoop.mapred.TextInputFormat
>   file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   location file:/user/hive/warehouse/test
>   name default.test
>   numFiles 0
> 

[jira] [Comment Edited] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-11 Thread Nemon Lou (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413481#comment-17413481
 ] 

Nemon Lou edited comment on HIVE-24579 at 9/11/21, 7:11 AM:


Thanks [~kkasa] for your attention.

This issue only happens on a customer's cluster, and I could not get the data.

This simplified reproduction step does not seem to match the customer's issue.

Here is the original issue (with the table name changed):

The query result for the same store_id differs when the limit is changed from 
10 to 100:
{code:sql}
SELECT store_id store_id_hive
, count(1) device_cnt_bound_30day
FROM db_name.table_name
WHERE i_rep_date <= 20201226
AND i_rep_date >= 
cast(from_unixtime(unix_timestamp('20201226','yyyyMMdd')-86400*29,'yyyyMMdd') 
as int)
AND nvl(is_curr_bound,1) = 1
group by store_id limit 10;
{code}
query plan:
{code:sql}
|  Explain   |
++
| Plan optimized by CBO. |
||
| Vertex dependency in root stage|
| Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
||
| Stage-0|
|   Fetch Operator   |
| limit:10   |
| Stage-1|
|   Reducer 2|
|   File Output Operator [FS_8]  |
| Limit [LIM_7] (rows=10 width=39)   |
|   Number of rows:10|
|   Group By Operator [GBY_5] (rows=5618832 width=39) |
| 
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
|   <-Map 1 [SIMPLE_EDGE]|
| SHUFFLE [RS_4] |
|   PartitionCols:_col0  |
|   Group By Operator [GBY_3] (rows=11237665 width=39) |
| 
Output:["_col0","_col1"],aggregations:["count()"],keys:store_id |
| Select Operator [SEL_2] (rows=11237665 width=39) |
|   Output:["store_id"]  |
|   Filter Operator [FIL_9] (rows=11237665 width=39) |
| predicate:(NVL(is_curr_bound,1) = 1) |
| TableScan [TS_0] (rows=22475330 width=39) |
|   
db_name@table_name,table_name,Tbl:COMPLETE,Col:NONE,Output:["store_id","is_curr_bound"]
 |
{code}
 part of the extended plan:
{code:sql}
++
|  Explain   |
++
| STAGE DEPENDENCIES:|
|   Stage-1 is a root stage  |
|   Stage-0 depends on stages: Stage-1   |
||
| STAGE PLANS:   |
|   Stage: Stage-1   |
| Tez|
|   DagId: omm_20201228025339_1ef293cf-c508-431a-bf00-6df95178c6e8:3229 |
|   Edges:   |
| Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
|   DagName: omm_20201228025339_1ef293cf-c508-431a-bf00-6df95178c6e8:3229 |
|   Vertices:|
| Map 1  |
| Map Operator Tree: |
| TableScan  |
|   alias: table_name  |
|   Statistics: Num rows: 22475330 Data size: 876537870 Basic 
stats: COMPLETE Column stats: NONE |
|   GatherStats: false   |
|   Filter Operator  |
| isSamplingPred: false  |
| predicate: (NVL(is_curr_bound,1) = 1) (type: boolean) |
| Statistics: Num rows: 11237665 Data size: 438268935 Basic 
stats: COMPLETE Column stats: NONE |
| Select Operator|
|   expressions: store_id (type: string) |
|   outputColumnNames: store_id  |
|   Statistics: Num rows: 11237665 Data size: 438268935 
Basic stats: COMPLETE Column stats: NONE |
|   Group By Operator|
| aggregations: count()  |
| keys: store_id (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 11237665 Data size: 438268935 
Basic stats: COMPLETE Column stats: NONE |
|

[jira] [Comment Edited] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-11 Thread Nemon Lou (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413481#comment-17413481
 ] 

Nemon Lou edited comment on HIVE-24579 at 9/11/21, 7:02 AM:


Thanks [~kkasa] for your attention.

This issue only happens on a customer's cluster, and I could not get the data.

This simplified reproduction step does not seem to match the customer's issue.

Here is the original issue (with the table name changed):

The query result for the same store_id differs when the limit is changed from 
10 to 100:
{code:sql}
SELECT store_id store_id_hive
, count(1) device_cnt_bound_30day
FROM db_name.table_name
WHERE i_rep_date <= 20201226
AND i_rep_date >= 
cast(from_unixtime(unix_timestamp('20201226','yyyyMMdd')-86400*29,'yyyyMMdd') 
as int)
AND nvl(is_curr_bound,1) = 1
group by store_id limit 10;
{code}
query plan:
{code:sql}
|  Explain   |
++
| Plan optimized by CBO. |
||
| Vertex dependency in root stage|
| Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
||
| Stage-0|
|   Fetch Operator   |
| limit:10   |
| Stage-1|
|   Reducer 2|
|   File Output Operator [FS_8]  |
| Limit [LIM_7] (rows=10 width=39)   |
|   Number of rows:10|
|   Group By Operator [GBY_5] (rows=5618832 width=39) |
| 
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
|   <-Map 1 [SIMPLE_EDGE]|
| SHUFFLE [RS_4] |
|   PartitionCols:_col0  |
|   Group By Operator [GBY_3] (rows=11237665 width=39) |
| 
Output:["_col0","_col1"],aggregations:["count()"],keys:store_id |
| Select Operator [SEL_2] (rows=11237665 width=39) |
|   Output:["store_id"]  |
|   Filter Operator [FIL_9] (rows=11237665 width=39) |
| predicate:(NVL(is_curr_bound,1) = 1) |
| TableScan [TS_0] (rows=22475330 width=39) |
|   
db_name@table_name,table_name,Tbl:COMPLETE,Col:NONE,Output:["store_id","is_curr_bound"]
 |
{code}
 part of the extended plan:
{code:sql}
 | Reduce Output Operator |
|   key expressions: _col0 (type: string) |
|   null sort order: a   |
|   sort order: +|
|   Map-reduce partition columns: _col0 (type: string) |
|   Statistics: Num rows: 11237665 Data size: 438268935 
Basic stats: COMPLETE Column stats: NONE |
|   tag: -1  |
|   TopN: 10 |
|   TopN Hash Memory Usage: 0.1 |
|   value expressions: _col1 (type: bigint) |
|   auto parallelism: true   |
{code}
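A direct way to see the discrepancy (illustrative, not part of the original 
report; the store_id literal is a placeholder) is to compare the per-store 
count from the limited query against an unlimited count for the same store:
{code:sql}
SELECT count(1)
FROM db_name.table_name
WHERE i_rep_date <= 20201226
AND i_rep_date >= 
cast(from_unixtime(unix_timestamp('20201226','yyyyMMdd')-86400*29,'yyyyMMdd') 
as int)
AND nvl(is_curr_bound,1) = 1
AND store_id = '<some_store_id>';
{code}
If the map-side TopN has dropped rows, this count will not match the 
device_cnt_bound_30day value returned for that store under limit 10.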


was (Author: nemon):
Thanks [~kkasa] for your attention.

This issue only happens on a customer's cluster, and I could not get the data.

This simplified reproduction step does not seem to match the customer's issue.

Here is the original SQL (with the table name changed):

 {code:sql}
SELECT store_id store_id_hive
, count(1) device_cnt_bound_30day
FROM db_name.table_name
WHERE i_rep_date <= 20201226
AND i_rep_date >= 
cast(from_unixtime(unix_timestamp('20201226','yyyyMMdd')-86400*29,'yyyyMMdd') 
as int)
AND nvl(is_curr_bound,1) = 1
group by store_id limit 10;
{code}

query plan:
 {code:sql}
|  Explain   |
++
| Plan optimized by CBO. |
||
| Vertex dependency in root stage|
| Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
||
| Stage-0|
|   Fetch Operator   |
| limit:10   |
| Stage-1|
|   Reducer 2|
|   File Output Operator [FS_8]  |
| Limit [LIM_7] (rows=10 width=39)   |
|   Number of rows:10|
|   Group By Operator [GBY_5] (rows=5618832 width=39) |
| 

[jira] [Commented] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-11 Thread Nemon Lou (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413481#comment-17413481
 ] 

Nemon Lou commented on HIVE-24579:
--

Thanks [~kkasa] for your attention.

This issue only happens on a customer's cluster, and I could not get the data.

This simplified reproduction step does not seem to match the customer's issue.

Here is the original SQL (with the table name changed):

 {code:sql}
SELECT store_id store_id_hive
, count(1) device_cnt_bound_30day
FROM db_name.table_name
WHERE i_rep_date <= 20201226
AND i_rep_date >= 
cast(from_unixtime(unix_timestamp('20201226','yyyyMMdd')-86400*29,'yyyyMMdd') 
as int)
AND nvl(is_curr_bound,1) = 1
group by store_id limit 10;
{code}

query plan:
 {code:sql}
|  Explain   |
++
| Plan optimized by CBO. |
||
| Vertex dependency in root stage|
| Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
||
| Stage-0|
|   Fetch Operator   |
| limit:10   |
| Stage-1|
|   Reducer 2|
|   File Output Operator [FS_8]  |
| Limit [LIM_7] (rows=10 width=39)   |
|   Number of rows:10|
|   Group By Operator [GBY_5] (rows=5618832 width=39) |
| 
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
|   <-Map 1 [SIMPLE_EDGE]|
| SHUFFLE [RS_4] |
|   PartitionCols:_col0  |
|   Group By Operator [GBY_3] (rows=11237665 width=39) |
| 
Output:["_col0","_col1"],aggregations:["count()"],keys:store_id |
| Select Operator [SEL_2] (rows=11237665 width=39) |
|   Output:["store_id"]  |
|   Filter Operator [FIL_9] (rows=11237665 width=39) |
| predicate:(NVL(is_curr_bound,1) = 1) |
| TableScan [TS_0] (rows=22475330 width=39) |
|   
db_name@table_name,table_name,Tbl:COMPLETE,Col:NONE,Output:["store_id","is_curr_bound"]
 |
{code}
 part of the extended plan:
{code:sql}
  Reduce Output Operator |
   key expressions: _col0 (type: string) |
   null sort order: a   |
   sort order: +|
   Map-reduce partition columns: _col0 (type: string) |
   Statistics: Num rows: 11237665 Data size: 438268935 
Basic stats: COMPLETE Column stats: NONE |
   tag: -1  |
   TopN: 100|
   TopN Hash Memory Usage: 0.1 |
   value expressions: _col1 (type: bigint) |
   auto parallelism: true   |
{code}


> Incorrect Result For Groupby With Limit
> ---
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: Nemon Lou
>Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an unexpected TopN in the map phase, which causes incorrect results.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
>   DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   GatherStats: false
>   Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: count()
>   keys: id (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
>