[jira] [Commented] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-23 Thread Krisztian Kasa (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419176#comment-17419176
 ] 

Krisztian Kasa commented on HIVE-24579:
---

[~nemon]
 Thanks for reviewing the patch.
 1. The original query is about returning the first n keys coming from the 
Group by operator. It can be any n keys. So I think many good result exists. By 
adding sorting we choose one of those possible n key sets.
 2. Run some tests in a 3 node test cluster and
 * found that with low NDV of group by keys the additional reducer for ordering 
is negligible sometimes even faster.
 * high NDV: topN + ordering is much faster. I think when topN is enabled the 
amount of data broadcasted to reducers from mappers can be much less.
 
3. Not sure cbo=false is recommended. Lots of features are not working without 
cbo.

> Incorrect Result For Groupby With Limit
> ---
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
>  Issue Type: Bug
>  Components: Physical Optimizer
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: Nemon Lou
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an TopN unexpectly for map phase, which casues incorrect result.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
>   DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   GatherStats: false
>   Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: count()
>   keys: id (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> tag: -1
> TopN: 10
> TopN Hash Memory Usage: 0.1
> value expressions: _col1 (type: bigint)
> auto parallelism: true
> Execution mode: vectorized
> Path -> Alias:
>   file:/user/hive/warehouse/test [test]
> Path -> Partition:
>   file:/user/hive/warehouse/test 
> Partition
>   base file name: test
>   input format: org.apache.hadoop.mapred.TextInputFormat
>   output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   properties:
> COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments 
> columns.types int
> file.inputformat org.apache.hadoop.mapred.TextInputFormat
> file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
>   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> 
> 

[jira] [Commented] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-22 Thread Nemon Lou (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418562#comment-17418562
 ] 

Nemon Lou commented on HIVE-24579:
--

[~kkasa] Good job! 
After reading your PR, I have some concerns.
1. Does the sorting stage cause compatibility problems? For example, the 
returned content is different from the original after sort. 
(There are many examples in the .q.out file). This seems to be less of a 
problem than the incorrect result.
2. Faster by topn + order by, or faster by reducing one stage (no topn + no 
order by)? Do different solutions need to be selected for different scenarios?
3. In the scenario where cbo=false, do we need to fix it?
Thanks.

> Incorrect Result For Groupby With Limit
> ---
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
>  Issue Type: Bug
>  Components: Physical Optimizer
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: Nemon Lou
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an TopN unexpectly for map phase, which casues incorrect result.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
>   DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   GatherStats: false
>   Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: count()
>   keys: id (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> tag: -1
> TopN: 10
> TopN Hash Memory Usage: 0.1
> value expressions: _col1 (type: bigint)
> auto parallelism: true
> Execution mode: vectorized
> Path -> Alias:
>   file:/user/hive/warehouse/test [test]
> Path -> Partition:
>   file:/user/hive/warehouse/test 
> Partition
>   base file name: test
>   input format: org.apache.hadoop.mapred.TextInputFormat
>   output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   properties:
> COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments 
> columns.types int
> file.inputformat org.apache.hadoop.mapred.TextInputFormat
> file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
>   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> 
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: 
> 

[jira] [Commented] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-20 Thread Krisztian Kasa (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17417640#comment-17417640
 ] 

Krisztian Kasa commented on HIVE-24579:
---

[~nemon]
 Managed to reproduce this using a small dataset:
{code:java}
SET hive.optimize.topnkey=false;
SET hive.vectorized.execution.enabled=false;

-- Force using multiple mappers.
set tez.grouping.max-size=50;
set tez.grouping.min-size=25;

create table test(id string);
insert into test(id) values
(4), (3), (4), (3), (4), (3), (4), (3), (4), (3), (4), (3), (4), (3), (4), (3), 
(4), (3), (4), (10), (2), (1), (5);

select id, count(1) from test group by id limit 2;
{code}
{code:java}
10  1
3   6  -- This should be 9
{code}
I run above script using the TestMiniLlapLocalCliDriver.
 By setting
{code:java}
SET hive.optimize.topnkey=false;
{code}
TopNKeyOperator is disabled so *TopNHash* in the *ReduceSinkOperator* is used 
instead.

I had two mappers and two reducers.
Each mapper has its own TopNHash instance. Since many rows have id=3 these rows 
are processed by both mappers and TopNHash instances. However in one of the 
mappers id=3 rows are filtered out by other rows.

In the reducers the Group by operators (calculates the final result of 
aggregations) processed these rows:
{code:java}
Reducer1 GBY: 
Key, Partial count
[10] 1
[3] 6
Reducer2 GBY:
Key, Partial count
[1] 1
[4] 7
{code}
In the end LimitOperators just take these rows and in the final step 
FetchOperator takes the first output file produced by one of the 
LimitOperators. The fetch task has also a limit of 2 so the it does not touch 
the second file created by the second LimitOperator.

TopNKeyOperator is a kind of upgrade and a replacement of TopNHash. It can be 
even pushed down until TableScan operator if the conditions meet. So I don't 
think It would solve this issue.

I think the solution is when GBY has aggregate functions transform the plan to 
a *group by id order by id limit 2* plan. It has the same plan like in the 
description but It has an additional edge with a single Reducer. It is 
responsible to do a merge sort on the outputs of its source reducers and get 
the first n row of the result.


> Incorrect Result For Groupby With Limit
> ---
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: Nemon Lou
>Assignee: Krisztian Kasa
>Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an TopN unexpectly for map phase, which casues incorrect result.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
>   DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   GatherStats: false
>   Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: count()
>   keys: id (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> tag: -1
> TopN: 10
> TopN Hash Memory Usage: 0.1
> value expressions: _col1 (type: bigint)
> auto parallelism: true
> Execution mode: vectorized
> Path -> Alias:
>   file:/user/hive/warehouse/test [test]
> Path -> Partition:
>   file:/user/hive/warehouse/test 
> Partition
>   base file name: test
>   input format: org.apache.hadoop.mapred.TextInputFormat
>   output format: 
> 

[jira] [Commented] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-13 Thread Nemon Lou (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414657#comment-17414657
 ] 

Nemon Lou commented on HIVE-24579:
--

Another user also reports the same issue : 
https://issues.apache.org/jira/browse/HIVE-3562?focusedCommentId=17170367=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17170367

> Incorrect Result For Groupby With Limit
> ---
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: Nemon Lou
>Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an TopN unexpectly for map phase, which casues incorrect result.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
>   DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   GatherStats: false
>   Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: count()
>   keys: id (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> tag: -1
> TopN: 10
> TopN Hash Memory Usage: 0.1
> value expressions: _col1 (type: bigint)
> auto parallelism: true
> Execution mode: vectorized
> Path -> Alias:
>   file:/user/hive/warehouse/test [test]
> Path -> Partition:
>   file:/user/hive/warehouse/test 
> Partition
>   base file name: test
>   input format: org.apache.hadoop.mapred.TextInputFormat
>   output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   properties:
> COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments 
> columns.types int
> file.inputformat org.apache.hadoop.mapred.TextInputFormat
> file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
>   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> 
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> properties:
>   COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>   bucket_count -1
>   bucketing_version 2
>   column.name.delimiter ,
>   columns id
>   columns.comments 
>   columns.types int
>   file.inputformat 
> org.apache.hadoop.mapred.TextInputFormat
>   file.outputformat 
> 

[jira] [Commented] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-13 Thread Nemon Lou (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414151#comment-17414151
 ] 

Nemon Lou commented on HIVE-24579:
--

I think topn key operator has the same issue. What's your Opinion?  [~kkasa]

> Incorrect Result For Groupby With Limit
> ---
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: Nemon Lou
>Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an TopN unexpectly for map phase, which casues incorrect result.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
>   DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   GatherStats: false
>   Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: count()
>   keys: id (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> tag: -1
> TopN: 10
> TopN Hash Memory Usage: 0.1
> value expressions: _col1 (type: bigint)
> auto parallelism: true
> Execution mode: vectorized
> Path -> Alias:
>   file:/user/hive/warehouse/test [test]
> Path -> Partition:
>   file:/user/hive/warehouse/test 
> Partition
>   base file name: test
>   input format: org.apache.hadoop.mapred.TextInputFormat
>   output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   properties:
> COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments 
> columns.types int
> file.inputformat org.apache.hadoop.mapred.TextInputFormat
> file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
>   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> 
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> properties:
>   COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>   bucket_count -1
>   bucketing_version 2
>   column.name.delimiter ,
>   columns id
>   columns.comments 
>   columns.types int
>   file.inputformat 
> org.apache.hadoop.mapred.TextInputFormat
>   file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   location file:/user/hive/warehouse/test
>   

[jira] [Commented] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-13 Thread Nemon Lou (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414006#comment-17414006
 ] 

Nemon Lou commented on HIVE-24579:
--

After debuging,I find the bug is quite intuitive:

There is no order granted in the final result, but TopN in mapper filters out 
part of the data. Causing incorrect aggragation result of some keys.For example:

Assume that key1 is in the top 10 key at first and then is squeezed by other 
keys, but some data is still transmitted to the downstream. As a result, key1 
obtains an incorrect summarization result in the reduce phase.
However, the final result is not obtained from the top 10 keys but from the 
output results of multiple reduce. Therefore, key1 may be obtained, causing an 
error in the final result.

> Incorrect Result For Groupby With Limit
> ---
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: Nemon Lou
>Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an TopN unexpectly for map phase, which casues incorrect result.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
>   DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   GatherStats: false
>   Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: count()
>   keys: id (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> tag: -1
> TopN: 10
> TopN Hash Memory Usage: 0.1
> value expressions: _col1 (type: bigint)
> auto parallelism: true
> Execution mode: vectorized
> Path -> Alias:
>   file:/user/hive/warehouse/test [test]
> Path -> Partition:
>   file:/user/hive/warehouse/test 
> Partition
>   base file name: test
>   input format: org.apache.hadoop.mapred.TextInputFormat
>   output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   properties:
> COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments 
> columns.types int
> file.inputformat org.apache.hadoop.mapred.TextInputFormat
> file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
>   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> 
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> properties:
>   COLUMN_STATS_ACCURATE 
> 

[jira] [Commented] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-11 Thread Nemon Lou (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413502#comment-17413502
 ] 

Nemon Lou commented on HIVE-24579:
--

I have repoduce this issue.But data is too big to upload(more than 30mb), any 
suggestions? [~kkasa]

> Incorrect Result For Groupby With Limit
> ---
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: Nemon Lou
>Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an TopN unexpectly for map phase, which casues incorrect result.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
>   DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   GatherStats: false
>   Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: count()
>   keys: id (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> tag: -1
> TopN: 10
> TopN Hash Memory Usage: 0.1
> value expressions: _col1 (type: bigint)
> auto parallelism: true
> Execution mode: vectorized
> Path -> Alias:
>   file:/user/hive/warehouse/test [test]
> Path -> Partition:
>   file:/user/hive/warehouse/test 
> Partition
>   base file name: test
>   input format: org.apache.hadoop.mapred.TextInputFormat
>   output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   properties:
> COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments 
> columns.types int
> file.inputformat org.apache.hadoop.mapred.TextInputFormat
> file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
>   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> 
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> properties:
>   COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>   bucket_count -1
>   bucketing_version 2
>   column.name.delimiter ,
>   columns id
>   columns.comments 
>   columns.types int
>   file.inputformat 
> org.apache.hadoop.mapred.TextInputFormat
>   file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   location 

[jira] [Commented] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-11 Thread Nemon Lou (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413481#comment-17413481
 ] 

Nemon Lou commented on HIVE-24579:
--

Thanks [~kkasa] for your attention.

This issue only happens on a customer's cluster, and i could not get the data.

This simplified reproduce step seems not match the customer's issue.

Here is the original sql (with table name changed):

 {code:sql}
SELECT store_id store_id_hive
, count(1) device_cnt_bound_30day
FROM db_name.table_name
WHERE i_rep_date <= 20201226
AND i_rep_date >= 
cast(from_unixtime(unix_timestamp('20201226','MMdd')-86400*29,'MMdd') 
as int)
AND nvl(is_curr_bound,1) = 1
group by store_id limit 10;
{code}

query plan :
 {code:sql}
|  Explain   |
++
| Plan optimized by CBO. |
||
| Vertex dependency in root stage|
| Reducer 2 <- Map 1 (SIMPLE_EDGE)   |
||
| Stage-0|
|   Fetch Operator   |
| limit:10   |
| Stage-1|
|   Reducer 2|
|   File Output Operator [FS_8]  |
| Limit [LIM_7] (rows=10 width=39)   |
|   Number of rows:10|
|   Group By Operator [GBY_5] (rows=5618832 width=39) |
| 
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
|   <-Map 1 [SIMPLE_EDGE]|
| SHUFFLE [RS_4] |
|   PartitionCols:_col0  |
|   Group By Operator [GBY_3] (rows=11237665 width=39) |
| 
Output:["_col0","_col1"],aggregations:["count()"],keys:store_id |
| Select Operator [SEL_2] (rows=11237665 width=39) |
|   Output:["store_id"]  |
|   Filter Operator [FIL_9] (rows=11237665 width=39) |
| predicate:(NVL(is_curr_bound,1) = 1) |
| TableScan [TS_0] (rows=22475330 width=39) |
|   
db_name@table_name,table_name,Tbl:COMPLETE,Col:NONE,Output:["store_id","is_curr_bound"]
 |
{code}
 part of the extended plan:
{code:sql}
  Reduce Output Operator |
   key expressions: _col0 (type: string) |
   null sort order: a   |
   sort order: +|
   Map-reduce partition columns: _col0 (type: string) |
   Statistics: Num rows: 11237665 Data size: 438268935 
Basic stats: COMPLETE Column stats: NONE |
   tag: -1  |
   TopN: 100|
   TopN Hash Memory Usage: 0.1 |
   value expressions: _col1 (type: bigint) |
   auto parallelism: true   |
{code}


> Incorrect Result For Groupby With Limit
> ---
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: Nemon Lou
>Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an TopN unexpectly for map phase, which casues incorrect result.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
>   DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   GatherStats: false
>   Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: count()
>   keys: id (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> 

[jira] [Commented] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-09-10 Thread Krisztian Kasa (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413144#comment-17413144
 ] 

Krisztian Kasa commented on HIVE-24579:
---

[~nemon]
The TopN is introduced because the Reduce Sink operator does sorting by the id 
column and there is also a limit 10 defined. So it broadcast only the first 10 
rows to the reducer.

Could you please provide some test data which can be used to reproduce the 
incorrect result you mention in the summary?

> Incorrect Result For Groupby With Limit
> ---
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: Nemon Lou
>Priority: Major
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an TopN unexpectly for map phase, which casues incorrect result.
> {code:sql}
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
>   DagName: root_20210104141527_c599c0cd-ca2f-4c7d-a3cc-3a01d65c49a1:5
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: test
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   GatherStats: false
>   Select Operator
> expressions: id (type: int)
> outputColumnNames: id
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   aggregations: count()
>   keys: id (type: int)
>   mode: hash
>   outputColumnNames: _col0, _col1
>   Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
>   Reduce Output Operator
> key expressions: _col0 (type: int)
> null sort order: a
> sort order: +
> Map-reduce partition columns: _col0 (type: int)
> Statistics: Num rows: 1 Data size: 13500 Basic stats: 
> COMPLETE Column stats: NONE
> tag: -1
> TopN: 10
> TopN Hash Memory Usage: 0.1
> value expressions: _col1 (type: bigint)
> auto parallelism: true
> Execution mode: vectorized
> Path -> Alias:
>   file:/user/hive/warehouse/test [test]
> Path -> Partition:
>   file:/user/hive/warehouse/test 
> Partition
>   base file name: test
>   input format: org.apache.hadoop.mapred.TextInputFormat
>   output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   properties:
> COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
> bucket_count -1
> bucketing_version 2
> column.name.delimiter ,
> columns id
> columns.comments 
> columns.types int
> file.inputformat org.apache.hadoop.mapred.TextInputFormat
> file.outputformat 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> location file:/user/hive/warehouse/test
> name default.test
> numFiles 0
> numRows 0
> rawDataSize 0
> serialization.ddl struct test { i32 id}
> serialization.format 1
> serialization.lib 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> totalSize 0
> transient_lastDdlTime 1609730190
>   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> 
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> properties:
>   COLUMN_STATS_ACCURATE 
> {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true"}}
>   bucket_count -1
>   bucketing_version 2
>   column.name.delimiter ,
>   columns id
>   columns.comments 
>   columns.types int
>   

[jira] [Commented] (HIVE-24579) Incorrect Result For Groupby With Limit

2021-01-03 Thread Nemon Lou (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17257932#comment-17257932
 ] 

Nemon Lou commented on HIVE-24579:
--

A workaround is hive.limit.pushdown.memory.usage=0 .

 

> Incorrect Result For Groupby With Limit
> ---
>
> Key: HIVE-24579
> URL: https://issues.apache.org/jira/browse/HIVE-24579
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: Nemon Lou
>Priority: Critical
>
> {code:sql}
> create table test(id int);
> explain extended select id,count(*) from test group by id limit 10;
> {code}
> There is an TopN unexpectly for map phase, which casues incorrect result.
> {code:sql}
> STAGE PLANS:
>  Stage: Stage-1
>  Map Reduce
>  Map Operator Tree:
>  TableScan
>  alias: test
>  Statistics: Num rows: 337 Data size: 1350 Basic stats: COMPLETE Column 
> stats: NONE
>  GatherStats: false
>  Select Operator
>  expressions: id (type: int)
>  outputColumnNames: id
>  Statistics: Num rows: 337 Data size: 1350 Basic stats: COMPLETE Column 
> stats: NONE
>  Group By Operator
>  aggregations: count()
>  keys: id (type: int)
>  mode: hash
>  outputColumnNames: _col0, _col1
>  Statistics: Num rows: 337 Data size: 1350 Basic stats: COMPLETE Column 
> stats: NONE
>  Reduce Output Operator
>  key expressions: _col0 (type: int)
>  null sort order: a
>  sort order: +
>  Map-reduce partition columns: _col0 (type: int)
>  Statistics: Num rows: 337 Data size: 1350 Basic stats: COMPLETE Column 
> stats: NONE
>  tag: -1
>  TopN: 10
>  TopN Hash Memory Usage: 0.1
>  value expressions: _col1 (type: bigint)
>  auto parallelism: false
>  Path -> Alias:
>  file:/user/hive/warehouse/test [test]
>  Path -> Partition:
>  file:/user/hive/warehouse/test 
>  Partition
>  base file name: test
>  input format: org.apache.hadoop.mapred.TextInputFormat
>  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>  properties:
>  COLUMN_STATS_ACCURATE \{"BASIC_STATS":"true"}
>  bucket_count -1
>  column.name.delimiter ,
>  columns id
>  columns.comments 
>  columns.types int
>  file.inputformat org.apache.hadoop.mapred.TextInputFormat
>  file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>  location file:/user/hive/warehouse/test
>  name default.test
>  numFiles 0
>  numRows 0
>  rawDataSize 0
>  serialization.ddl struct test \{ i32 id}
>  serialization.format 1
>  serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>  totalSize 0
>  transient_lastDdlTime 1609730036
>  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>  
>  input format: org.apache.hadoop.mapred.TextInputFormat
>  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>  properties:
>  COLUMN_STATS_ACCURATE \{"BASIC_STATS":"true"}
>  bucket_count -1
>  column.name.delimiter ,
>  columns id
>  columns.comments 
>  columns.types int
>  file.inputformat org.apache.hadoop.mapred.TextInputFormat
>  file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>  location file:/user/hive/warehouse/test
>  name default.test
>  numFiles 0
>  numRows 0
>  rawDataSize 0
>  serialization.ddl struct test \{ i32 id}
>  serialization.format 1
>  serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>  totalSize 0
>  transient_lastDdlTime 1609730036
>  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>  name: default.test
>  name: default.test
>  Truncated Path -> Alias:
>  /test [test]
>  Needs Tagging: false
>  Reduce Operator Tree:
>  Group By Operator
>  aggregations: count(VALUE._col0)
>  keys: KEY._col0 (type: int)
>  mode: mergepartial
>  outputColumnNames: _col0, _col1
>  Statistics: Num rows: 168 Data size: 672 Basic stats: COMPLETE Column stats: 
> NONE
>  Limit
>  Number of rows: 10
>  Statistics: Num rows: 10 Data size: 40 Basic stats: COMPLETE Column stats: 
> NONE
>  File Output Operator
>  compressed: false
>  GlobalTableId: 0
>  directory: 
> file:/tmp/root/bd08973b-b58c-4185-9072-c1891f67878d/hive_2021-01-04_11-14-01_745_4475755683092435506-1/-mr-10001/.hive-staging_hive_2021-01-04_11-14-01_745_4475755683092435506-1/-ext-10002
>  NumFilesPerFileSink: 1
>  Statistics: Num rows: 10 Data size: 40 Basic stats: COMPLETE Column stats: 
> NONE
>  Stats Publishing Key Prefix: 
> file:/tmp/root/bd08973b-b58c-4185-9072-c1891f67878d/hive_2021-01-04_11-14-01_745_4475755683092435506-1/-mr-10001/.hive-staging_hive_2021-01-04_11-14-01_745_4475755683092435506-1/-ext-10002/
>  table:
>  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  properties:
>  columns _col0,_col1
>  columns.types int:bigint
>  escape.delim \
>  hive.serialization.extend.additional.nesting.levels true
>  serialization.escape.crlf true
>