[jira] [Created] (HIVE-22181) Same query but different result occasionally

2019-09-09 Thread qingfa zhou (Jira)
qingfa zhou created HIVE-22181:
--

 Summary: Same query but different result occasionally
 Key: HIVE-22181
 URL: https://issues.apache.org/jira/browse/HIVE-22181
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 2.3.0
 Environment: hadoop:2.7.3.2.6.1.0-129

hive:2.3.0

tez:0.8.4

java:1.8.0_212 
Reporter: qingfa zhou
 Attachments: WX20190910-123604.png, WX20190910-123621.png, 
WX20190910-123641.png

h3. 1.Phenomenon

    Running the same SQL(hive on tez) several times will result in inconsistent 
results . The following results are generated respectively

    1)

     OK

     3951864 808

    2)

     OK

     1822979 353

    But the first result is true.
h3. 2.SQL

select
 count(1),count(distinct store_code)
 from 
 (
 select
 store_code,
 all_pay_id,
 payable_price,
 row_number()over(partition by store_code order by payable_price desc) as rw,
 count(1)over(partition by store_code) as store_user
 from 
 (
 select 
 store_code,
 all_pay_id,
 sum(payable_price) as payable_price
 from data_promotion.mdw_user_promotion_shopping_behaivor_four_week_detail a
 where dt='20190904'
 group by store_code,all_pay_id
 ) a 
 order by rw 
 ) a 
 where rw/store_user<=0.8
 ;

 
h3. 3.Troubleshooting

    I lookup task result from tez web ui,give the result as follows:

           the first: !image-2019-09-10-12-16-45-385.png!

        the second:

!image-2019-09-10-12-18-08-769.png!

      so , The second RECORDS_OUT_INTERMEDIATE_Reducer_4's result is much lower 
than the first.

     This is detailed info of the second result from reducer_3 to reducer_4 

!image-2019-09-10-12-26-41-681.png!  

This is execution plan of the sql:

OK
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
Reducer 6 <- Reducer 5 (SIMPLE_EDGE)

Stage-0
 Fetch Operator
 limit:-1
 Stage-1
 Reducer 6
 File Output Operator [FS_24]
 Group By Operator [GBY_22] (rows=1 width=128)
 Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT 
KEY._col0:0._col0)"]
 <-Reducer 5 [SIMPLE_EDGE]
 SHUFFLE [RS_21]
 Group By Operator [GBY_20] (rows=5619870 width=2328)
 Output:["_col0","_col1","_col2"],aggregations:["count(1)","count(DISTINCT 
_col0)"],keys:_col0
 Select Operator [SEL_17] (rows=5619870 width=2328)
 Output:["_col0"]
 <-Reducer 4 [SIMPLE_EDGE]
 SHUFFLE [RS_16]
 Select Operator [SEL_14] (rows=5619870 width=2328)
 Output:["_col0","_col3"]
 Filter Operator [FIL_25] (rows=5619870 width=2328)
 predicate:((_col0 / count_window_1) <= 0.8)
 PTF Operator [PTF_13] (rows=16859610 width=2328)
 Function definitions:[{},\{"name:":"windowingtablefunction","order by:":"_col1 
ASC NULLS FIRST","partition by:":"_col1"}]
 Select Operator [SEL_12] (rows=16859610 width=2328)
 Output:["_col0","_col1"]
 <-Reducer 3 [SIMPLE_EDGE]
 SHUFFLE [RS_11]
 PartitionCols:_col0
 Select Operator [SEL_10] (rows=16859610 width=2328)
 Output:["_col0","row_number_window_0"]
 PTF Operator [PTF_9] (rows=16859610 width=2328)
 Function definitions:[{},\{"name:":"windowingtablefunction","order by:":"_col2 
DESC NULLS LAST","partition by:":"_col0"}]
 Select Operator [SEL_8] (rows=16859610 width=2328)
 Output:["_col0","_col2"]
 <-Reducer 2 [SIMPLE_EDGE]
 SHUFFLE [RS_7]
 PartitionCols:_col0
 Select Operator [SEL_6] (rows=16859610 width=2328)
 Output:["_col0","_col2"]
 Group By Operator [GBY_5] (rows=16859610 width=2328)
 
Output:["_col0","_col1","_col2"],aggregations:["sum(VALUE._col0)"],keys:KEY._col0,
 KEY._col1
 <-Map 1 [SIMPLE_EDGE]
 SHUFFLE [RS_4]
 PartitionCols:_col0, _col1
 Group By Operator [GBY_3] (rows=33719220 width=2328)
 
Output:["_col0","_col1","_col2"],aggregations:["sum(payable_price)"],keys:store_code,
 all_pay_id
 Select Operator [SEL_2] (rows=33719220 width=2328)
 Output:["store_code","all_pay_id","payable_price"]
 TableScan [TS_0] (rows=33719220 width=2328)
 
data_promotion@mdw_user_promotion_shopping_behaivor_four_week_detail,a,Tbl:COMPLETE,Col:NONE,Output:["all_pay_id","store_code","payable_price"]

 

    Reducer 4 show PTF , so I don't know if this result has anything to do with 
PTF? Can you help me.

 

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (HIVE-22180) Same query but different result occasionally

2019-09-09 Thread qingfa zhou (Jira)
qingfa zhou created HIVE-22180:
--

 Summary: Same query but different result occasionally
 Key: HIVE-22180
 URL: https://issues.apache.org/jira/browse/HIVE-22180
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 2.3.0
 Environment: hadoop:2.7.3.2.6.1.0-129

hive:2.3.0

tez:0.8.4

java:1.8.0_212 
Reporter: qingfa zhou
 Attachments: image-2019-09-10-12-16-45-385.png, 
image-2019-09-10-12-17-34-779.png, image-2019-09-10-12-18-08-769.png, 
image-2019-09-10-12-26-41-681.png

h3. 1.Phenomenon

    Running the same SQL(hive on tez) several times will result in inconsistent 
results . The following results are generated respectively

    1)

     OK

     3951864 808

    2)

     OK

     1822979 353

    But the first result is true.
h3. 2.SQL

select
 count(1),count(distinct store_code)
 from 
 (
 select
 store_code,
 all_pay_id,
 payable_price,
 row_number()over(partition by store_code order by payable_price desc) as rw,
 count(1)over(partition by store_code) as store_user
 from 
 (
 select 
 store_code,
 all_pay_id,
 sum(payable_price) as payable_price
 from data_promotion.mdw_user_promotion_shopping_behaivor_four_week_detail a
 where dt='20190904'
 group by store_code,all_pay_id
 ) a 
 order by rw 
 ) a 
 where rw/store_user<=0.8
 ;

 
h3. 3.Troubleshooting

    I lookup task result from tez web ui,give the result as follows:

           the first: !image-2019-09-10-12-16-45-385.png!

        the second:

!image-2019-09-10-12-18-08-769.png!

      so , The second RECORDS_OUT_INTERMEDIATE_Reducer_4's result is much lower 
than the first.

     This is detailed info of the second result from reducer_3 to reducer_4 

!image-2019-09-10-12-26-41-681.png!  

This is execution plan of the sql:

OK
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
Reducer 6 <- Reducer 5 (SIMPLE_EDGE)

Stage-0
 Fetch Operator
 limit:-1
 Stage-1
 Reducer 6
 File Output Operator [FS_24]
 Group By Operator [GBY_22] (rows=1 width=128)
 Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT 
KEY._col0:0._col0)"]
 <-Reducer 5 [SIMPLE_EDGE]
 SHUFFLE [RS_21]
 Group By Operator [GBY_20] (rows=5619870 width=2328)
 Output:["_col0","_col1","_col2"],aggregations:["count(1)","count(DISTINCT 
_col0)"],keys:_col0
 Select Operator [SEL_17] (rows=5619870 width=2328)
 Output:["_col0"]
 <-Reducer 4 [SIMPLE_EDGE]
 SHUFFLE [RS_16]
 Select Operator [SEL_14] (rows=5619870 width=2328)
 Output:["_col0","_col3"]
 Filter Operator [FIL_25] (rows=5619870 width=2328)
 predicate:((_col0 / count_window_1) <= 0.8)
 PTF Operator [PTF_13] (rows=16859610 width=2328)
 Function definitions:[{},\{"name:":"windowingtablefunction","order by:":"_col1 
ASC NULLS FIRST","partition by:":"_col1"}]
 Select Operator [SEL_12] (rows=16859610 width=2328)
 Output:["_col0","_col1"]
 <-Reducer 3 [SIMPLE_EDGE]
 SHUFFLE [RS_11]
 PartitionCols:_col0
 Select Operator [SEL_10] (rows=16859610 width=2328)
 Output:["_col0","row_number_window_0"]
 PTF Operator [PTF_9] (rows=16859610 width=2328)
 Function definitions:[{},\{"name:":"windowingtablefunction","order by:":"_col2 
DESC NULLS LAST","partition by:":"_col0"}]
 Select Operator [SEL_8] (rows=16859610 width=2328)
 Output:["_col0","_col2"]
 <-Reducer 2 [SIMPLE_EDGE]
 SHUFFLE [RS_7]
 PartitionCols:_col0
 Select Operator [SEL_6] (rows=16859610 width=2328)
 Output:["_col0","_col2"]
 Group By Operator [GBY_5] (rows=16859610 width=2328)
 
Output:["_col0","_col1","_col2"],aggregations:["sum(VALUE._col0)"],keys:KEY._col0,
 KEY._col1
 <-Map 1 [SIMPLE_EDGE]
 SHUFFLE [RS_4]
 PartitionCols:_col0, _col1
 Group By Operator [GBY_3] (rows=33719220 width=2328)
 
Output:["_col0","_col1","_col2"],aggregations:["sum(payable_price)"],keys:store_code,
 all_pay_id
 Select Operator [SEL_2] (rows=33719220 width=2328)
 Output:["store_code","all_pay_id","payable_price"]
 TableScan [TS_0] (rows=33719220 width=2328)
 
data_promotion@mdw_user_promotion_shopping_behaivor_four_week_detail,a,Tbl:COMPLETE,Col:NONE,Output:["all_pay_id","store_code","payable_price"]

 

    Reducer 4 show PTF , so I don't know if this result has anything to do with 
PTF? Can you help me.

 

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: Review Request 71458: HIVE-22163

2019-09-09 Thread Jesús Camacho Rodríguez

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71458/#review217662
---




ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java
Lines 272 (patched)


I think _needColStats_ is only true when we call this method from CBO? What 
happens with CBO if estimate stats is set to false and stats for any column are 
not present, i.e., does it fail completely or only the stage that is dependent 
on the stats (for instance, join reordering)? I think the desired behavior is 
the latest. We can probably confirm that by checking the logs for 
join_reordering_no_stats.q test.


- Jesús Camacho Rodríguez


On Sept. 9, 2019, 5:29 p.m., Krisztian Kasa wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71458/
> ---
> 
> (Updated Sept. 9, 2019, 5:29 p.m.)
> 
> 
> Review request for hive, Gopal V, Jesús Camacho Rodríguez, Zoltan Haindrich, 
> and Vineet Garg.
> 
> 
> Bugs: HIVE-22163
> https://issues.apache.org/jira/browse/HIVE-22163
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> CBO: Enabling CBO turns on stats estimation, even when the estimation is 
> disabled
> 
> 
> Diffs
> -
> 
>   ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java 1795ae5626 
>   ql/src/test/queries/clientpositive/cbo_stats_estimation.q PRE-CREATION 
>   ql/src/test/results/clientpositive/cbo_stats_estimation.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/llap/join_reordering_no_stats.q.out 
> fddffbb0a8 
> 
> 
> Diff: https://reviews.apache.org/r/71458/diff/1/
> 
> 
> Testing
> ---
> 
> New qtest: cbo_stats_estimation.q
> 
> 
> Thanks,
> 
> Krisztian Kasa
> 
>



Review Request 71458: HIVE-22163

2019-09-09 Thread Krisztian Kasa

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71458/
---

Review request for hive, Gopal V, Jesús Camacho Rodríguez, Zoltan Haindrich, 
and Vineet Garg.


Bugs: HIVE-22163
https://issues.apache.org/jira/browse/HIVE-22163


Repository: hive-git


Description
---

CBO: Enabling CBO turns on stats estimation, even when the estimation is 
disabled


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java 1795ae5626 
  ql/src/test/queries/clientpositive/cbo_stats_estimation.q PRE-CREATION 
  ql/src/test/results/clientpositive/cbo_stats_estimation.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/llap/join_reordering_no_stats.q.out 
fddffbb0a8 


Diff: https://reviews.apache.org/r/71458/diff/1/


Testing
---

New qtest: cbo_stats_estimation.q


Thanks,

Krisztian Kasa



Review Request 71456: select count gives incorrect result after loading data from text file

2019-09-09 Thread Attila Magyar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71456/
---

Review request for hive, Ashutosh Chauhan, Jesús Camacho Rodríguez, and Slim 
Bouguerra.


Bugs: HIVE-22055
https://issues.apache.org/jira/browse/HIVE-22055


Repository: hive-git


Description
---

This happens when tez.grouping.min-size is set to a small value (for example 1) 
so that the split size that is calculated from the file size is going to be 
used. This changes as the table grows and different split sizes will be used 
while doing each selects.

load 90 records from f1
select count(1) gives back 90
load 90 records from f2
select count(1) gives back 172 // 8 records missing


When running the second select the split size is larger, and 
SerDeLowLevelCacheImpl is already populated with stripes from the first select 
(and by that tiem split size was smaller).


There is problem with how LineRecordReader works togeather with the cache. So 
if a larger split is requested and an overlapping smaller one is already in the 
cache, then SerDeEncodedDataReader'll try to extend the existing split by 
reading the 
difference between the large and the small split. But it'll start reading right 
after the last stripe pyhsically ends,
and LineRecordReader always skips the first row, unless we are at the beginning 
of the file. So this line skipping behaviour is not considered at one point and 
that's why some rows are missing.


Diffs
-

  itests/src/test/resources/testconfiguration.properties 98280c52fe9 
  
llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java
 462b25fa234 
  ql/src/test/queries/clientpositive/mm_loaddata_split_change.q PRE-CREATION 
  ql/src/test/results/clientpositive/llap/mm_loaddata_split_change.q.out 
PRE-CREATION 


Diff: https://reviews.apache.org/r/71456/diff/1/


Testing
---

with q test


Thanks,

Attila Magyar