[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-09-06 Thread Abhishek Somani (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16606692#comment-16606692
 ] 

Abhishek Somani commented on HIVE-20220:


[~gopalv] [~ashutoshc] can someone please help on this one.

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.2.patch, HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-08-23 Thread Abhishek Somani (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591155#comment-16591155
 ] 

Abhishek Somani commented on HIVE-20220:


To add to Ganesha's point above, this is an issue with Tez as well. So this is 
a very pertinent issue.

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.2.patch, HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-08-23 Thread Ganesha Shreedhara (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591150#comment-16591150
 ] 

Ganesha Shreedhara commented on HIVE-20220:
---

This issue can unknowingly lead to data loss or inconsistency as explained in 
the description. I would appreciate if someone can help with reviewing the fix. 

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.2.patch, HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-08-14 Thread Ganesha Shreedhara (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579582#comment-16579582
 ] 

Ganesha Shreedhara commented on HIVE-20220:
---

[~gopalv] I believe the hash mode aggregation can happen in mappers when map 
side aggregation (hive.map.aggr) is enabled.

Can we change the HashMap to LinkeHashMap at [this 
line|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java#L383]
 to maintain the order of keys?

 

Please let me know if I need to check for partial aggregation in any other 
place. 

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.2.patch, HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-08-10 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576937#comment-16576937
 ] 

Hive QA commented on HIVE-20220:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12935173/HIVE-20220.2.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:green}SUCCESS:{color} +1 due to 14875 tests passed

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/13155/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/13155/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-13155/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12935173 - PreCommit-HIVE-Build

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.2.patch, HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-08-10 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576910#comment-16576910
 ] 

Hive QA commented on HIVE-20220:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m 
28s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
5s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
41s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  4m 
10s{color} | {color:blue} ql in master has 2305 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
1s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
4s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
41s{color} | {color:red} ql: The patch generated 1 new + 55 unchanged - 0 fixed 
= 56 total (was 55) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
13s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 24m 52s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-13155/dev-support/hive-personality.sh
 |
| git revision | master / 71b7f36 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-13155/yetus/diff-checkstyle-ql.txt
 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-13155/yetus.txt |
| Powered by | Apache Yetushttp://yetus.apache.org |


This message was automatically generated.



> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.2.patch, HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select 

[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-08-10 Thread Gopal V (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576722#comment-16576722
 ] 

Gopal V commented on HIVE-20220:


A deterministic rand() only works if there's no partial group-by above it (the 
retries don't return the rows in the same order, because for loops over 
HashMap).

So you'd have to make sure these keys are not partial aggregated before the 
RAND() is applied.

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.2.patch, HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-08-10 Thread Ganesha Shreedhara (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576716#comment-16576716
 ] 

Ganesha Shreedhara commented on HIVE-20220:
---

[~ashutoshc] Please review the updated patch.

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.2.patch, HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-08-10 Thread Ganesha Shreedhara (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576640#comment-16576640
 ] 

Ganesha Shreedhara commented on HIVE-20220:
---

[~ashutoshc] I just wanted to make sure that things are fine if we use 
deterministic rand, so added it under a config. I'll make the changes to always 
call rand function with seed. 

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-08-10 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16576571#comment-16576571
 ] 

Ashutosh Chauhan commented on HIVE-20220:
-

Is there a reason for config? I think deterministic rand() is always useful. 
So, perhaps we can get rid of config and always do this.

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-08-01 Thread Ganesha Shreedhara (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564990#comment-16564990
 ] 

Ganesha Shreedhara commented on HIVE-20220:
---

Please help with adding the concerned folks to review this fix. 

Cc: [~ashutoshc] [~thejas] [~jcamachorodriguez]

 

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-07-31 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563839#comment-16563839
 ] 

Hive QA commented on HIVE-20220:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12933771/HIVE-20220.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:green}SUCCESS:{color} +1 due to 14837 tests passed

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/12965/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12965/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12965/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12933771 - PreCommit-HIVE-Build

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-07-31 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563799#comment-16563799
 ] 

Hive QA commented on HIVE-20220:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
19s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  8m 
 0s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
21s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
53s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m 
34s{color} | {color:blue} common in master has 64 extant Findbugs warnings. 
{color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 
54s{color} | {color:blue} ql in master has 2306 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
12s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m  
9s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
24s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
40s{color} | {color:red} ql: The patch generated 1 new + 55 unchanged - 0 fixed 
= 56 total (was 55) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
8s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
13s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 27m 16s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-12965/dev-support/hive-personality.sh
 |
| git revision | master / 48ce7ce |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-12965/yetus/diff-checkstyle-ql.txt
 |
| modules | C: common ql U: . |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-12965/yetus.txt |
| Powered by | Apache Yetushttp://yetus.apache.org |


This message was automatically generated.



> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now 

[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-07-31 Thread Ganesha Shreedhara (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563720#comment-16563720
 ] 

Ganesha Shreedhara commented on HIVE-20220:
---

Fixed checkstyle errors. Other test failures are not related to the change 
done. 

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-07-31 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563461#comment-16563461
 ] 

Hive QA commented on HIVE-20220:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12933695/HIVE-20220.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 14828 tests 
executed
*Failed tests:*
{noformat}
TestMiniDruidCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=192)

[druidmini_dynamic_partition.q,druidmini_test_ts.q,druidmini_expressions.q,druidmini_test_alter.q,druidmini_test_insert.q]
org.apache.hive.service.cli.thrift.TestThriftCLIServiceWithBinary.org.apache.hive.service.cli.thrift.TestThriftCLIServiceWithBinary
 (batchId=246)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/12962/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12962/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12962/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12933695 - PreCommit-HIVE-Build

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-07-31 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563429#comment-16563429
 ] 

Hive QA commented on HIVE-20220:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
46s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
47s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
26s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
59s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m 
33s{color} | {color:blue} common in master has 64 extant Findbugs warnings. 
{color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  4m 
21s{color} | {color:blue} ql in master has 2306 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
16s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m  
9s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
26s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
16s{color} | {color:red} common: The patch generated 1 new + 423 unchanged - 0 
fixed = 424 total (was 423) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
41s{color} | {color:red} ql: The patch generated 1 new + 55 unchanged - 0 fixed 
= 56 total (was 55) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
15s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
14s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 28m 59s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-12962/dev-support/hive-personality.sh
 |
| git revision | master / edc53cc |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-12962/yetus/diff-checkstyle-common.txt
 |
| checkstyle | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-12962/yetus/diff-checkstyle-ql.txt
 |
| modules | C: common ql U: . |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-12962/yetus.txt |
| Powered by | Apache Yetushttp://yetus.apache.org |


This message was automatically generated.



> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty 

[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-07-30 Thread Ganesha Shreedhara (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563069#comment-16563069
 ] 

Ganesha Shreedhara commented on HIVE-20220:
---

RB request: https://reviews.apache.org/r/68121/

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-07-30 Thread Ganesha Shreedhara (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563051#comment-16563051
 ] 

Ganesha Shreedhara commented on HIVE-20220:
---

Fixed checkstyle errors. 

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-07-30 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561469#comment-16561469
 ] 

Hive QA commented on HIVE-20220:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12933540/HIVE-20220.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:green}SUCCESS:{color} +1 due to 14817 tests passed

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/12938/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12938/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12938/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12933540 - PreCommit-HIVE-Build

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-07-30 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16561455#comment-16561455
 ] 

Hive QA commented on HIVE-20220:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
39s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
11s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
19s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
51s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m 
31s{color} | {color:blue} common in master has 64 extant Findbugs warnings. 
{color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 
57s{color} | {color:blue} ql in master has 2297 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
9s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m  
9s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
23s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
16s{color} | {color:red} common: The patch generated 2 new + 423 unchanged - 0 
fixed = 425 total (was 423) {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
38s{color} | {color:red} ql: The patch generated 1 new + 55 unchanged - 0 fixed 
= 56 total (was 55) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
10s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
13s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 26m 44s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-12938/dev-support/hive-personality.sh
 |
| git revision | master / 83e5397 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-12938/yetus/diff-checkstyle-common.txt
 |
| checkstyle | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-12938/yetus/diff-checkstyle-ql.txt
 |
| modules | C: common ql U: . |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-12938/yetus.txt |
| Powered by | Apache Yetushttp://yetus.apache.org |


This message was automatically generated.



> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty 

[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-07-24 Thread Ganesha Shreedhara (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555068#comment-16555068
 ] 

Ganesha Shreedhara commented on HIVE-20220:
---

Can someone review this patch please? Please let me know if there is a better 
way of fixing this. I can update the qtest files based on the fix. 

 

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-07-23 Thread Ganesha Shreedhara (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552606#comment-16552606
 ] 

Ganesha Shreedhara commented on HIVE-20220:
---

I'll correct the golden files if this fix is feasible. 

> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test group by id;
> //Remove mapper's intermediate output file when map stage is completed and 
> one out of 2 reduce tasks is completed and then continue the run. This causes 
> 2nd reducer to send event to Application Master to rerun the map task. 
> The following is the expected result. 
> 1
> 2
> 3
> 4
> 5
> 6
> 8
> 8
> 9 
>  
> But you may get different result due to a different value returned by the 
> rand function in the second run causing different distribution of keys.
> This needs to be fixed such that the mapper distributes the same keys even if 
> it is reattempted multiple times. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-07-23 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552464#comment-16552464
 ] 

Hive QA commented on HIVE-20220:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12932648/HIVE-20220.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 48 failed/errored test(s), 14681 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[autoColumnStats_7] 
(batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby10] (batchId=65)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby11] (batchId=76)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby1] (batchId=19)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby1_map_skew] 
(batchId=66)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby3] (batchId=6)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby4] (batchId=64)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby5] (batchId=42)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby6] (batchId=57)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby6_map_skew] 
(batchId=43)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby7_map_skew] 
(batchId=46)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby8] (batchId=78)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_cube1] 
(batchId=4)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_rollup1] 
(batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_sort_skew_1_23] 
(batchId=9)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin_distinct] 
(batchId=59)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nullgroup2] (batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[nullgroup] (batchId=75)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[union33] (batchId=26)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_groupby4] 
(batchId=16)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_groupby6] 
(batchId=92)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[explainuser_1]
 (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[groupby1] 
(batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[groupby2] 
(batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[groupby3] 
(batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[groupby_resolution]
 (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_groupby4]
 (batchId=159)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_groupby6]
 (batchId=179)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_groupby_cube1]
 (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_groupby_grouping_id2]
 (batchId=173)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_groupby_rollup1]
 (batchId=169)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_explainuser_1]
 (batchId=187)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby1] 
(batchId=116)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby1_map_skew] 
(batchId=137)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby4] 
(batchId=136)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby5] 
(batchId=126)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby6] 
(batchId=133)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby6_map_skew] 
(batchId=127)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby7_map_skew] 
(batchId=128)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby_cube1] 
(batchId=109)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby_resolution] 
(batchId=127)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby_rollup1] 
(batchId=123)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[groupby_sort_skew_1_23]
 (batchId=112)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[mapjoin_distinct] 
(batchId=134)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[nullgroup2] 
(batchId=131)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[nullgroup] 
(batchId=141)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[union33] 
(batchId=120)
org.apache.hadoop.hive.ql.TestMTQueries.testMTQueries1 (batchId=241)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/12788/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12788/console
Test logs: 

[jira] [Commented] (HIVE-20220) Incorrect result when hive.groupby.skewindata is enabled

2018-07-23 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552433#comment-16552433
 ] 

Hive QA commented on HIVE-20220:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
48s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
5s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
40s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 
51s{color} | {color:blue} ql in master has 2280 extant Findbugs warnings. 
{color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
58s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
6s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
41s{color} | {color:red} ql: The patch generated 1 new + 55 unchanged - 0 fixed 
= 56 total (was 55) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
14s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 23m 38s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/data/hiveptest/working/yetus_PreCommit-HIVE-Build-12788/dev-support/hive-personality.sh
 |
| git revision | master / 170a012 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-12788/yetus/diff-checkstyle-ql.txt
 |
| modules | C: ql U: ql |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-12788/yetus.txt |
| Powered by | Apache Yetushttp://yetus.apache.org |


This message was automatically generated.



> Incorrect result when hive.groupby.skewindata is enabled
> 
>
> Key: HIVE-20220
> URL: https://issues.apache.org/jira/browse/HIVE-20220
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.0.0
>Reporter: Ganesha Shreedhara
>Assignee: Ganesha Shreedhara
>Priority: Major
> Attachments: HIVE-20220.patch
>
>
> hive.groupby.skewindata makes use of rand UDF to randomly distribute grouped 
> by keys to the reducers and hence avoids overloading a single reducer when 
> there is a skew in data. 
> This random distribution of keys is buggy when the reducer fails to fetch the 
> mapper output due to a faulty datanode or any other reason. When reducer 
> finds that it can't fetch mapper output, it sends a signal to Application 
> Master to reattempt the corresponding map task. The reattempted map task will 
> now get the different random value from rand function and hence the keys that 
> gets distributed now to the reducer will not be same as the previous run. 
>  
> *Steps to reproduce:*
> create table test(id int);
> insert into test values 
> (1),(2),(2),(3),(3),(3),(4),(4),(4),(4),(5),(5),(5),(5),(5),(6),(6),(6),(6),(6),(6),(7),(7),(7),(7),(7),(7),(7),(7),(8),(8),(8),(8),(8),(8),(8),(8),(9),(9),(9),(9),(9),(9),(9),(9),(9);
> SET hive.groupby.skewindata=true;
> SET mapreduce.reduce.reduces=2;
> //Add a debug port for reducer
> select count(1) from test