[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-08-01 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566319#comment-16566319
 ] 

Hive QA commented on HIVE-20260:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12933976/HIVE-20260.01.patch

{color:green}SUCCESS:{color} +1 due to 4 test(s) being added or modified.

{color:green}SUCCESS:{color} +1 due to 14842 tests passed

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/12991/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12991/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12991/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12933976 - PreCommit-HIVE-Build

> NDV of a column shouldn't be scaled when row count is changed by filter on 
> another column
> -
>
> Key: HIVE-20260
> URL: https://issues.apache.org/jira/browse/HIVE-20260
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Reporter: Ashutosh Chauhan
>Assignee: Zoltan Haindrich
>Priority: Major
> Attachments: HIVE-20260.01.patch, HIVE-20260.01.patch, 
> HIVE-20260.01wip01.patch, HIVE-20260.01wip02.patch, HIVE-20260.01wip03.patch
>
>
> HIVE-17465 introduced progressive scaling of row counts in the presence of 
> multiple filters. HIVE-19500 improved on that by also scaling column stats (NDV) 
> in such scenarios. However, it should pay attention to the column used in each 
> filter expression and not scale for all filters. E.g., 
> consider the filter a = 1 and b = 2: the NDV of column b should not be scaled 
> down by the row count change caused by a = 1.
> Another way to say this is that the NDV of a particular column should be 
> updated at the end of the row count computation for that operator.
> Here are the possible cases where our estimates can be accurate (or close to it):
> {code}
> case 1 - (d_year = 2001 and d_moy=1)
> case 2 - (d_year = 2001 and d_year IN (2001, 2002))
> case 3 - (d_year = 2001 and d_moy = 1 and d_dom = 1)
> case 4 - (d_date IN ('1999-01-02', '1999-01-02'))
> case 5 - (d_date = '1999-01-01')
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-08-01 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566298#comment-16566298
 ] 

Hive QA commented on HIVE-20260:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  9m 11s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 13s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 42s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  4m 23s{color} | {color:blue} ql in master has 2302 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  6s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 13s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 39s{color} | {color:red} ql: The patch generated 3 new + 21 unchanged - 28 fixed = 24 total (was 49) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  0s{color} | {color:red} The patch 4 line(s) with tabs. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  4m 36s{color} | {color:red} ql generated 7 new + 2295 unchanged - 7 fixed = 2302 total (was 2302) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  4s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 15s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 26m 34s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:ql |
|  |  Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At StatsRulesProcFactory.java:[line 935] |
|  |  Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At StatsRulesProcFactory.java:[line 956] |
|  |  org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Byte(String) constructor; use Byte.valueOf(String) instead  At StatsRulesProcFactory.java:inefficient new Byte(String) constructor; use Byte.valueOf(String) instead  At StatsRulesProcFactory.java:[line 891] |
|  |  org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Integer(String) constructor; use Integer.valueOf(String) instead  At StatsRulesProcFactory.java:inefficient new Integer(String) constructor; use Integer.valueOf(String) instead  At StatsRulesProcFactory.java:[line 935] |
|  |  org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Long(String) constructor; 

[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-08-01 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566025#comment-16566025
 ] 

Ashutosh Chauhan commented on HIVE-20260:
-

+1



[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-08-01 Thread Zoltan Haindrich (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566014#comment-16566014
 ] 

Zoltan Haindrich commented on HIVE-20260:
-

[~ashutoshc] sure, forgot to do that


[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-08-01 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565997#comment-16565997
 ] 

Ashutosh Chauhan commented on HIVE-20260:
-

Can you also update RB ?


[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-08-01 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565809#comment-16565809
 ] 

Hive QA commented on HIVE-20260:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12933944/HIVE-20260.01.patch

{color:green}SUCCESS:{color} +1 due to 4 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 14839 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.metastore.TestMarkPartitionRemote.testMarkingPartitionSet (batchId=228)
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-Build/12984/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12984/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12984/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12933944 - PreCommit-HIVE-Build

> NDV of a column shouldn't be scaled when row count is changed by filter on 
> another column
> -
>
> Key: HIVE-20260
> URL: https://issues.apache.org/jira/browse/HIVE-20260
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Reporter: Ashutosh Chauhan
>Assignee: Zoltan Haindrich
>Priority: Major
> Attachments: HIVE-20260.01.patch, HIVE-20260.01wip01.patch, 
> HIVE-20260.01wip02.patch, HIVE-20260.01wip03.patch


[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-08-01 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565758#comment-16565758
 ] 

Hive QA commented on HIVE-20260:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 10m  4s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 15s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 41s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  4m 37s{color} | {color:blue} ql in master has 2301 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 10s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  2s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 40s{color} | {color:red} ql: The patch generated 3 new + 21 unchanged - 28 fixed = 24 total (was 49) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  0s{color} | {color:red} The patch 4 line(s) with tabs. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  5m  0s{color} | {color:red} ql generated 7 new + 2294 unchanged - 7 fixed = 2301 total (was 2301) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  7s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 14s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 28m 14s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:ql |
|  |  Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At StatsRulesProcFactory.java:[line 935] |
|  |  Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At StatsRulesProcFactory.java:[line 956] |
|  |  org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Byte(String) constructor; use Byte.valueOf(String) instead  At StatsRulesProcFactory.java:inefficient new Byte(String) constructor; use Byte.valueOf(String) instead  At StatsRulesProcFactory.java:[line 891] |
|  |  org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Integer(String) constructor; use Integer.valueOf(String) instead  At StatsRulesProcFactory.java:inefficient new Integer(String) constructor; use Integer.valueOf(String) instead  At StatsRulesProcFactory.java:[line 935] |
|  |  org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Long(String) constructor; 

[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-07-31 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564056#comment-16564056
 ] 

Hive QA commented on HIVE-20260:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12933789/HIVE-20260.01wip03.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 63 failed/errored test(s), 14838 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_rp_auto_join1] (batchId=4)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[explainuser_2] (batchId=154)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[unionDistinct_1] (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_join29] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_smb_mapjoin_14] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_sortmerge_join_10] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_sortmerge_join_9] (batchId=173)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucket_map_join_tez2] (batchId=158)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_7] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[constprog_semijoin] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[correlationoptimizer1] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[correlationoptimizer2] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[correlationoptimizer6] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynpart_sort_opt_vectorization] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[explainuser_1] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[explainuser_4] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[filter_join_breaktask] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[groupby_groupingset_bug] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_1] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[limit_pushdown] (batchId=174)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[load_dyn_part1] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[materialized_view_create_rewrite_3] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[materialized_view_create_rewrite_rebuild_dummy] (batchId=163)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[mrr] (batchId=159)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[multiMapJoin2] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[orc_llap] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[orc_predicate_pushdown] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[parquet_predicate_pushdown] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[reopt_semijoin] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sample10] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sample10_mm] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin6] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin7] (batchId=156)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin] (batchId=161)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[skewjoin] (batchId=161)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[smb_mapjoin_14] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_exists] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_in] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_in_having] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_multi] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_notin] (batchId=174)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_scalar] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_select] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_views]
 

[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-07-31 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564004#comment-16564004
 ] 

Hive QA commented on HIVE-20260:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 59s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  4s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 38s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 55s{color} | {color:blue} ql in master has 2306 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 57s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  6s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 39s{color} | {color:red} ql: The patch generated 3 new + 21 unchanged - 28 fixed = 24 total (was 49) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  0s{color} | {color:red} The patch 4 line(s) with tabs. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  4m 10s{color} | {color:red} ql generated 7 new + 2299 unchanged - 7 fixed = 2306 total (was 2306) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  0s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 13s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 23m 45s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:ql |
|  |  Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At StatsRulesProcFactory.java:[line 935] |
|  |  Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At StatsRulesProcFactory.java:[line 956] |
|  |  org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Byte(String) constructor; use Byte.valueOf(String) instead  At StatsRulesProcFactory.java:inefficient new Byte(String) constructor; use Byte.valueOf(String) instead  At StatsRulesProcFactory.java:[line 891] |
|  |  org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Integer(String) constructor; use Integer.valueOf(String) instead  At StatsRulesProcFactory.java:inefficient new Integer(String) constructor; use Integer.valueOf(String) instead  At StatsRulesProcFactory.java:[line 935] |
|  |  org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Long(String) constructor; 

[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-07-31 Thread Zoltan Haindrich (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563292#comment-16563292
 ] 

Zoltan Haindrich commented on HIVE-20260:
-

I feel that treating all columns as uncorrelated makes more sense than treating 
them as fully correlated. I've written a test (stat_estimate_drill.q) 
that produced much worse results without this patch, especially in cases 
like the above example.

The current patch behaves a little differently from the above; but to be on the 
same page, I would like to show the old behaviour as well:
{code}
create table t1 (a int, b int, c int);
-- insert 500 rows. ndv(a) = 100 ndv(b) = 50 ndv(c) = 200
select a,b,c from t1 where a = 20 and b = 30;
{code}

old (uniform scaling)
|              | rowCount                | ratio          | ndv(a) | ndv(b) | ndv(c) |
|              | 500                     |                | 100    | 50     | 200    |
| a=20         | 1/ndv(a) * rowCount = 5 | 1/ndv(a) = .01 | 1      | .5=>1  | 2      |
| a=20 && b=30 | 1/ndv(b) * rowCount = 5 | 1/ndv(b) = 1   | 1      | 1      | 2      |

The problem with the above is that it has lost the diversity of columns b and c.

patch
|              | rowCount                      | ratio          | ndv(a) | ndv(b) | ndv(c) |
|              | 500                           |                | 100    | 50     | 200    |
| a=20         | 1/ndv(a) * rowCount = 5       | 1/ndv(a) = .01 | 1      | 50 *   | 200 *  |
| a=20 && b=30 | 1/ndv(b) * rowCount = .1 => 1 | 1/ndv(b) = .02 | 1      | 1      | 200 *  |

I think it would make sense to limit ndv to the row count at the places marked 
with *, since it's not possible to have that many distinct values anymore.
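
To make the arithmetic of the "patch" table concrete, here is a minimal Java sketch of the uncorrelated model together with the suggested cap of ndv at the row count. It is not the code from the patch: the class, the {{Stats}} holder and the method names are hypothetical, and the cap is applied once at the end, which is only one possible reading of the suggestion.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class UncorrelatedNdvSketch {

    /** Hypothetical holder for the row count and per-column NDVs from the tables. */
    static final class Stats {
        double rowCount;
        final Map<String, Double> ndv = new LinkedHashMap<>();

        Stats(double rowCount) {
            this.rowCount = rowCount;
        }
    }

    /**
     * Applies "column = constant" under the uncorrelated model: the row count
     * shrinks by 1/ndv(column), the filtered column keeps one distinct value,
     * and every other column's NDV is left untouched.
     */
    static void applyEquals(Stats s, String column) {
        double columnNdv = s.ndv.get(column);
        s.rowCount = Math.max(1.0, s.rowCount / columnNdv); // never estimate below one row
        s.ndv.put(column, 1.0);
    }

    /** The suggested cap: a column cannot have more distinct values than rows. */
    static void capNdvAtRowCount(Stats s) {
        s.ndv.replaceAll((column, value) -> Math.min(value, s.rowCount));
    }

    public static void main(String[] args) {
        Stats s = new Stats(500);
        s.ndv.put("a", 100.0);
        s.ndv.put("b", 50.0);
        s.ndv.put("c", 200.0);

        applyEquals(s, "a");
        System.out.println("a=20:         rows=" + s.rowCount + " ndv=" + s.ndv);
        applyEquals(s, "b");
        System.out.println("a=20 && b=30: rows=" + s.rowCount + " ndv=" + s.ndv);
        capNdvAtRowCount(s);
        System.out.println("after cap:    rows=" + s.rowCount + " ndv=" + s.ndv);
    }
}
```

With the numbers from the example, the row count goes 500, 5, 1, and ndv(c) stays at 200 until the cap brings it down to the row count.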

About the second note: the patch already handles ANDs more or less correctly 
by clearing the affected columns when it starts evaluating an And:
https://github.com/apache/hive/compare/master...kgyrtkirk:HIVE-20260-stat-ndv#diff-11eb46db88b11b0c0fe63fb1a919f174R350

I think it would be possible to introduce a slider to make this configurable, 
but I'm not sure anyone would want to change it. Right now I think it would be 
better to use the uncorrelated model.


> NDV of a column shouldn't be scaled when row count is changed by filter on 
> another column
> -
>
> Key: HIVE-20260
> URL: https://issues.apache.org/jira/browse/HIVE-20260
> Project: Hive
>  Issue Type: Improvement
>  Components: Statistics
>Reporter: Ashutosh Chauhan
>Assignee: Zoltan Haindrich
>Priority: Major
> Attachments: HIVE-20260.01wip01.patch, HIVE-20260.01wip02.patch
>
>
> HIVE-17465 introduced progressive scaling of row counts in the presence of 
> multiple filters. HIVE-19500 improved on that by also scaling col stats (NDV) 
> in such scenarios. However, it should pay attention to the column used in the 
> filter expression and not scale for all filters. E.g., 
> consider the filter a = 1 and b = 2: the ndv of column b should not be scaled 
> down by row count changes caused by a = 1.
> Another way to say this is that the ndv of a particular column should be 
> updated at the end of the computation of the row count for that operator.
> Here are the possible cases where our estimates can be accurate (or close to it):
> {code}
> case 1 - (d_year = 2001 and d_moy=1)
> case 2 - (d_year = 2001 and d_year IN (2001, 2002))
> case 3 - (d_year = 2001 and d_moy = 1 and d_dom = 1)
> case 4 - (d_date IN ('1999-01-02', '1999-01-02'))
> case 5 - (d_date = '1999-01-01')
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-07-30 Thread Ashutosh Chauhan (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562680#comment-16562680
 ] 

Ashutosh Chauhan commented on HIVE-20260:
-

I am not sure this {{affectedColumns}} mechanism will help. IIUC, its effect is 
that column stats get updated only for columns involved in the filter 
expression. Columns which are part of the filter operator but absent from the 
expression will have their column stats unchanged. If so, that is not what we 
want. What we want is to update stats for all columns of the filter operator, 
which is the case before this patch, but change the logic of how they are 
updated. 
Before the patch, every condition of the filter results in a scale-down of col 
stats for every column. That is, we assumed each filter condition independently 
filters out different rows. 
However, if we instead assume that different filter conditions filter 
overlapping rows, then we scale down the col stats of a column involved in a 
condition only by the row count decrease caused by that condition. For columns 
not involved in any condition, we can scale down col stats by the max decrease 
of any one of the conditions.
Perhaps an example will help. Let's say we have:
{code}
create table t1 (a int, b int, c int);
-- insert 500 rows. ndv(a) = 100 ndv(b) = 50 ndv(c) = 200
select a,b,c from t1 where a = 20 and b = 30;
{code}

Here, when we process a = 20: 
rowcount = 500 / 2 = 250. ndv(a) = 100 * (250/500) = 50. ndv(b) = 50, ndv(c) = 
200. b and c's ndv are unchanged.
Then we process b = 30: 
rowcount = 250 / 2 = 125. ndv(b) = 50 * (125/250) = 25. ndv(a) = 50, ndv(c) = 
200. a and c's ndv are unchanged.

For a and b we are done, since we updated their column stats. For c (columns 
not included in a filter condition) we update with the largest factor change 
brought by any condition. Here that means 200 * (1/2) = 100.

The logic before this patch would have resulted in a change of (125/500) = 1/4 
in ndv for every column. 
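
The worked example above can be sketched in Java as follows. This is only an illustration of the proposed rule, not Hive code: the class and method names are hypothetical, each condition is given the illustrative selectivity of 1/2 used in the example, and a column is assumed to appear in at most one condition.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OverlappingFilterNdvSketch {

    /**
     * Proposed logic: a column used in a condition is scaled only by that
     * condition's selectivity; columns used in no condition are scaled once,
     * by the strongest single reduction.
     */
    static Map<String, Double> scalePerColumn(Map<String, Double> ndv,
                                              Map<String, Double> selectivityByFilterColumn) {
        double strongest = 1.0;
        for (double selectivity : selectivityByFilterColumn.values()) {
            strongest = Math.min(strongest, selectivity);
        }
        final double fallback = strongest;
        Map<String, Double> scaled = new LinkedHashMap<>(ndv);
        scaled.replaceAll((column, value) ->
            value * selectivityByFilterColumn.getOrDefault(column, fallback));
        return scaled;
    }

    /** Logic before the patch: every condition scales every column. */
    static Map<String, Double> scaleUniformly(Map<String, Double> ndv,
                                              Map<String, Double> selectivityByFilterColumn) {
        double combined = 1.0;
        for (double selectivity : selectivityByFilterColumn.values()) {
            combined *= selectivity;
        }
        final double factor = combined;
        Map<String, Double> scaled = new LinkedHashMap<>(ndv);
        scaled.replaceAll((column, value) -> value * factor);
        return scaled;
    }

    public static void main(String[] args) {
        Map<String, Double> ndv = new LinkedHashMap<>();
        ndv.put("a", 100.0);
        ndv.put("b", 50.0);
        ndv.put("c", 200.0);

        // where a = 20 and b = 30, each condition assumed to halve the row count
        Map<String, Double> conditions = new LinkedHashMap<>();
        conditions.put("a", 0.5);
        conditions.put("b", 0.5);

        System.out.println("proposed: " + scalePerColumn(ndv, conditions));
        System.out.println("before:   " + scaleUniformly(ndv, conditions));
    }
}
```

For the example table this gives ndv(a) = 50, ndv(b) = 25, ndv(c) = 100 under the proposal, versus a 1/4 reduction of every column under the old logic.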

Apart from the above, a second issue is that this scaling happens twice: once 
when the filter expression is processed [1] and then again when operator stats 
are updated [2]. That looks incorrect; we should perhaps remove one of these 
calls.
[1] : 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L355
[2] : 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L299



[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-07-30 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562264#comment-16562264
 ] 

Hive QA commented on HIVE-20260:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12933608/HIVE-20260.01wip02.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 59 failed/errored test(s), 14816 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[annotate_stats_deep_filters] (batchId=95)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[stat_estimate_drill] (batchId=13)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[stat_estimate_related_col] (batchId=42)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[explainuser_2] (batchId=154)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[unionDistinct_1] (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_join29] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_smb_mapjoin_14] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_sortmerge_join_10] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_sortmerge_join_9] (batchId=173)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucket_map_join_tez2] (batchId=158)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_7] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[constprog_semijoin] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynamic_semijoin_reduction_sw2] (batchId=176)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynpart_sort_opt_vectorization] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[explainuser_1] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[explainuser_4] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[filter_join_breaktask] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[groupby_groupingset_bug] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_1] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[load_dyn_part1] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[offset_limit_ppd_optimizer] (batchId=173)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[orc_llap] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[orc_predicate_pushdown] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[parquet_predicate_pushdown] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[reopt_semijoin] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[retry_failure_stat_changes] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sample10] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sample10_mm] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin6] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin7] (batchId=156)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin] (batchId=161)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[skewjoin] (batchId=161)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[smb_mapjoin_14] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_exists] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_in] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_in_having] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_multi] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_notin] (batchId=174)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_scalar] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_select] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_views] (batchId=159)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[tez_dynpart_hashjoin_2] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[tez_fixed_bucket_pruning] (batchId=176)

[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-07-30 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562214#comment-16562214
 ] 

Hive QA commented on HIVE-20260:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 1s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 35s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 46s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 4m 41s{color} | {color:blue} ql in master has 2297 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 7s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 42s{color} | {color:red} ql: The patch generated 3 new + 21 unchanged - 28 fixed = 24 total (was 49) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch 4 line(s) with tabs. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 4m 37s{color} | {color:red} ql generated 7 new + 2290 unchanged - 7 fixed = 2297 total (was 2297) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 13s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 26m 20s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:ql |
|  |  Boxing/unboxing to parse a primitive 
org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics,
 AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At 
StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics,
 AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At 
StatsRulesProcFactory.java:[line 927] |
|  |  Boxing/unboxing to parse a primitive 
org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics,
 AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At 
StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics,
 AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At 
StatsRulesProcFactory.java:[line 948] |
|  |  
org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics,
 AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new 
Byte(String) constructor; use Byte.valueOf(String) instead  At 
StatsRulesProcFactory.java:inefficient new Byte(String) constructor; use 
Byte.valueOf(String) instead  At StatsRulesProcFactory.java:[line 883] |
|  |  
org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics,
 AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new 
Integer(String) constructor; use Integer.valueOf(String) instead  At 
StatsRulesProcFactory.java:inefficient new Integer(String) constructor; use 
Integer.valueOf(String) instead  At StatsRulesProcFactory.java:[line 927] |
|  |  
org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics,
 AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new 
Long(String) constructor; 

[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-07-30 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562158#comment-16562158
 ] 

Hive QA commented on HIVE-20260:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12933606/HIVE-20260.01wip01.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 59 failed/errored test(s), 14816 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[annotate_stats_deep_filters] (batchId=95)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[stat_estimate_drill] (batchId=13)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[stat_estimate_related_col] (batchId=42)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[explainuser_2] (batchId=154)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[unionDistinct_1] (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_join29] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_smb_mapjoin_14] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_sortmerge_join_10] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_sortmerge_join_9] (batchId=173)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucket_map_join_tez2] (batchId=158)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_7] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[constprog_semijoin] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynamic_semijoin_reduction_sw2] (batchId=176)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynpart_sort_opt_vectorization] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[explainuser_1] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[explainuser_4] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[filter_join_breaktask] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[groupby_groupingset_bug] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_1] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[load_dyn_part1] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[offset_limit_ppd_optimizer] (batchId=173)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[orc_llap] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[orc_predicate_pushdown] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[parquet_predicate_pushdown] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[reopt_semijoin] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[retry_failure_stat_changes] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sample10] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sample10_mm] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin6] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin7] (batchId=156)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin] (batchId=161)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[skewjoin] (batchId=161)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[smb_mapjoin_14] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_exists] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_in] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_in_having] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_multi] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_notin] (batchId=174)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_scalar] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_select] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_views] (batchId=159)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[tez_dynpart_hashjoin_2] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[tez_fixed_bucket_pruning] (batchId=176)

[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-07-30 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562108#comment-16562108
 ] 

Hive QA commented on HIVE-20260:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 33s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 41s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 4m 7s{color} | {color:blue} ql in master has 2297 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 0s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 8s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 41s{color} | {color:red} ql: The patch generated 3 new + 21 unchanged - 28 fixed = 24 total (was 49) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch 4 line(s) with tabs. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 4m 20s{color} | {color:red} ql generated 7 new + 2290 unchanged - 7 fixed = 2297 total (was 2297) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 13s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 24m 57s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:ql |
|  |  Boxing/unboxing to parse a primitive 
org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics,
 AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At 
StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics,
 AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At 
StatsRulesProcFactory.java:[line 927] |
|  |  Boxing/unboxing to parse a primitive 
org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics,
 AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At 
StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics,
 AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long)  At 
StatsRulesProcFactory.java:[line 948] |
|  |  
org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics,
 AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new 
Byte(String) constructor; use Byte.valueOf(String) instead  At 
StatsRulesProcFactory.java:inefficient new Byte(String) constructor; use 
Byte.valueOf(String) instead  At StatsRulesProcFactory.java:[line 883] |
|  |  
org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics,
 AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new 
Integer(String) constructor; use Integer.valueOf(String) instead  At 
StatsRulesProcFactory.java:inefficient new Integer(String) constructor; use 
Integer.valueOf(String) instead  At StatsRulesProcFactory.java:[line 927] |
|  |  
org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics,
 AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new 
Long(String) constructor; 

[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column

2018-07-30 Thread Zoltan Haindrich (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562086#comment-16562086
 ] 

Zoltan Haindrich commented on HIVE-20260:
-

I feel that this logic might need to be rethought at some point, relying more 
on the column stats for the calculation. However, I'm afraid that won't be 
possible, since they might not be available all the time.

I've introduced some logic to keep track of the affected columns; this makes it 
much better. However, a full test run is needed to see if it causes any 
trouble for other queries:
https://reviews.apache.org/r/68109/
