[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566319#comment-16566319 ]

Hive QA commented on HIVE-20260:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12933976/HIVE-20260.01.patch

SUCCESS: +1 due to 4 test(s) being added or modified.

SUCCESS: +1 due to 14842 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/12991/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12991/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12991/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12933976 - PreCommit-HIVE-Build

> NDV of a column shouldn't be scaled when row count is changed by filter on another column
>
> Key: HIVE-20260
> URL: https://issues.apache.org/jira/browse/HIVE-20260
> Project: Hive
> Issue Type: Improvement
> Components: Statistics
> Reporter: Ashutosh Chauhan
> Assignee: Zoltan Haindrich
> Priority: Major
> Attachments: HIVE-20260.01.patch, HIVE-20260.01.patch, HIVE-20260.01wip01.patch, HIVE-20260.01wip02.patch, HIVE-20260.01wip03.patch
>
> HIVE-17465 introduced progressive scaling of row counts in the presence of multiple filters. HIVE-19500 improved on that by also scaling column stats (NDV) in such scenarios. However, it should pay attention to the column used in the filter expression and not scale for all filters. E.g., consider the filter a = 1 and b = 2: the NDV of column b should not be scaled down by row count changes caused by a = 1.
> Another way to say this is that the NDV of a particular column should be updated at the end of the row count computation for that operator.
> Here are the possible cases where our estimates can be accurate (or close to it):
> {code}
> case 1 - (d_year = 2001 and d_moy=1)
> case 2 - (d_year = 2001 and d_year IN (2001, 2002))
> case 3 - (d_year = 2001 and d_moy = 1 and d_dom = 1)
> case 4 - (d_date IN ('1999-01-02', '1999-01-02'))
> case 5 - (d_date = '1999-01-01')
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566298#comment-16566298 ]

Hive QA commented on HIVE-20260:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 9m 11s | master passed |
| +1 | compile | 1m 13s | master passed |
| +1 | checkstyle | 0m 42s | master passed |
| 0 | findbugs | 4m 23s | ql in master has 2302 extant Findbugs warnings. |
| +1 | javadoc | 1m 6s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 31s | the patch passed |
| +1 | compile | 1m 13s | the patch passed |
| +1 | javac | 1m 13s | the patch passed |
| -1 | checkstyle | 0m 39s | ql: The patch generated 3 new + 21 unchanged - 28 fixed = 24 total (was 49) |
| -1 | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| -1 | whitespace | 0m 0s | The patch 4 line(s) with tabs. |
| -1 | findbugs | 4m 36s | ql generated 7 new + 2295 unchanged - 7 fixed = 2302 total (was 2302) |
| +1 | javadoc | 1m 4s | the patch passed |
|| || || || Other Tests ||
| +1 | asflicense | 0m 15s | The patch does not generate ASF License warnings. |
| | | 26m 34s | |

|| Reason || Tests ||
| FindBugs | module:ql |
| | Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:[line 935] |
| | Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:[line 956] |
| | org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Byte(String) constructor; use Byte.valueOf(String) instead At StatsRulesProcFactory.java:inefficient new Byte(String) constructor; use Byte.valueOf(String) instead At StatsRulesProcFactory.java:[line 891] |
| | org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Integer(String) constructor; use Integer.valueOf(String) instead At StatsRulesProcFactory.java:inefficient new Integer(String) constructor; use Integer.valueOf(String) instead At StatsRulesProcFactory.java:[line 935] |
| | org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Long(String) constructor;
[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566025#comment-16566025 ]

Ashutosh Chauhan commented on HIVE-20260:

+1
[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16566014#comment-16566014 ]

Zoltan Haindrich commented on HIVE-20260:

[~ashutoshc] sure, forgot to do that
[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565997#comment-16565997 ]

Ashutosh Chauhan commented on HIVE-20260:

Can you also update RB ?
[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565809#comment-16565809 ]

Hive QA commented on HIVE-20260:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12933944/HIVE-20260.01.patch

SUCCESS: +1 due to 4 test(s) being added or modified.

ERROR: -1 due to 1 failed/errored test(s), 14839 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.metastore.TestMarkPartitionRemote.testMarkingPartitionSet (batchId=228)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/12984/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12984/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12984/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12933944 - PreCommit-HIVE-Build

> NDV of a column shouldn't be scaled when row count is changed by filter on another column
>
> Key: HIVE-20260
> URL: https://issues.apache.org/jira/browse/HIVE-20260
> Project: Hive
> Issue Type: Improvement
> Components: Statistics
> Reporter: Ashutosh Chauhan
> Assignee: Zoltan Haindrich
> Priority: Major
> Attachments: HIVE-20260.01.patch, HIVE-20260.01wip01.patch, HIVE-20260.01wip02.patch, HIVE-20260.01wip03.patch
>
> HIVE-17465 introduced progressive scaling of row counts in the presence of multiple filters. HIVE-19500 improved on that by also scaling column stats (NDV) in such scenarios. However, it should pay attention to the column used in the filter expression and not scale for all filters. E.g., consider the filter a = 1 and b = 2: the NDV of column b should not be scaled down by row count changes caused by a = 1.
> Another way to say this is that the NDV of a particular column should be updated at the end of the row count computation for that operator.
> Here are the possible cases where our estimates can be accurate (or close to it):
> {code}
> case 1 - (d_year = 2001 and d_moy=1)
> case 2 - (d_year = 2001 and d_year IN (2001, 2002))
> case 3 - (d_year = 2001 and d_moy = 1 and d_dom = 1)
> case 4 - (d_date IN ('1999-01-02', '1999-01-02'))
> case 5 - (d_date = '1999-01-01')
> {code}
[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565758#comment-16565758 ]

Hive QA commented on HIVE-20260:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 10m 4s | master passed |
| +1 | compile | 1m 15s | master passed |
| +1 | checkstyle | 0m 41s | master passed |
| 0 | findbugs | 4m 37s | ql in master has 2301 extant Findbugs warnings. |
| +1 | javadoc | 1m 10s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 41s | the patch passed |
| +1 | compile | 1m 2s | the patch passed |
| +1 | javac | 1m 2s | the patch passed |
| -1 | checkstyle | 0m 40s | ql: The patch generated 3 new + 21 unchanged - 28 fixed = 24 total (was 49) |
| -1 | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| -1 | whitespace | 0m 0s | The patch 4 line(s) with tabs. |
| -1 | findbugs | 5m 0s | ql generated 7 new + 2294 unchanged - 7 fixed = 2301 total (was 2301) |
| +1 | javadoc | 1m 7s | the patch passed |
|| || || || Other Tests ||
| +1 | asflicense | 0m 14s | The patch does not generate ASF License warnings. |
| | | 28m 14s | |

|| Reason || Tests ||
| FindBugs | module:ql |
| | Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:[line 935] |
| | Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:[line 956] |
| | org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Byte(String) constructor; use Byte.valueOf(String) instead At StatsRulesProcFactory.java:inefficient new Byte(String) constructor; use Byte.valueOf(String) instead At StatsRulesProcFactory.java:[line 891] |
| | org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Integer(String) constructor; use Integer.valueOf(String) instead At StatsRulesProcFactory.java:inefficient new Integer(String) constructor; use Integer.valueOf(String) instead At StatsRulesProcFactory.java:[line 935] |
| | org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Long(String) constructor;
[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564056#comment-16564056 ]

Hive QA commented on HIVE-20260:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12933789/HIVE-20260.01wip03.patch

SUCCESS: +1 due to 2 test(s) being added or modified.

ERROR: -1 due to 63 failed/errored test(s), 14838 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_rp_auto_join1] (batchId=4)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[explainuser_2] (batchId=154)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[unionDistinct_1] (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_join29] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_smb_mapjoin_14] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_sortmerge_join_10] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_sortmerge_join_9] (batchId=173)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucket_map_join_tez2] (batchId=158)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_7] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[constprog_semijoin] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[correlationoptimizer1] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[correlationoptimizer2] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[correlationoptimizer6] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynpart_sort_opt_vectorization] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[explainuser_1] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[explainuser_4] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[filter_join_breaktask] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[groupby_groupingset_bug] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_1] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[limit_pushdown] (batchId=174)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[load_dyn_part1] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[materialized_view_create_rewrite_3] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[materialized_view_create_rewrite_rebuild_dummy] (batchId=163)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[mrr] (batchId=159)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[multiMapJoin2] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[orc_llap] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[orc_predicate_pushdown] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[parquet_predicate_pushdown] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[reopt_semijoin] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sample10] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sample10_mm] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin6] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin7] (batchId=156)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin] (batchId=161)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[skewjoin] (batchId=161)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[smb_mapjoin_14] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_exists] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_in] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_in_having] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_multi] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_notin] (batchId=174)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_scalar] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_select] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_views]
[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564004#comment-16564004 ]

Hive QA commented on HIVE-20260:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 7m 59s | master passed |
| +1 | compile | 1m 4s | master passed |
| +1 | checkstyle | 0m 38s | master passed |
| 0 | findbugs | 3m 55s | ql in master has 2306 extant Findbugs warnings. |
| +1 | javadoc | 0m 57s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 27s | the patch passed |
| +1 | compile | 1m 6s | the patch passed |
| +1 | javac | 1m 6s | the patch passed |
| -1 | checkstyle | 0m 39s | ql: The patch generated 3 new + 21 unchanged - 28 fixed = 24 total (was 49) |
| -1 | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| -1 | whitespace | 0m 0s | The patch 4 line(s) with tabs. |
| -1 | findbugs | 4m 10s | ql generated 7 new + 2299 unchanged - 7 fixed = 2306 total (was 2306) |
| +1 | javadoc | 1m 0s | the patch passed |
|| || || || Other Tests ||
| +1 | asflicense | 0m 13s | The patch does not generate ASF License warnings. |
| | | 23m 45s | |

|| Reason || Tests ||
| FindBugs | module:ql |
| | Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:[line 935] |
| | Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:[line 956] |
| | org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Byte(String) constructor; use Byte.valueOf(String) instead At StatsRulesProcFactory.java:inefficient new Byte(String) constructor; use Byte.valueOf(String) instead At StatsRulesProcFactory.java:[line 891] |
| | org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Integer(String) constructor; use Integer.valueOf(String) instead At StatsRulesProcFactory.java:inefficient new Integer(String) constructor; use Integer.valueOf(String) instead At StatsRulesProcFactory.java:[line 935] |
| | org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Long(String) constructor;
[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563292#comment-16563292 ]

Zoltan Haindrich commented on HIVE-20260:

I feel that treating all columns as uncorrelated makes more sense than treating them as fully correlated. I've written a test (stat_estimate_drill.q) which produced much worse results without this patch, especially in cases like the above example.

The current patch does something a little different from the above, but to be on the same page I would like to show the old behaviour as well:
{code}
create table t1 (a int, b int, c int);
-- insert 500 rows. ndv(a) = 100 ndv(b) = 50 ndv(c) = 200
select a,b,c from t1 where a = 20 and b = 30;
{code}

old (uniform scaling)
| | rowCount | ratio | ndv(a) | ndv(b) | ndv(c) |
| | 500 | | 100 | 50 | 200 |
| a=20 | 1/ndv(a) * rowCount = 5 | 1/ndv(a) = .01 | 1 | .5=>1 | 2 |
| a=20 && b=30 | 1/ndv(b) * rowCount = 5 | 1/ndv(b) = 1 | 1 | 1 | 2 |

The problem with the above is that it has lost the diversity of columns b and c.

patch
| | rowCount | ratio | ndv(a) | ndv(b) | ndv(c) |
| | 500 | | 100 | 50 | 200 |
| a=20 | 1/ndv(a) * rowCount = 5 | 1/ndv(a) = .01 | 1 | 50 * | 200 * |
| a=20 && b=30 | 1/ndv(b) * rowCount = .1 => 1 | 1/ndv(b) = .02 | 1 | 1 | 200 * |

I think it would make sense to limit the ndv to the row count at the places marked with *, since it's not possible to have that many distinct values anymore.

About the second note: the patch already handles ANDs more or less correctly by clearing the affected columns when it starts evaluating an And:
https://github.com/apache/hive/compare/master...kgyrtkirk:HIVE-20260-stat-ndv#diff-11eb46db88b11b0c0fe63fb1a919f174R350

I think it would be possible to introduce a slider to make this configurable, but I'm not sure anyone would want to change it. Right now I think it would be better to use the uncorrelated model.
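The two tables in the comment above can be reproduced with a small sketch. This is not Hive code: the function names `estimate_and_of_equalities` and `cap_ndv_to_rows` are hypothetical, and the sketch assumes that an equality predicate has selectivity 1/ndv(col), that row counts are clamped to at least 1, and that scaled NDVs are rounded up. Applying the cap once at the end is one possible reading of the "limit ndv to rowcount" suggestion, not necessarily what the patch does.

```python
import math


def estimate_and_of_equalities(row_count, ndv, filter_columns, uniform_scaling):
    """Estimate row count and NDVs after ANDing `col = const` predicates.

    Hypothetical helper for illustration only.
    uniform_scaling=True  mimics the "old (uniform scaling)" table.
    uniform_scaling=False mimics the "patch" table (uncorrelated columns).
    """
    rows = float(row_count)
    ndv = dict(ndv)  # work on a copy so the caller's stats stay untouched
    for col in filter_columns:
        divisor = ndv[col]               # selectivity of `col = const` is 1/ndv(col)
        rows = max(1.0, rows / divisor)  # never estimate fewer than one row
        for name in ndv:
            if name == col:
                ndv[name] = 1            # an equality leaves one distinct value
            elif uniform_scaling:
                # old behaviour: every other column shrinks by the same ratio
                ndv[name] = max(1, -(-ndv[name] // divisor))  # integer ceiling
    return rows, ndv


def cap_ndv_to_rows(rows, ndv):
    """Limit each NDV to the row count (assumed reading of the '*' remark)."""
    limit = max(1, math.ceil(rows))
    return {name: min(value, limit) for name, value in ndv.items()}


stats = {"a": 100, "b": 50, "c": 200}
old = estimate_and_of_equalities(500, stats, ["a", "b"], uniform_scaling=True)
new = estimate_and_of_equalities(500, stats, ["a", "b"], uniform_scaling=False)
print(old)  # row count and NDVs under uniform scaling
print(new)  # row count and NDVs under the uncorrelated model
```

With uniform scaling the second predicate no longer reduces the row count, because ndv(b) has already collapsed to 1. In the uncorrelated variant ndv(b) is still 50 when `b = 30` is evaluated, so the second predicate keeps its full selectivity.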
> NDV of a column shouldn't be scaled when row count is changed by filter on another column
>
> Key: HIVE-20260
> URL: https://issues.apache.org/jira/browse/HIVE-20260
> Project: Hive
> Issue Type: Improvement
> Components: Statistics
> Reporter: Ashutosh Chauhan
> Assignee: Zoltan Haindrich
> Priority: Major
> Attachments: HIVE-20260.01wip01.patch, HIVE-20260.01wip02.patch
>
>
> HIVE-17465 introduced progressive scaling of row counts in the presence of
> multiple filters. HIVE-19500 improved on that by also scaling col stats (NDV)
> in such scenarios. However, it should pay attention to the column used in the
> filter expression and not scale for all filters. E.g., consider the filter
> a = 1 and b = 2: the ndv of column b should not be scaled down by row count
> changes caused by a = 1.
> Another way to say this is that the ndv of a particular column should be
> updated at the end of the computation of the row count for that operator.
> Here are the possible cases where our estimates can be accurate (or close to it):
> {code}
> case 1 - (d_year = 2001 and d_moy=1)
> case 2 - (d_year = 2001 and d_year IN (2001, 2002))
> case 3 - (d_year = 2001 and d_moy = 1 and d_dom = 1)
> case 4 - (d_date IN ('1999-01-02', '1999-01-02'))
> case 5 - (d_date = '1999-01-01')
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562680#comment-16562680 ]

Ashutosh Chauhan commented on HIVE-20260:

I am not sure this mechanism of {{affectedColumns}} will help. IIUC, its effect is that column stats get updated only for columns involved in the filter expression. Columns that are part of the filter operator but absent from the expressions will have their column stats unchanged. If so, that is not what we want. What we want is to update stats for all columns of the filter operator, which is the case before this patch, but change the logic of how they are updated.

Before the patch, every condition of the filter results in a scale-down of col stats. That is, we assumed each filter condition independently filters out different rows. However, if we instead assume that different filter conditions filter overlapping rows, then we scale down the col stats of a column involved in a condition only by the row count decrease caused by that condition. For columns not involved in any condition, we can scale down col stats by the largest decrease caused by any one condition.

Perhaps an example will help. Let's say we have:
{code}
create table t1 (a int, b int, c int);
-- insert 500 rows. ndv(a) = 100 ndv(b) = 50 ndv(c) = 200
select a,b,c from t1 where a = 20 and b = 30;
{code}
Here, when we process a = 20: rowcount = 500 / 2 = 250. ndv(a) = 100 * (250/500) = 50, ndv(b) = 50, ndv(c) = 200; b and c's ndv are unchanged.
Then we process b = 30: rowcount = 250 / 2 = 125. ndv(b) = 50 * (125/250) = 25, ndv(a) = 50, ndv(c) = 200; a and c's ndv are unchanged.
For a and b we are done, since we have updated their column stats. For c (a column not included in any filter condition) we update with the largest factor change any condition brought; here that means 200 * (1/2) = 100.
The logic before this patch would have resulted in a change of (125/500) = 1/4 of the ndv for every column.
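The arithmetic in the example above can be checked with a short sketch. It is only an illustration of the proposal as worded in this comment, not Hive's implementation; the function names are hypothetical, and the per-condition selectivity of 1/2 is the illustrative value used in the comment rather than anything derived from statistics.

```python
from fractions import Fraction


def overlapping_rows_model(row_count, ndv, conditions):
    """conditions is a list of (column, selectivity) pairs, applied in order.

    A column used in a condition is scaled only by that condition's factor.
    Columns used in no condition are scaled once, at the end, by the
    strongest single reduction (the smallest selectivity).
    """
    rows = Fraction(row_count)
    stats = {col: Fraction(n) for col, n in ndv.items()}
    touched = set()
    strongest = Fraction(1)
    for col, selectivity in conditions:
        rows *= selectivity
        stats[col] *= selectivity
        touched.add(col)
        strongest = min(strongest, selectivity)
    for col in stats:
        if col not in touched:
            stats[col] *= strongest
    return rows, stats


def independent_rows_model(row_count, ndv, conditions):
    """Behaviour described as 'before this patch': every column is scaled
    by the overall row count change."""
    total = Fraction(1)
    for _, selectivity in conditions:
        total *= selectivity
    return row_count * total, {col: n * total for col, n in ndv.items()}


half = Fraction(1, 2)
ndv = {"a": 100, "b": 50, "c": 200}
print(overlapping_rows_model(500, ndv, [("a", half), ("b", half)]))
print(independent_rows_model(500, ndv, [("a", half), ("b", half)]))
```

The first call yields rowcount 125 with ndv(a) = 50, ndv(b) = 25, ndv(c) = 100, matching the walk-through; the second scales every ndv by 1/4.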
Apart from the above, a second issue is that this scaling happens twice: once when the filter expression is processed [1] and then again when the operator stats are updated [2]. That looks incorrect; we should perhaps remove one of these calls.

[1] : https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L355
[2] : https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L299
[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562264#comment-16562264 ]

Hive QA commented on HIVE-20260:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12933608/HIVE-20260.01wip02.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 59 failed/errored test(s), 14816 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[annotate_stats_deep_filters] (batchId=95)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[stat_estimate_drill] (batchId=13)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[stat_estimate_related_col] (batchId=42)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[explainuser_2] (batchId=154)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[unionDistinct_1] (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_join29] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_smb_mapjoin_14] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_sortmerge_join_10] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_sortmerge_join_9] (batchId=173)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucket_map_join_tez2] (batchId=158)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_7] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[constprog_semijoin] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynamic_semijoin_reduction_sw2] (batchId=176)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynpart_sort_opt_vectorization] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[explainuser_1] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[explainuser_4] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[filter_join_breaktask] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[groupby_groupingset_bug] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_1] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[load_dyn_part1] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[offset_limit_ppd_optimizer] (batchId=173)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[orc_llap] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[orc_predicate_pushdown] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[parquet_predicate_pushdown] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[reopt_semijoin] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[retry_failure_stat_changes] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sample10] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sample10_mm] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin6] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin7] (batchId=156)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin] (batchId=161)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[skewjoin] (batchId=161)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[smb_mapjoin_14] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_exists] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_in] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_in_having] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_multi] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_notin] (batchId=174)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_scalar] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_select] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_views] (batchId=159)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[tez_dynpart_hashjoin_2] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[tez_fixed_bucket_pruning] (batchId=176)
[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562214#comment-16562214 ]

Hive QA commented on HIVE-20260:

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 1s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 35s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 46s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 4m 41s{color} | {color:blue} ql in master has 2297 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 7s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 42s{color} | {color:red} ql: The patch generated 3 new + 21 unchanged - 28 fixed = 24 total (was 49) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch 4 line(s) with tabs. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 4m 37s{color} | {color:red} ql generated 7 new + 2290 unchanged - 7 fixed = 2297 total (was 2297) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 13s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 26m 20s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:ql |
| | Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:[line 927] |
| | Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:[line 948] |
| | org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Byte(String) constructor; use Byte.valueOf(String) instead At StatsRulesProcFactory.java:inefficient new Byte(String) constructor; use Byte.valueOf(String) instead At StatsRulesProcFactory.java:[line 883] |
| | org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Integer(String) constructor; use Integer.valueOf(String) instead At StatsRulesProcFactory.java:inefficient new Integer(String) constructor; use Integer.valueOf(String) instead At StatsRulesProcFactory.java:[line 927] |
| | org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Long(String) constructor;
[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562158#comment-16562158 ]

Hive QA commented on HIVE-20260:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12933606/HIVE-20260.01wip01.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 59 failed/errored test(s), 14816 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[annotate_stats_deep_filters] (batchId=95)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[stat_estimate_drill] (batchId=13)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[stat_estimate_related_col] (batchId=42)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[explainuser_2] (batchId=154)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[unionDistinct_1] (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_join29] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_smb_mapjoin_14] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_sortmerge_join_10] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[auto_sortmerge_join_9] (batchId=173)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucket_map_join_tez2] (batchId=158)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_7] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[constprog_semijoin] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynamic_semijoin_reduction_sw2] (batchId=176)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynpart_sort_opt_vectorization] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[explainuser_1] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[explainuser_4] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[filter_join_breaktask] (batchId=175)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[groupby_groupingset_bug] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_1] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[load_dyn_part1] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[offset_limit_ppd_optimizer] (batchId=173)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[orc_llap] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[orc_predicate_pushdown] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[parquet_predicate_pushdown] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[reopt_semijoin] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[retry_failure_stat_changes] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sample10] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sample10_mm] (batchId=170)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin6] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin7] (batchId=156)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[semijoin] (batchId=161)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[skewjoin] (batchId=161)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[smb_mapjoin_14] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_exists] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_in] (batchId=172)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_in_having] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_multi] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_notin] (batchId=174)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_scalar] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_select] (batchId=166)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_views] (batchId=159)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[tez_dynpart_hashjoin_2] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[tez_fixed_bucket_pruning] (batchId=176)
[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562108#comment-16562108 ]

Hive QA commented on HIVE-20260:

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 33s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 41s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 4m 7s{color} | {color:blue} ql in master has 2297 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 0s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 8s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 41s{color} | {color:red} ql: The patch generated 3 new + 21 unchanged - 28 fixed = 24 total (was 49) {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch 4 line(s) with tabs. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 4m 20s{color} | {color:red} ql generated 7 new + 2290 unchanged - 7 fixed = 2297 total (was 2297) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 13s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 24m 57s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:ql |
| | Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:[line 927] |
| | Boxing/unboxing to parse a primitive org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) At StatsRulesProcFactory.java:[line 948] |
| | org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Byte(String) constructor; use Byte.valueOf(String) instead At StatsRulesProcFactory.java:inefficient new Byte(String) constructor; use Byte.valueOf(String) instead At StatsRulesProcFactory.java:[line 883] |
| | org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Integer(String) constructor; use Integer.valueOf(String) instead At StatsRulesProcFactory.java:inefficient new Integer(String) constructor; use Integer.valueOf(String) instead At StatsRulesProcFactory.java:[line 927] |
| | org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$FilterStatsRule.evaluateComparator(Statistics, AnnotateStatsProcCtx, ExprNodeGenericFuncDesc, long) invokes inefficient new Long(String) constructor;
[jira] [Commented] (HIVE-20260) NDV of a column shouldn't be scaled when row count is changed by filter on another column
[ https://issues.apache.org/jira/browse/HIVE-20260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16562086#comment-16562086 ]

Zoltan Haindrich commented on HIVE-20260:

I feel that this logic might need to be rethought at some point, basing the calculation more on the column stats; however, I'm afraid that won't be possible, since they might not be available all the time.

I've introduced some logic to keep track of the affected columns; this makes it much better. However, a full test run is needed to see if it causes any trouble for other queries.

https://reviews.apache.org/r/68109/