[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate
[ https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420630#comment-17420630 ]

ASF subversion and git services commented on IMPALA-7560:
---------------------------------------------------------

Commit 8862719d87ac5dc214985025463f002d41b15672 in impala's branch refs/heads/branch-4.0.1 from liuyao
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8862719 ]

IMPALA-7560: Set selectivity of Not-equal

Calculate binary predicate selectivity if one of the children is a slotref
and the other children are all constant, e.g., something like "col != 5",
but not "2 * col != 10".

selectivity = 1 - 1/ndv

Testing:
Modify the function testNeSelectivity() of the ExprCardinalityTest.java,
change -1 to the correct value.

Change-Id: Icd6f5945840ea2a8194d72aa440ddfa6915cbb3a
Reviewed-on: http://gerrit.cloudera.org:8080/17344
Reviewed-by: Qifan Chen
Tested-by: Impala Public Jenkins
Reviewed-by: Zoltan Borok-Nagy

> Better selectivity estimate for != (not equals) binary predicate
> ----------------------------------------------------------------
>
>                 Key: IMPALA-7560
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7560
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.12.0, Impala 2.13.0
>            Reporter: Bharath Vissapragada
>            Assignee: liuyao
>            Priority: Major
>             Fix For: Impala 4.1.0
>
>
> Currently we use the default selectivity estimate for any binary predicate
> with op other than EQ / NOT_DISTINCT.
> {noformat}
>     // Determine selectivity
>     // TODO: Compute selectivity for nested predicates.
>     // TODO: Improve estimation using histograms.
>     Reference<SlotRef> slotRefRef = new Reference<SlotRef>();
>     if ((op_ == Operator.EQ || op_ == Operator.NOT_DISTINCT)
>         && isSingleColumnPredicate(slotRefRef, null)) {
>       long distinctValues = slotRefRef.getRef().getNumDistinctValues();
>       if (distinctValues > 0) {
>         selectivity_ = 1.0 / distinctValues;
>         selectivity_ = Math.max(0, Math.min(1, selectivity_));
>       }
>     }
> {noformat}
> This can give very conservative estimates. For example:
> {noformat}
> [localhost:21000] tpch> select * from nation where n_regionkey != 1;
> [localhost:21000] tpch> summary;
> +--------------+--------+----------+----------+---------+--------------+-----------+---------------+-------------+
> | Operator     | #Hosts | Avg Time | Max Time | *#Rows* | *Est. #Rows* | Peak Mem  | Est. Peak Mem | Detail      |
> +--------------+--------+----------+----------+---------+--------------+-----------+---------------+-------------+
> | 00:SCAN HDFS | 1      | 3.32ms   | 3.32ms   | *20*    | *3*          | 143.00 KB | 16.00 MB      | tpch.nation |
> +--------------+--------+----------+----------+---------+--------------+-----------+---------------+-------------+
> [localhost:21000] tpch>
> {noformat}
> Ideally we could have inverted the selectivity to 4/5 (= 1 - 1/5), which
> would give a better estimate.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
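
The rule stated in the commit message above (selectivity = 1 - 1/ndv for a predicate such as "col != 5") can be illustrated with a minimal Java sketch. This is not the code from the commit: the class name, the method names, and the UNKNOWN sentinel are hypothetical and chosen only for this example. It assumes values are uniformly distributed over the distinct values and treats a non-positive NDV as "no estimate".

```java
// Illustrative sketch only; class and method names are hypothetical,
// not Impala's actual API.
final class NotEqualSelectivity {
  // Sentinel for "no estimate available" (an assumption of this sketch).
  static final double UNKNOWN = -1.0;

  // Selectivity of "col = const" assuming a uniform value distribution.
  static double eq(long ndv) {
    if (ndv <= 0) return UNKNOWN;
    return Math.max(0.0, Math.min(1.0, 1.0 / ndv));
  }

  // Selectivity of "col != const": the complement of the equality case.
  static double ne(long ndv) {
    double eqSel = eq(ndv);
    return eqSel == UNKNOWN ? UNKNOWN : 1.0 - eqSel;
  }

  // Estimated output rows for a scan over tableRows input rows.
  static long estimateRows(long tableRows, double selectivity) {
    return Math.round(tableRows * selectivity);
  }

  public static void main(String[] args) {
    // tpch.nation has 25 rows; n_regionkey has 5 distinct values.
    System.out.println(estimateRows(25, ne(5))); // prints 20
  }
}
```

With the numbers from the issue description, the sketch yields 20 estimated rows for "n_regionkey != 1", which matches the observed #Rows in the summary output rather than the estimate of 3.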
[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate
[ https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368908#comment-17368908 ]

ASF subversion and git services commented on IMPALA-7560:
---------------------------------------------------------

Commit 2aa199cc0dbb7c3b59016c855e24f0842d6b261b in impala's branch refs/heads/master from liuyao
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=2aa199c ]

IMPALA-7560: Set selectivity of Not-equal

Calculate binary predicate selectivity if one of the children is a slotref
and the other children are all constant, e.g., something like "col != 5",
but not "2 * col != 10".

selectivity = 1 - 1/ndv

Testing:
Modify the function testNeSelectivity() of the ExprCardinalityTest.java,
change -1 to the correct value.

Change-Id: Icd6f5945840ea2a8194d72aa440ddfa6915cbb3a
Reviewed-on: http://gerrit.cloudera.org:8080/17344
Reviewed-by: Qifan Chen
Tested-by: Impala Public Jenkins
Reviewed-by: Zoltan Borok-Nagy

> Better selectivity estimate for != (not equals) binary predicate
> ----------------------------------------------------------------
>
>                 Key: IMPALA-7560
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7560
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.12.0, Impala 2.13.0
>            Reporter: Bharath Vissapragada
>            Assignee: liuyao
>            Priority: Major
>             Fix For: Impala 4.0
[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate
[ https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367827#comment-17367827 ]

liuyao commented on IMPALA-7560:
--------------------------------

Calculation formula:
( 1 - 1 / ndv ) * "non-null values cardinality" / "all values cardinality"

> Better selectivity estimate for != (not equals) binary predicate
> ----------------------------------------------------------------
>
>                 Key: IMPALA-7560
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7560
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.12.0, Impala 2.13.0
>            Reporter: Bharath Vissapragada
>            Assignee: liuyao
>            Priority: Major
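
The formula in the comment above scales 1 - 1/ndv by the fraction of non-null rows. The reason is standard SQL semantics: "col != 5" evaluates to NULL, not true, when col is NULL, so those rows never pass the filter. A minimal Java sketch of the formula follows. The class name, parameter names, and the -1.0 "unknown" return value are hypothetical, and the sketch assumes that ndv counts only non-null distinct values.

```java
// Illustrative sketch only; names are hypothetical, not Impala's API.
final class NullAwareNeSelectivity {
  // Returned when the inputs do not allow an estimate (assumption of this sketch).
  static final double UNKNOWN = -1.0;

  // (1 - 1/ndv) * nonNullRows / numRows, where nonNullRows = numRows - numNulls.
  // Assumes ndv counts non-null distinct values only.
  static double ne(long ndv, long numRows, long numNulls) {
    if (ndv <= 0 || numRows <= 0 || numNulls < 0 || numNulls > numRows) {
      return UNKNOWN;
    }
    double nonNullFraction = (double) (numRows - numNulls) / numRows;
    return (1.0 - 1.0 / ndv) * nonNullFraction;
  }

  public static void main(String[] args) {
    // 1000 rows, 200 of them NULL, 4 distinct non-null values:
    // (1 - 1/4) * 800/1000, roughly 0.6
    System.out.println(ne(4, 1000, 200));
  }
}
```

When the column has no nulls the second factor is 1 and the expression reduces to 1 - 1/ndv.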
[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate
[ https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361434#comment-17361434 ]

liuyao commented on IMPALA-7560:
--------------------------------

https://gerrit.cloudera.org/#/c/17344/

> Better selectivity estimate for != (not equals) binary predicate
> ----------------------------------------------------------------
>
>                 Key: IMPALA-7560
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7560
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.12.0, Impala 2.13.0
>            Reporter: Bharath Vissapragada
>            Assignee: liuyao
>            Priority: Major
[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate
[ https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624019#comment-16624019 ]

Paul Rogers commented on IMPALA-7560:
-------------------------------------

Created a unit test for this.

{noformat}
runTest("SELECT id FROM functional.alltypes WHERE int_col = 1;", 664);
runTest("SELECT id FROM functional.alltypes WHERE int_col != 1", 730);
{noformat}

Each call says: given the query in the first argument, verify that the estimated cardinality equals the second argument. The numbers shown above are those that Impala computes today.

The NDV for {{int_col}} is 11 (10 from data + 1 for null after a recent fix). The row count is 7300. The {{!=}} expression uses the classic 0.1 selectivity estimate. However, as this ticket suggests, if we know the NDV, then a better estimate is based on 1 - 1/NDV.

By the way, we use the same 0.1 estimate for inequality:

{noformat}
runTest("SELECT id FROM functional.alltypes WHERE int_col > 1", 730);
{noformat}

IMPALA-7601 describes how other engines use different guesses for inequality: 0.3 is a common estimate. (NDV is not helpful for inequalities.)

> Better selectivity estimate for != (not equals) binary predicate
> ----------------------------------------------------------------
>
>                 Key: IMPALA-7560
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7560
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.12.0, Impala 2.13.0
>            Reporter: bharath v
>            Priority: Major
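
The arithmetic behind the test numbers above can be reproduced with a short Java sketch. It uses only the figures quoted in the comment (7300 rows, NDV 11, default selectivity 0.1). The class and method names are hypothetical, and rounding to the nearest row is an assumption of this sketch; the figure an actual planner reports after the change may differ, for example if nulls are handled separately.

```java
// Illustrative sketch only; names are hypothetical, not Impala's API.
final class AlltypesEstimate {
  // The "classic" default selectivity mentioned in the comment.
  static final double DEFAULT_SELECTIVITY = 0.1;

  // Estimated rows, rounded to the nearest whole row (an assumption).
  static long rows(long tableRows, double selectivity) {
    return Math.round(tableRows * selectivity);
  }

  public static void main(String[] args) {
    long tableRows = 7300;
    long ndv = 11;
    System.out.println(rows(tableRows, 1.0 / ndv));           // int_col = 1, 1/NDV: 664
    System.out.println(rows(tableRows, DEFAULT_SELECTIVITY)); // int_col != 1, default: 730
    System.out.println(rows(tableRows, 1.0 - 1.0 / ndv));     // int_col != 1, 1 - 1/NDV: 6636
  }
}
```

The first two results match the values in the runTest calls; the third shows that the 1 - 1/NDV rule would raise the != estimate from 730 to roughly 6636 rows.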
[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate
[ https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620946#comment-16620946 ]

Paul Rogers commented on IMPALA-7560:
-------------------------------------

The table in DRILL-5254 suggests how to use the NDV value to compute other selectivity (reduction) factors such as for IN, <=, >= and so on. It is hard to compute < and > without histograms.

Even = and != are hard since, if there are 5 NDVs, it is not clear if they are all 20% of the data, or one is 96% and the others are 1% each. For example, consider HTTP status codes in a web log. There may be, say, 10 distinct codes, but code 200 accounts for the vast majority of records (in a healthy server).

> Better selectivity estimate for != (not equals) binary predicate
> ----------------------------------------------------------------
>
>                 Key: IMPALA-7560
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7560
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.12.0, Impala 2.13.0
>            Reporter: bharath v
>            Priority: Major
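
The skew concern raised above can be made concrete with a small Java sketch that compares the uniform estimate 1 - 1/NDV against the true selectivity of "col != v" for each value v. The class and method names are hypothetical, and the two distributions are the ones from the comment (five values at 20% each, versus one value at 96% and four at 1%).

```java
// Illustrative sketch only; names are hypothetical, not Impala's API.
final class SkewError {
  // Uniform-distribution estimate for "col != v".
  static double uniformNe(int ndv) {
    return 1.0 - 1.0 / ndv;
  }

  // Largest |actual - estimated| selectivity of "col != v" over all values v,
  // where valueFractions[i] is the share of rows holding the i-th value.
  static double worstCaseError(double[] valueFractions) {
    double estimate = uniformNe(valueFractions.length);
    double worst = 0.0;
    for (double fraction : valueFractions) {
      double actual = 1.0 - fraction; // rows that do NOT hold this value
      worst = Math.max(worst, Math.abs(actual - estimate));
    }
    return worst;
  }

  public static void main(String[] args) {
    double[] uniform = {0.2, 0.2, 0.2, 0.2, 0.2};
    double[] skewed = {0.96, 0.01, 0.01, 0.01, 0.01};
    System.out.println(worstCaseError(uniform));
    System.out.println(worstCaseError(skewed));
  }
}
```

For the uniform case the estimate is exact. For the skewed case, excluding the dominant value leaves only 4% of the rows while the estimate says 80%, an error of about 0.76, which is the kind of gap that histograms would be needed to close.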
[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate
[ https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620017#comment-16620017 ]

bharath v commented on IMPALA-7560:
-----------------------------------

Agreed [~Paul.Rogers]. Impala tables can include stats (NDVs computed by users), so we can use similar math here to compute <>'s selectivity.

> Better selectivity estimate for != (not equals) binary predicate
> ----------------------------------------------------------------
>
>                 Key: IMPALA-7560
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7560
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.12.0, Impala 2.13.0
>            Reporter: bharath v
>            Priority: Major
[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate
[ https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619827#comment-16619827 ]

Paul Rogers commented on IMPALA-7560:
-------------------------------------

Turns out that Apache Drill did a similar analysis to work out rules based on the classic defaults plus some reasoning about probability: DRILL-5254

For Drill, since only the "classic" estimates (not stats) are available, the probabilities don't work out because of the conditional probability of a user using one operator vs. another. But the math reasoning might be used here if we do have stats to work with.

> Better selectivity estimate for != (not equals) binary predicate
> ----------------------------------------------------------------
>
>                 Key: IMPALA-7560
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7560
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.12.0, Impala 2.13.0
>            Reporter: bharath v
>            Priority: Major