[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate

2021-09-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420630#comment-17420630
 ] 

ASF subversion and git services commented on IMPALA-7560:
-

Commit 8862719d87ac5dc214985025463f002d41b15672 in impala's branch 
refs/heads/branch-4.0.1 from liuyao
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=8862719 ]

IMPALA-7560: Set selectivity of Not-equal

Calculate binary predicate selectivity if one of the children is
a slotref and the other children are all constant.
eg. something like "col != 5", but not "2 * col != 10"

selectivity = 1 - 1/ndv

Testing:
Modify the function testNeSelectivity() of the
ExprCardinalityTest.java, change -1 to the correct value.

Change-Id: Icd6f5945840ea2a8194d72aa440ddfa6915cbb3a
Reviewed-on: http://gerrit.cloudera.org:8080/17344
Reviewed-by: Qifan Chen 
Tested-by: Impala Public Jenkins 
Reviewed-by: Zoltan Borok-Nagy 


> Better selectivity estimate for != (not equals) binary predicate
> 
>
>                 Key: IMPALA-7560
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7560
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.12.0, Impala 2.13.0
>            Reporter: Bharath Vissapragada
>            Assignee: liuyao
>            Priority: Major
>             Fix For: Impala 4.1.0
>
>
> Currently we use the default selectivity estimate for any binary predicate 
> with an op other than EQ / NOT_DISTINCT.
> {noformat}
> // Determine selectivity
> // TODO: Compute selectivity for nested predicates.
> // TODO: Improve estimation using histograms.
> Reference<SlotRef> slotRefRef = new Reference<SlotRef>();
> if ((op_ == Operator.EQ || op_ == Operator.NOT_DISTINCT)
>     && isSingleColumnPredicate(slotRefRef, null)) {
>   long distinctValues = slotRefRef.getRef().getNumDistinctValues();
>   if (distinctValues > 0) {
>     selectivity_ = 1.0 / distinctValues;
>     selectivity_ = Math.max(0, Math.min(1, selectivity_));
>   }
> }
> {noformat}
> This can give very conservative estimates. For example:
> {noformat}
> [localhost:21000] tpch> select * from nation where n_regionkey != 1;
> [localhost:21000] tpch> summary;
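> +--------------+--------+----------+----------+---------+--------------+-----------+---------------+-------------+
> | Operator     | #Hosts | Avg Time | Max Time | *#Rows* | *Est. #Rows* | Peak Mem  | Est. Peak Mem | Detail      |
> +--------------+--------+----------+----------+---------+--------------+-----------+---------------+-------------+
> | 00:SCAN HDFS | 1      | 3.32ms   | 3.32ms   | *20*    | *3*          | 143.00 KB | 16.00 MB      | tpch.nation |
> +--------------+--------+----------+----------+---------+--------------+-----------+---------------+-------------+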
> [localhost:21000] tpch> 
> {noformat}
> Ideally we could have inverted the selectivity to 4/5 (= 1 - 1/5), which 
> would give a better estimate.
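
The arithmetic in the description can be sketched in Java as follows. This is only an illustration, not Impala's code: the class, enum, and method names are hypothetical, the 0.1 fallback is the "classic" default mentioned elsewhere in this thread, and the row count of 25 for tpch.nation is an assumption (the standard TPC-H size).

```java
// Sketch only: all names here are hypothetical, not Impala's API.
final class NeSelectivitySketch {
  enum Op { EQ, NE }

  // Fallback used when no NDV-based estimate is possible.
  static final double DEFAULT_SELECTIVITY = 0.1;

  // Assumes rows are spread uniformly over the ndv distinct values.
  static double estimate(Op op, long ndv) {
    if (ndv <= 0) return DEFAULT_SELECTIVITY;  // column stats missing
    double eq = 1.0 / ndv;
    double sel = (op == Op.EQ) ? eq : 1.0 - eq;
    return Math.max(0.0, Math.min(1.0, sel));
  }

  public static void main(String[] args) {
    long rows = 25;  // assumed row count of tpch.nation
    long ndv = 5;    // distinct n_regionkey values, per the 1/5 above
    // Default estimate: 25 * 0.1 = 2.5, rounded to 3 (the Est. #Rows shown above).
    System.out.println(Math.round(rows * DEFAULT_SELECTIVITY));
    // NDV-based estimate: 25 * 4/5 = 20 (the actual #Rows shown above).
    System.out.println(Math.round(rows * estimate(Op.NE, ndv)));
  }
}
```

Under the uniform-distribution assumption, the NDV-based estimate lands on the observed row count for this query. Skewed columns will not behave as nicely.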



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate

2021-06-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368908#comment-17368908
 ] 

ASF subversion and git services commented on IMPALA-7560:
-

Commit 2aa199cc0dbb7c3b59016c855e24f0842d6b261b in impala's branch 
refs/heads/master from liuyao
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=2aa199c ]

IMPALA-7560: Set selectivity of Not-equal

Calculate binary predicate selectivity if one of the children is
a slotref and the other children are all constant.
eg. something like "col != 5", but not "2 * col != 10"

selectivity = 1 - 1/ndv

Testing:
Modify the function testNeSelectivity() of the
ExprCardinalityTest.java, change -1 to the correct value.

Change-Id: Icd6f5945840ea2a8194d72aa440ddfa6915cbb3a
Reviewed-on: http://gerrit.cloudera.org:8080/17344
Reviewed-by: Qifan Chen 
Tested-by: Impala Public Jenkins 
Reviewed-by: Zoltan Borok-Nagy 








[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate

2021-06-22 Thread liuyao (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367827#comment-17367827
 ] 

liuyao commented on IMPALA-7560:


 
Calculation formula:
( 1 - 1 / ndv ) * "non-null values cardinality" / "all values cardinality"
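
A minimal Java sketch of this formula is shown below. The extra factor reflects that `col != 5` evaluates to NULL for rows where `col` is NULL, so those rows never pass the filter. The class and method names are hypothetical, and returning -1 for unusable stats is an assumption based on the "-1" mentioned in the commit message; it also assumes `ndv` counts only non-null distinct values.

```java
// Sketch only: names and the -1 "unknown" convention are assumptions.
final class NullAwareNeSketch {
  // (1 - 1/ndv) * nonNullRows / totalRows, or -1 when the stats are unusable.
  static double estimate(long ndv, long nonNullRows, long totalRows) {
    if (ndv <= 0 || totalRows <= 0 || nonNullRows < 0 || nonNullRows > totalRows) {
      return -1;
    }
    double nonNullFraction = (double) nonNullRows / totalRows;
    return (1.0 - 1.0 / ndv) * nonNullFraction;
  }

  public static void main(String[] args) {
    // 100 rows, 10 of them NULL, 10 distinct non-null values:
    // (1 - 1/10) * 90/100, roughly 0.81.
    System.out.println(estimate(10, 90, 100));
  }
}
```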







[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate

2021-06-11 Thread liuyao (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361434#comment-17361434
 ] 

liuyao commented on IMPALA-7560:


https://gerrit.cloudera.org/#/c/17344/







[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate

2018-09-21 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16624019#comment-16624019
 ] 

Paul Rogers commented on IMPALA-7560:
-

Created a unit test for this.
{noformat}
runTest("SELECT id FROM functional.alltypes WHERE int_col = 1;", 664);
runTest("SELECT id FROM functional.alltypes WHERE int_col != 1", 730);
{noformat}
Each call checks that the query in the first argument produces the estimated 
cardinality given in the second argument. The numbers shown above are those 
that Impala computes today.

The NDV for {{int_col}} is 11 (10 from data + 1 for null after recent fix). The 
row count is 7300.

The {{!=}} expression uses the classic 0.1 selectivity estimate. However, as 
this ticket suggests, if we know the NDV, then a better estimate is based on 1 
- 1/NDV.

By the way, we use the same estimate for inequality:
{noformat}
runTest("SELECT id FROM functional.alltypes WHERE int_col > 1", 730);
{noformat}
IMPALA-7601 describes how other engines use different guesses for inequality: 
0.3 is a common estimate. (NDV is not helpful for inequalities.)
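
The numbers above can be reproduced with a few lines of Java. This is a sketch of the arithmetic only; the class and helper names are hypothetical, and rounding to the nearest row is an assumption that happens to match the values in the test.

```java
// Sketch only: reproduces the cardinalities quoted in the unit test.
final class AlltypesCardinalitySketch {
  // Estimated output rows for a filter with the given selectivity.
  static long cardinality(long rows, double selectivity) {
    return Math.round(rows * selectivity);
  }

  public static void main(String[] args) {
    long rows = 7300;  // row count of functional.alltypes
    long ndv = 11;     // NDV of int_col (10 values + 1 for null)

    System.out.println(cardinality(rows, 1.0 / ndv));        // int_col = 1  -> 664
    System.out.println(cardinality(rows, 0.1));              // int_col != 1 -> 730 (default)
    System.out.println(cardinality(rows, 1.0 - 1.0 / ndv));  // int_col != 1 -> 6636 (proposed)
  }
}
```

With 1 - 1/NDV, the estimate for `int_col != 1` moves from 730 to 6636 rows, which is the complement of the 664 rows estimated for `int_col = 1`.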







[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate

2018-09-19 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620946#comment-16620946
 ] 

Paul Rogers commented on IMPALA-7560:
-

The table in DRILL-5254 suggests how to use the NDV value to compute other 
selectivity (reduction) factors such as for IN, <=, >= and so on.

It is hard to compute < and > without histograms. Even = and != are hard since, 
if a column has 5 distinct values, it is not clear whether each accounts for 20% 
of the data, or one accounts for 96% and the others for 1% each.

For example, consider HTTP status codes in a web log. There may be, say, 10 
distinct codes, but code 200 accounts for the vast majority of records (on a 
healthy server).
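
To illustrate the skew problem, the following Java sketch compares the NDV-only estimate for `status != v` with the fraction actually observed in a made-up frequency table. The class name, method names, and all counts are hypothetical and chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: the status-code counts below are invented for illustration.
final class SkewSketch {
  // NDV-only estimate for "col != v"; assumes a non-empty, uniform column.
  static double uniformNe(Map<Integer, Long> counts) {
    return 1.0 - 1.0 / counts.size();
  }

  // Fraction of rows that actually satisfy "col != value".
  static double actualNe(Map<Integer, Long> counts, int value) {
    long total = 0;
    for (long c : counts.values()) total += c;
    long matching = total - counts.getOrDefault(value, 0L);
    return (double) matching / total;
  }

  public static void main(String[] args) {
    Map<Integer, Long> counts = new LinkedHashMap<>();
    counts.put(200, 9_100L);  // dominant code
    for (int code : new int[] {301, 302, 304, 400, 401, 403, 404, 500, 503}) {
      counts.put(code, 100L);  // nine rare codes
    }
    System.out.println(uniformNe(counts));      // same estimate for every value
    System.out.println(actualNe(counts, 200));  // far below the estimate
    System.out.println(actualNe(counts, 404));  // above the estimate
  }
}
```

With these counts, the NDV-only estimate is 0.9 for every value, while the observed fraction is 0.09 for `status != 200` and 0.99 for `status != 404`. Histograms or most-common-value stats would be needed to tell these cases apart.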







[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate

2018-09-18 Thread bharath v (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620017#comment-16620017
 ] 

bharath v commented on IMPALA-7560:
---

Agreed [~Paul.Rogers]. Impala tables can include stats (NDVs computed by 
users), so we can use similar math here to compute <>'s selectivity. 







[jira] [Commented] (IMPALA-7560) Better selectivity estimate for != (not equals) binary predicate

2018-09-18 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619827#comment-16619827
 ] 

Paul Rogers commented on IMPALA-7560:
-

Turns out that Apache Drill did a similar analysis to work out rules based on 
the classic defaults plus some reasoning about probability: DRILL-5254

For Drill, since only the "classic" estimates (not stats) are available, the 
probabilities don't work out because of the conditional probability of a user 
using one operator vs. another. But the mathematical reasoning might be used 
here if we do have stats to work with.




