Paul Rogers created IMPALA-7601:
-----------------------------------

             Summary: Define a-priori selectivity and NDV values
                 Key: IMPALA-7601
                 URL: https://issues.apache.org/jira/browse/IMPALA-7601
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 2.12.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers


Impala makes extensive use of table stats during query planning. For example, 
the NDV (number of distinct values) is used to compute _selectivity_, the 
degree of reduction (also called the _reduction factor_) provided by a 
predicate. For example:

{noformat}
SELECT * FROM t WHERE t.a = 10
{noformat}

If we know that {{t.a}} has an NDV=100, then we can predict (given a uniform 
distribution of values), that the above query will pick out one of these 100 
values, and that the reduction factor is 1/100 = 0.01. Thus the selectivity of 
the predicate {{t.a = 10}} is 0.01.

h4. Selectivity Without Stats

All this is good. But, what happens if statistics are not available for table 
{{t}}? How are we to know the selectivity of the predicate?

It could be that {{t.a}} contains nothing but the value 10, so there is no 
reduction at all. It could be that {{t.a}} contains no value of 10, so the
reduction is total and no rows are returned at all.
data. The classic solution is to assume that the user put the predicate in the 
query for the purpose of subsetting the data. The classic value, shown in the 
[Ramakrishnan and Gehrke 
book|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html], 
is to assume a 90% reduction, or a selectivity of 0.1. Indeed, this value is 
seen in
[Impala|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/Expr.java]:

{noformat}
  // To be used where we cannot come up with a
  // better estimate (selectivity_ is -1).
  public static double DEFAULT_SELECTIVITY = 0.1;
{noformat}

As it turns out, however, the actual implementation is a bit more complex, as 
hinted at by the above comment. Impala relies on stats. Given stats, 
specifically the NDV, we compute selectivity as:

{noformat}
selectivity = 1 / ndv
{noformat}

What happens if there is no available NDV? In that case, we skip the 
selectivity calculation and leave it at a special default value of -1.0, which 
seems to
indicate "unknown". See 
[{{BinaryPredicate.analyzeImpl()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java].

Later, when we use the selectivity to calculate reduction factors, we simply 
skip any node with a selectivity of -1. You can see that in 
[{{PlanNode.computeCombinedSelectivity()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/PlanNode.java].

The result is that Impala is a bit more strict than classic DB optimizers. If 
stats are present, they are used. If stats are not present, Impala assumes that 
predicates have no effect.
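In pseudo-code, the current behavior boils down to the following sketch (a 
paraphrase for illustration only, not the actual Impala code):

{noformat}
// Sketch of current behavior (illustrative, not the actual Impala code).
// BinaryPredicate.analyzeImpl() sets selectivity only when an NDV exists;
// PlanNode.computeCombinedSelectivity() later skips any -1 value.
static double currentEqualitySelectivity(long ndv) {
  if (ndv > 0) return 1.0 / ndv;  // stats available: selectivity = 1 / NDV
  return -1.0;                    // no stats: "unknown"; predicate is ignored
}
{noformat}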

h4. Proposal: Add A-priori Selectivity Values

But, we said earlier that users include a predicate because they expect it to 
do something. So, we are actually discarding the (albeit vague) information 
that the user provided.

This is why many optimizers go ahead and assume a default 0.1 reduction factor 
for equality predicates even if no stats are available. The first proposal of 
this ticket is to use that default reduction factor even if no stats are 
present. This says that some reduction will occur, but, to be conservative, we 
assume not a huge reduction.
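In terms of the sketch above, the change is just the fallback value (again, 
illustrative only; {{DEFAULT_SELECTIVITY}} is the existing constant quoted 
earlier):

{noformat}
// Proposal 1 (sketch): fall back to the existing constant instead of -1.
static double equalitySelectivity(long ndv) {
  if (ndv > 0) return 1.0 / ndv;      // stats available: use them
  return Expr.DEFAULT_SELECTIVITY;    // no stats: assume a modest 0.1 reduction
}
{noformat}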

h4. Proposal: Add Selectivity for All Predicate Operators

At present, Impala computes reduction factors only for equality predicates. (See
IMPALA-7560.) The book suggests rule-of-thumb estimates for other operators:

* {{!=}} - 0.1
* {{<}}, {{<=}}, {{>}}, {{>=}} - 0.3
* {{BETWEEN}} - 0.25

Over in the Drill project, DRILL-5254 attempted to work out better estimates 
based on math and probability. However, the conclusion there was that, without 
NDV and histograms, there is more information in the user's intent than in the 
math. That is, if the user writes {{WHERE t.a \!= 10}}, the very act of writing 
the predicate suggests the user believes it to be highly restrictive, 
especially on big data. So, the reduction factor (which is a probability) is
the same for {{=}} and {{!=}} in the absence of information. The same reasoning 
probably led to the rule-of-thumb values in the 
[R&G|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
 book.

So, the second proposal is that Impala use the classic numbers for other 
operators when no stats are available.
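As a sketch, the rule-of-thumb values amount to a small per-operator table (the 
string-keyed helper is illustrative; Impala's actual operator representation 
differs):

{noformat}
// Proposal 2 (sketch): classic a-priori selectivities, used only without stats.
static double defaultSelectivity(String op) {
  switch (op) {
    case "=":
    case "!=":      return 0.1;   // same as =: user intent implies restriction
    case "<": case "<=":
    case ">": case ">=":
                    return 0.3;
    case "BETWEEN": return 0.25;
    default:        return 0.1;   // conservative fallback
  }
}
{noformat}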

h4. Proposal: Use Stats-Based Selectivity Estimates When Available

If stats are available, then we can "run the numbers" and get better estimates:

* {{p(a = x)}} = 1 / NDV
* {{p(a != x)}} = {{1 - p(a = x)}} = 1 - 1 / NDV

So, the third proposal is to use the above math for the {{!=}} case.
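In sketch form (illustrative only):

{noformat}
// Proposal 3 (sketch): with stats, != is the complement of =.
// Example: NDV = 100 gives p(a = x) = 0.01 and p(a != x) = 0.99.
static double notEqualsSelectivity(long ndv) {
  return 1.0 - 1.0 / ndv;
}
{noformat}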

h4. Proposal: Use Defaults when Stats are Unavailable or Not Useful

As it turns out, without histograms, the NDV by itself gives us no information 
by which to compute reduction for an inequality. So the classic a-priori 
defaults are still the best guess.

This leads to the fourth proposal: that the classic defaults are used unless 
stats provide additional information to use instead. Said another way, retire
the current selectivity = -1 ("undefined") concept: we will always have a 
selectivity, even if just a rule-of-thumb one. That is, some information is 
better than none.
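Putting the proposals together, estimation might look like this sketch 
(hypothetical helper, not an Impala API), which always returns a usable value:

{noformat}
// Proposal 4 (sketch): always produce a selectivity; retire -1 ("undefined").
static double selectivity(String op, long ndv) {
  if (ndv <= 0) return defaultSelectivity(op);  // no stats: a-priori default
  switch (op) {
    case "=":  return 1.0 / ndv;                // stats-based estimate
    case "!=": return 1.0 - 1.0 / ndv;
    default:   return defaultSelectivity(op);   // NDV alone can't help ranges
  }
}
{noformat}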

h4. Define an A-priori NDV Estimate

Impala uses NDV for a number of purposes. If no stats are available, Impala 
assumes no NDV. Given the assumed reduction factor, we can guess an NDV = 1/0.1 
= 10. Depending on where NDV is used (need to research this), we might want to 
choose some other value, perhaps 100. So, the fifth proposal is to assume an 
a-priori (before observation) NDV value, with the actual default value TBD.
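In code this would be just one more a-priori constant (the value shown follows 
from 1 / DEFAULT_SELECTIVITY; the final value is TBD per this proposal):

{noformat}
// Proposal 5 (sketch): a-priori NDV when no stats exist; final value TBD.
public static final long DEFAULT_NDV = 10;  // = 1 / DEFAULT_SELECTIVITY
{noformat}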

h4. Separate the NDV=0 and Unknown NDV Cases

As described in IMPALA-7310, Impala does not currently handle the case of a 
table with stats, but a column in the table contains only NULL values. The 
stats mechanism defines NDV as "number of distinct non-null values". So, a
table full of nulls has NDV = 0. Unfortunately, if no stats are available at 
all, we also have NDV = 0.

So, the sixth proposal is to separate these cases. If we can tell that a column 
is a nullable type, assume that the actual NDV is (stats NDV + 1). (This is, in 
fact, the NDV calculation used by Impala itself when computing NDV for 
expressions. See 
[{{ExprNdvTest}}|https://github.com/apache/impala/blob/master/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java].)

To avoid disturbing existing tests, apply this rule only when it makes a 
material difference: when the NDV value is, say, less than 10. (There is no 
reason to count the null value if NDV is large.) So, a column with all nulls 
will have an (adjusted) NDV
= 1.

Then, as noted above, if no stats exist, use the a-priori NDV.
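A sketch of the combined NDV handling (all names hypothetical; the threshold of 
10 is the example value from above):

{noformat}
// Proposal 6 (sketch): separate "no stats" from "all nulls".
static long effectiveNdv(boolean hasStats, boolean nullable, long statsNdv) {
  if (!hasStats) return DEFAULT_NDV;  // no stats at all: use the a-priori NDV
  // Stats count only distinct non-null values, so an all-null column has
  // NDV = 0. Count the null "value" where it matters, i.e. for small NDVs.
  if (nullable && statsNdv < 10) return statsNdv + 1;
  return statsNdv;
}
{noformat}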

h4. References

* [Cost Estimation 
Overview|https://courses.cs.washington.edu/courses/cse544/09wi/lecture-notes/lecture8/lecture8.pdf]
* [Reduction 
factors|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
 from the classic Ramakrishnan and Gehrke book.


