[jira] [Commented] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram

2019-05-06 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834046#comment-16834046
 ] 

ASF GitHub Bot commented on DRILL-7187:
---

amansinha100 commented on pull request #1772: DRILL-7187: Improve selectivity 
estimation of BETWEEN predicates and …
URL: https://github.com/apache/drill/pull/1772
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Improve selectivity estimates for range predicates when using histogram
> ---
>
> Key: DRILL-7187
> URL: https://issues.apache.org/jira/browse/DRILL-7187
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Aman Sinha
> Assignee: Aman Sinha
> Priority: Major
> Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> Two types of selectivity estimation improvements are needed:
> 1. For range predicates on the same column, we need to collect all such 
> predicates into one group and do a single histogram lookup for them. 
> For instance: 
> {noformat}
>  WHERE a > 10 AND b < 20 AND c = 100 AND a <= 50 AND b < 50
> {noformat}
> Currently, Drill treats each conjunct independently and multiplies the 
> individual selectivities, which does not give accurate estimates. Here, we 
> want to group the predicates on 'a' together and do a single lookup, and 
> similarly for 'b'. 
> 2. NULLs are not maintained by the histogram, but when doing the selectivity 
> calculations the histogram should use the totalRowCount as the denominator 
> rather than the non-null count. 
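
A minimal sketch of improvement 1, under stated assumptions: `RangeGrouping`, `Pred`, `Interval` and `groupByColumn` are hypothetical names used for illustration and are not Drill classes. It folds all range conjuncts on one column into a single interval, so that one histogram lookup per column could replace the product of per-conjunct selectivities.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; these are not Drill classes.
final class RangeGrouping {

  /** One conjunct of the form "column op value", with op one of >, >=, <, <=. */
  static final class Pred {
    final String column;
    final String op;
    final double value;

    Pred(String column, String op, double value) {
      this.column = column;
      this.op = op;
      this.value = value;
    }
  }

  /** A numeric interval; an infinite bound means that side is unbounded. */
  static final class Interval {
    double low = Double.NEGATIVE_INFINITY;
    double high = Double.POSITIVE_INFINITY;
    boolean lowInclusive = false;
    boolean highInclusive = false;

    /** Narrow this interval by one more range conjunct. */
    void intersect(Pred p) {
      boolean isLowerBound = p.op.startsWith(">");
      boolean inclusive = p.op.endsWith("=");
      if (isLowerBound) {
        // a larger value, or the same value made exclusive, is a tighter lower bound
        if (p.value > low || (p.value == low && !inclusive)) {
          low = p.value;
          lowInclusive = inclusive;
        }
      } else {
        // a smaller value, or the same value made exclusive, is a tighter upper bound
        if (p.value < high || (p.value == high && !inclusive)) {
          high = p.value;
          highInclusive = inclusive;
        }
      }
    }

    @Override
    public String toString() {
      return (lowInclusive ? "[" : "(") + low + ", " + high + (highInclusive ? "]" : ")");
    }
  }

  /** Group range conjuncts by column, keeping one intersected interval per column. */
  static Map<String, Interval> groupByColumn(Pred... conjuncts) {
    Map<String, Interval> byColumn = new LinkedHashMap<>();
    for (Pred p : conjuncts) {
      byColumn.computeIfAbsent(p.column, c -> new Interval()).intersect(p);
    }
    return byColumn;
  }

  public static void main(String[] args) {
    // Range conjuncts from: a > 10 AND b < 20 AND a <= 50 AND b < 50
    // (the equality c = 100 is not a range predicate and is left out here)
    Map<String, Interval> ranges = groupByColumn(
        new Pred("a", ">", 10), new Pred("b", "<", 20),
        new Pred("a", "<=", 50), new Pred("b", "<", 50));
    ranges.forEach((col, range) -> System.out.println(col + " in " + range));
  }
}
```

For the example predicate, `a` collapses to the interval (10, 50] and `b` to everything below 20, since `b < 50` is already implied by `b < 20`.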



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram

2019-04-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830440#comment-16830440
 ] 

ASF GitHub Bot commented on DRILL-7187:
---

amansinha100 commented on pull request #1772: DRILL-7187: Improve selectivity 
estimation of BETWEEN predicates and …
URL: https://github.com/apache/drill/pull/1772#discussion_r279831169
 
 

 ##
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/NumericEquiDepthHistogram.java
 ##
 @@ -76,129 +73,188 @@ public NumericEquiDepthHistogram(int numBuckets) {
     numRowsPerBucket = -1;
   }
 
-  public long getNumRowsPerBucket() {
+  public double getNumRowsPerBucket() {
     return numRowsPerBucket;
   }
 
-  public void setNumRowsPerBucket(long numRows) {
+  public void setNumRowsPerBucket(double numRows) {
     this.numRowsPerBucket = numRows;
   }
 
   public Double[] getBuckets() {
     return buckets;
   }
 
+  /**
+   * Get the number of buckets in the histogram
+   * number of buckets is 1 less than the total # entries in the buckets array since last
+   * entry is the end point of the last bucket
+   */
+  public int getNumBuckets() {
+    return buckets.length - 1;
+  }
+
+  /**
+   * Estimate the selectivity of a filter which may contain several range predicates and in the general case is of
+   * type: col op value1 AND col op value2 AND col op value3 ...
+   *
+   *    e.g a > 10 AND a < 50 AND a >= 20 AND a <= 70 ...
+   *
+   * Even though in most cases it will have either 1 or 2 range conditions, we still have to handle the general case
+   * For each conjunct, we will find the histogram bucket ranges and intersect them, taking into account that the
+   * first and last bucket may be partially covered and all other buckets in the middle are fully covered.
+   */
   @Override
-  public Double estimatedSelectivity(final RexNode filter) {
-    if (numRowsPerBucket >= 0) {
-      // at a minimum, the histogram should have a start and end point of 1 bucket, so at least 2 entries
-      Preconditions.checkArgument(buckets.length >= 2,  "Histogram has invalid number of entries");
-      final int first = 0;
-      final int last = buckets.length - 1;
-
-      // number of buckets is 1 less than the total # entries in the buckets array since last
-      // entry is the end point of the last bucket
-      final int numBuckets = buckets.length - 1;
-      final long totalRows = numBuckets * numRowsPerBucket;
+  public Double estimatedSelectivity(final RexNode columnFilter, final long totalRowCount) {
+    if (numRowsPerBucket == 0) {
+      return null;
+    }
+
+    // at a minimum, the histogram should have a start and end point of 1 bucket, so at least 2 entries
+    Preconditions.checkArgument(buckets.length >= 2,  "Histogram has invalid number of entries");
+
+    List<RexNode> filterList = RelOptUtil.conjunctions(columnFilter);
+
+    Range<Double> fullRange = Range.all();
+    List<RexNode> unknownFilterList = new ArrayList<RexNode>();
+
+    Range<Double> valuesRange = getValuesRange(filterList, fullRange, unknownFilterList);
+
+    long numSelectedRows;
+    // unknown counter is a count of filter predicates whose bucket ranges cannot be
+    // determined from the histogram; this may happen for instance when there is an expression or
+    // function involved..e.g  col > CAST('10' as INT)
+    int unknown = unknownFilterList.size();
+
+    if (valuesRange.hasLowerBound() || valuesRange.hasUpperBound()) {
+      numSelectedRows = getSelectedRows(valuesRange);
+    } else {
+      numSelectedRows = 0;
+    }
+
+    if (numSelectedRows <= 0) {
+      return SMALL_SELECTIVITY;
+    } else {
+      // for each 'unknown' range filter selectivity, use a default of 0.5 (matches Calcite)
+      double scaleFactor = Math.pow(0.5, unknown);
+      return  ((double) numSelectedRows / totalRowCount) * scaleFactor;
+    }
+  }
+
+  private Range<Double> getValuesRange(List<RexNode> filterList, Range<Double> fullRange, List<RexNode> unkownFilterList) {
+    Range<Double> currentRange = fullRange;
+    for (RexNode filter : filterList) {
       if (filter instanceof RexCall) {
-        // get the operator
-        SqlOperator op = ((RexCall) filter).getOperator();
-        if (op.getKind() == SqlKind.GREATER_THAN ||
-            op.getKind() == SqlKind.GREATER_THAN_OR_EQUAL) {
-          Double value = getLiteralValue(filter);
-          if (value != null) {
-
-            // *** Handle the boundary conditions first ***
-
-            // if value is less than or equal to the first bucket's start point then all rows qualify
-            int result = value.compareTo(buckets[first]);
-            if (result <= 0) {
-              return LARGE_SELECTIVITY;
-            }
-            // if value is greater than the end point of the last bucket, then none of the rows qualify
-            result = value.compareTo(buckets[last]);
-            if (result > 0) {
-              return SMALL_SELECTIVITY;
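
The lookup described in the Javadoc of this hunk (partially covered first and last buckets, fully covered middle buckets) can be sketched as follows. This is only an illustration under stated assumptions: the body of `getSelectedRows` is not shown in this thread, so `EquiDepthSketch`, `selectedRows` and `selectivity` are hypothetical names, the bucket boundaries are made up, values are assumed to be uniformly spread inside a bucket, and open versus closed bounds are not distinguished.

```java
// Hypothetical illustration only; not the Drill implementation.
final class EquiDepthSketch {

  /**
   * Estimate the rows whose value lies between low and high, assuming values
   * are spread uniformly inside each bucket (an assumption of this sketch).
   * buckets holds the bucket boundaries, so there are buckets.length - 1 buckets.
   */
  static double selectedRows(double[] buckets, double rowsPerBucket, double low, double high) {
    double rows = 0;
    for (int i = 0; i < buckets.length - 1; i++) {
      double start = buckets[i];
      double end = buckets[i + 1];
      // width of the overlap between [low, high] and this bucket;
      // it is partial only for the first and last bucket that the range touches
      double overlap = Math.min(high, end) - Math.max(low, start);
      if (overlap > 0 && end > start) {
        rows += rowsPerBucket * (overlap / (end - start));
      }
    }
    return rows;
  }

  /** Selectivity over all rows (NULLs included), halved once per unresolved conjunct. */
  static double selectivity(double selectedRows, long totalRowCount, int unknown) {
    return (selectedRows / totalRowCount) * Math.pow(0.5, unknown);
  }

  public static void main(String[] args) {
    double[] buckets = {0, 10, 20, 40, 80};   // 4 buckets (made-up boundaries)
    double rowsPerBucket = 100;               // 400 non-null rows in total
    long totalRowCount = 500;                 // 100 further rows are NULL

    // a > 15 AND a <= 50: half of [10,20], all of [20,40], a quarter of [40,80]
    double rows = selectedRows(buckets, rowsPerBucket, 15, 50);
    System.out.println("selected rows: " + rows);
    System.out.println("selectivity:   " + selectivity(rows, totalRowCount, 0));
    System.out.println("with 1 unknown conjunct: " + selectivity(rows, totalRowCount, 1));
  }
}
```

With these made-up numbers the range selects about 175 rows, giving a selectivity of about 0.35 against the total row count, or about 0.175 when one additional conjunct could not be resolved from the histogram.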

[jira] [Commented] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram

2019-04-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830443#comment-16830443
 ] 

ASF GitHub Bot commented on DRILL-7187:
---

amansinha100 commented on issue #1772: DRILL-7187: Improve selectivity 
estimation of BETWEEN predicates and …
URL: https://github.com/apache/drill/pull/1772#issuecomment-488018792
 
 
   @gparai I have addressed your review comments. Please take another look.
 



[jira] [Commented] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram

2019-04-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830436#comment-16830436
 ] 

ASF GitHub Bot commented on DRILL-7187:
---

amansinha100 commented on pull request #1772: DRILL-7187: Improve selectivity 
estimation of BETWEEN predicates and …
URL: https://github.com/apache/drill/pull/1772#discussion_r279829751
 
 

 ##
 File path: exec/java-exec/src/test/java/org/apache/drill/exec/sql/TestAnalyze.java
 ##
 @@ -480,6 +490,24 @@ public void testHistogramWithColumnsWithAllNulls() throws Exception {
     }
   }
 
+
+  @Test
+  public void testHistogramWithBetweenPredicate() throws Exception {
+    try {
+      test("ALTER SESSION SET `planner.slice_target` = 1");
+      test("ALTER SESSION SET `store.format` = 'parquet'");
+      test("create table dfs.tmp.orders2 as select * from cp.`tpch/orders.parquet`");
+      test("analyze table dfs.tmp.orders2 compute statistics");
+      test("alter session set `planner.statistics.use` = true");
+
+      String query = "select 1 from dfs.tmp.orders2 o where o.o_orderdate >= date '1996-10-01' and o.o_orderdate < date '1996-10-01' + interval '3' month";
 
 Review comment:
   I renamed the test to use 'Interval' instead of 'Between', since that is 
   the main purpose of this test. I also added 2 other tests in 
   `testHistogramWithDataTypes1` that do use the `BETWEEN` clause.
 



[jira] [Commented] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram

2019-04-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830434#comment-16830434
 ] 

ASF GitHub Bot commented on DRILL-7187:
---

amansinha100 commented on pull request #1772: DRILL-7187: Improve selectivity 
estimation of BETWEEN predicates and …
URL: https://github.com/apache/drill/pull/1772#discussion_r279828965
 
 

 ##
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdSelectivity.java
 ##
 @@ -356,8 +426,8 @@ private boolean isMultiColumnPredicate(final RexNode node) {
     return findAllRexInputRefs(node).size() > 1;
   }
 
-  private static List<RexInputRef> findAllRexInputRefs(final RexNode node) {
-      List<RexInputRef> rexRefs = new ArrayList<>();
+  private static Set<RexInputRef> findAllRexInputRefs(final RexNode node) {
 
 Review comment:
   Yes, thanks for pointing that out, even though the predicate $0=$0 is 
   unexpected (something to investigate in the future). I have reverted this 
   change, so the method returns a List as before. Instead, at the original 
   call site of `isMultiColumnPredicate()` I added a second condition 
   (line 182) which ensures that conditions of the form `$1 > 10 AND $1 < 20`, 
   which are created after calling `preProcessRangeConditions()`, are not 
   treated as multi-column predicates.
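
The List versus Set distinction discussed in this review thread can be sketched as follows. This is an illustration under stated assumptions, not the Drill code: column references are modeled as plain integers, `InputRefSketch` and its methods are hypothetical names, and the actual second condition at line 182 is not shown in this thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration only; column references are modeled as plain ints.
final class InputRefSketch {

  /** List-based check as in the quoted code: more than one reference means multi-column. */
  static boolean isMultiColumnByList(List<Integer> inputRefs) {
    return inputRefs.size() > 1;
  }

  /** Set-based variant: duplicates collapse, so only distinct columns are counted. */
  static boolean isMultiColumnBySet(List<Integer> inputRefs) {
    return new HashSet<>(inputRefs).size() > 1;
  }

  /** True when every reference points at the same column, e.g. $1 > 10 AND $1 < 20. */
  static boolean referencesSingleColumn(List<Integer> inputRefs) {
    return new HashSet<>(inputRefs).size() == 1;
  }

  public static void main(String[] args) {
    List<Integer> selfEquality = Arrays.asList(0, 0);   // $0 = $0
    List<Integer> rangePair = Arrays.asList(1, 1);      // $1 > 10 AND $1 < 20

    // The List keeps $0 = $0 classified as multi-column; the Set would not.
    System.out.println(isMultiColumnByList(selfEquality));
    System.out.println(isMultiColumnBySet(selfEquality));

    // A separate single-column test lets grouped range conditions opt out
    // without changing how $0 = $0 is classified.
    System.out.println(isMultiColumnByList(rangePair) && !referencesSingleColumn(rangePair));
  }
}
```

Keeping the List-based check and adding a separate condition at the call site leaves the classification of `$0 = $0` unchanged while letting grouped range conditions on one column through.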
 



[jira] [Commented] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram

2019-04-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829826#comment-16829826
 ] 

ASF GitHub Bot commented on DRILL-7187:
---

gparai commented on pull request #1772: DRILL-7187: Improve selectivity 
estimation of BETWEEN predicates and …
URL: https://github.com/apache/drill/pull/1772#discussion_r279571671
 
 

 ##
 File path: exec/java-exec/src/test/java/org/apache/drill/exec/sql/TestAnalyze.java
 ##
 @@ -480,6 +490,24 @@ public void testHistogramWithColumnsWithAllNulls() throws Exception {
     }
   }
 
+
+  @Test
+  public void testHistogramWithBetweenPredicate() throws Exception {
+    try {
+      test("ALTER SESSION SET `planner.slice_target` = 1");
+      test("ALTER SESSION SET `store.format` = 'parquet'");
+      test("create table dfs.tmp.orders2 as select * from cp.`tpch/orders.parquet`");
+      test("analyze table dfs.tmp.orders2 compute statistics");
+      test("alter session set `planner.statistics.use` = true");
+
+      String query = "select 1 from dfs.tmp.orders2 o where o.o_orderdate >= date '1996-10-01' and o.o_orderdate < date '1996-10-01' + interval '3' month";
 
 Review comment:
   I do not see the `BETWEEN` clause in this query.
 



[jira] [Commented] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram

2019-04-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829827#comment-16829827
 ] 

ASF GitHub Bot commented on DRILL-7187:
---

gparai commented on pull request #1772: DRILL-7187: Improve selectivity 
estimation of BETWEEN predicates and …
URL: https://github.com/apache/drill/pull/1772#discussion_r279574279
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/NumericEquiDepthHistogram.java
 ##

[jira] [Commented] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram

2019-04-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829825#comment-16829825
 ] 

ASF GitHub Bot commented on DRILL-7187:
---

gparai commented on pull request #1772: DRILL-7187: Improve selectivity 
estimation of BETWEEN predicates and …
URL: https://github.com/apache/drill/pull/1772#discussion_r279571290
 
 

 ##
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdSelectivity.java
 ##
 @@ -356,8 +426,8 @@ private boolean isMultiColumnPredicate(final RexNode node) {
     return findAllRexInputRefs(node).size() > 1;
   }
 
-  private static List<RexInputRef> findAllRexInputRefs(final RexNode node) {
-      List<RexInputRef> rexRefs = new ArrayList<>();
+  private static Set<RexInputRef> findAllRexInputRefs(final RexNode node) {
 
 Review comment:
   Would this not break the existing logic? For a predicate like $0=$0, using 
   a `Set` would cause the `isMultiColumnPredicate` function to return false.
 



[jira] [Commented] (DRILL-7187) Improve selectivity estimates for range predicates when using histogram

2019-04-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829620#comment-16829620
 ] 

ASF GitHub Bot commented on DRILL-7187:
---

amansinha100 commented on pull request #1772: DRILL-7187: Improve selectivity 
estimation of BETWEEN predicates and …
URL: https://github.com/apache/drill/pull/1772
 
 
   …arbitrary combination of range predicates.
   
   - Also, propagate the totalRowCount to the histogram selectivity estimation and use it instead of the nonNullCount.
   - Before and after estimates for the following predicate:
   `where o.o_orderdate >= date '1996-10-01' and o.o_orderdate < date '1996-10-01' + interval '3' month`
   BEFORE this PR: estimated filter row count = **3206**
   AFTER this PR: estimated filter row count = **601**
   ACTUAL row count = **561**
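
To put the three row counts above in proportion, a small sketch that expresses each estimate as a multiple of the actual count. `EstimateError` and `ratio` are hypothetical names for illustration; only the numbers come from the PR description.

```java
// Hypothetical helper for illustration; the row counts are the ones reported above.
final class EstimateError {

  /** Ratio of estimated to actual row count; 1.0 would be a perfect estimate. */
  static double ratio(long estimated, long actual) {
    return (double) estimated / actual;
  }

  public static void main(String[] args) {
    long actual = 561;
    System.out.printf("before: %.2fx of actual%n", ratio(3206, actual));
    System.out.printf("after:  %.2fx of actual%n", ratio(601, actual));
  }
}
```

The estimate moves from roughly 5.7 times the actual row count to roughly 1.07 times.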
   
 
