[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19783 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281819
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
@@ -114,4 +114,99 @@ object EstimationUtils {
     }
   }

+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the id of the first bin into which a column value falls.
+   */
+  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
+    var i = 0
+    while ((i < bins.length) && (value > bins(i).hi)) {
+      i += 1
+    }
+    i
+  }
+
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the id of the last bin into which a column value falls.
+   */
+  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
+    var i = bins.length - 1
+    while ((i >= 0) && (value < bins(i).lo)) {
+      i -= 1
+    }
+    i
+  }
+
+  /**
+   * Returns a percentage of a bin holding values for column value in the range of
+   * [lowerValue, higherValue]
+   *
+   * @param higherValue a given upper bound value of a specified column value range
+   * @param lowerValue a given lower bound value of a specified column value range
+   * @param bin a single histogram bin
+   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
+   */
+  private def getOccupation(
+      higherValue: Double,
+      lowerValue: Double,
+      bin: HistogramBin): Double = {
+    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
+    if (bin.hi == bin.lo) {
+      // the entire bin is covered in the range
+      1.0
+    } else if (higherValue == lowerValue) {
+      // set percentage to 1/NDV
+      1.0 / bin.ndv.toDouble
+    } else {
+      // Use proration since the range falls inside this bin.
+      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
+    }
+  }
+
+  /**
+   * Returns the number of bins for column values in [lowerValue, higherValue].
+   * The column value distribution is saved in an equi-height histogram. The return value is a
+   * double because we may return a portion of a bin. For example, the predicate
+   * "column = 8" may return 0.2 bins if the bin holding the value has 5 distinct values.
+   *
+   * @param higherId id of the high end bin holding the high end value of a column range
+   * @param lowerId id of the low end bin holding the low end value of a column range
+   * @param higherEnd a given upper bound value of a specified column value range
+   * @param lowerEnd a given lower bound value of a specified column value range
+   * @param histogram a numeric equi-height histogram
+   * @return the number of bins for column values in [lowerEnd, higherEnd].
+   */
+  def getOccupationBins(
+      higherId: Int,
+      lowerId: Int,
+      higherEnd: Double,
+      lowerEnd: Double,
+      histogram: Histogram): Double = {
+    assert(lowerId <= higherId)
+
+    if (lowerId == higherId) {
+      val curBin = histogram.bins(lowerId)
+      getOccupation(higherEnd, lowerEnd, curBin)
+    } else {
+      // compute how much lowerEnd/higherEnd occupies its bin
+      val lowerCurBin = histogram.bins(lowerId)
+      val lowerPart = getOccupation(lowerCurBin.hi, lowerEnd, lowerCurBin)
--- End diff --
shall we assert that `lowerBin.lo <= lowerEnd`
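For readers following the review, the two bin-search helpers can be exercised standalone. This is a minimal sketch, not the catalyst code: `HistogramBin` here is a simplified stand-in for Spark's class, and the example histogram is invented to show why the first and last indexes can differ for a skewed value.

```scala
// Simplified stand-in for Spark's catalyst HistogramBin: lower/upper bounds
// plus the number of distinct values the bin holds.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

// Linear scan from the left: first bin whose upper bound reaches the value.
def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = 0
  while ((i < bins.length) && (value > bins(i).hi)) {
    i += 1
  }
  i
}

// Linear scan from the right: last bin whose lower bound does not exceed the value.
def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = bins.length - 1
  while ((i >= 0) && (value < bins(i).lo)) {
    i -= 1
  }
  i
}

// A skewed value (5.0 here) can span several bins of an equi-height
// histogram, so the first and last matching bin indexes differ.
val bins = Array(
  HistogramBin(1.0, 5.0, 5),
  HistogramBin(5.0, 5.0, 1),
  HistogramBin(5.0, 10.0, 4))

println(findFirstBinForValue(5.0, bins)) // 0
println(findLastBinForValue(5.0, bins))  // 2
```

The middle bin collapses to a single point, which is exactly the skewed-distribution case the patch is designed to handle.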
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281132
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the id of the first bin into which a column value falls.
--- End diff --
Seems this is redundant, shall we remove it?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281044
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
--- End diff --
nit: `number` -> `index`?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281155
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
+  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
+    var i = 0
+    while ((i < bins.length) && (value > bins(i).hi)) {
+      i += 1
+    }
+    i
+  }
+
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
--- End diff --
ditto
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281583
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
+  /**
+   * Returns the number of bins for column values in [lowerValue, higherValue].
+   * The column value distribution is saved in an equi-height histogram. The return value is a
+   * double because we may return a portion of a bin. For example, the predicate
+   * "column = 8" may return 0.2 bins if the bin holding the value has 5 distinct values.
+   *
+   * @param higherId id of the high end bin holding the high end value of a column range
--- End diff --
nit: `higherIndex`
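To make the occupation math concrete, here is a hedged standalone sketch of the two helpers from the quoted hunk. `Histogram` and `HistogramBin` are simplified stand-ins for the catalyst classes (the `height` field is carried along but unused here), and the sample histogram is invented:

```scala
case class HistogramBin(lo: Double, hi: Double, ndv: Long)
case class Histogram(height: Double, bins: Array[HistogramBin])

// Fraction of one bin covered by [lowerValue, higherValue]; a zero-width
// range (a point predicate) is estimated as 1/ndv of the bin.
def getOccupation(higherValue: Double, lowerValue: Double, bin: HistogramBin): Double = {
  assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
  if (bin.hi == bin.lo) 1.0
  else if (higherValue == lowerValue) 1.0 / bin.ndv.toDouble
  else math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
}

// Possibly-fractional count of bins covered by [lowerEnd, higherEnd]:
// partial coverage of the two boundary bins plus the full bins in between.
def getOccupationBins(higherId: Int, lowerId: Int, higherEnd: Double,
    lowerEnd: Double, histogram: Histogram): Double = {
  assert(lowerId <= higherId)
  if (lowerId == higherId) {
    getOccupation(higherEnd, lowerEnd, histogram.bins(lowerId))
  } else {
    val lowerBin = histogram.bins(lowerId)
    val higherBin = histogram.bins(higherId)
    getOccupation(lowerBin.hi, lowerEnd, lowerBin) +
      getOccupation(higherEnd, higherBin.lo, higherBin) +
      (higherId - lowerId - 1)
  }
}

val hist = Histogram(5.0, Array(
  HistogramBin(0.0, 10.0, 5), HistogramBin(10.0, 20.0, 5), HistogramBin(20.0, 30.0, 5)))
// [5, 25] covers half of bin 0, all of bin 1 and half of bin 2.
println(getOccupationBins(2, 0, 25.0, 5.0, hist)) // 2.0
// "column = 8" with ndv = 5 in its bin occupies 0.2 bins, matching the scaladoc example.
println(getOccupationBins(0, 0, 8.0, 8.0, hist))
```

The second call shows why the return type is a double rather than an integer bin count.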
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156282033
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -332,8 +332,44 @@ case class FilterEstimation(plan: Filter) extends Logging {
       colStatsMap.update(attr, newStats)
     }
+      // We compute filter selectivity using Histogram information.
+      val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
+      val histogram = colStat.histogram.get
+      val hgmBins = histogram.bins
+
+      // find bins where column's current min and max locate. Note that a column's [min, max]
+      // range may change due to another condition applied earlier.
+      val min = EstimationUtils.toDecimal(colStat.min.get, literal.dataType).toDouble
+      val max = EstimationUtils.toDecimal(colStat.max.get, literal.dataType).toDouble
+      val minBinId = EstimationUtils.findFirstBinForValue(min, hgmBins)
--- End diff --
nit: `minBinIndex`
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281162
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the id of the last bin into which a column value falls.
--- End diff --
ditto
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281918
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging {
       colStatsMap.update(attr, newStats)
     }
-    Some(1.0 / BigDecimal(ndv))
-  } else {
+    if (colStat.histogram.isEmpty) {
+      // returns 1/ndv if there is no histogram
+      Some(1.0 / BigDecimal(ndv))
+    } else {
+      // We compute filter selectivity using Histogram information.
--- End diff --
did you create a new method?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156282461
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -332,8 +332,44 @@ case class FilterEstimation(plan: Filter) extends Logging {
       colStatsMap.update(attr, newStats)
     }
-    Some(1.0 / BigDecimal(ndv))
-  } else {
+    if (colStat.histogram.isEmpty) {
+      // returns 1/ndv if there is no histogram
+      Some(1.0 / BigDecimal(ndv))
+    } else {
+      // We compute filter selectivity using Histogram information.
+      val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
+      val histogram = colStat.histogram.get
+      val hgmBins = histogram.bins
+
+      // find bins where column's current min and max locate. Note that a column's [min, max]
+      // range may change due to another condition applied earlier.
+      val min = EstimationUtils.toDecimal(colStat.min.get, literal.dataType).toDouble
+      val max = EstimationUtils.toDecimal(colStat.max.get, literal.dataType).toDouble
+      val minBinId = EstimationUtils.findFirstBinForValue(min, hgmBins)
+      val maxBinId = EstimationUtils.findLastBinForValue(max, hgmBins)
+
+      // compute how many bins the column's current valid range [min, max] occupies.
+      // Note that a column's [min, max] range may vary after we apply some filter conditions.
+      val validRangeBins = EstimationUtils.getOccupationBins(maxBinId, minBinId, max,
+        min, histogram)
+
+      val lowerBinId = EstimationUtils.findFirstBinForValue(datum, hgmBins)
+      val higherBinId = EstimationUtils.findLastBinForValue(datum, hgmBins)
+      assert(lowerBinId <= higherBinId)
+      val lowerBinNdv = hgmBins(lowerBinId).ndv
+      val higherBinNdv = hgmBins(higherBinId).ndv
+      // assume uniform distribution in each bin
+      val occupiedBins = if (lowerBinId == higherBinId) {
--- End diff --
is this just `EstimationUtils.getOccupationBins(higherBinId, lowerBinId, datum, datum, histogram)`?
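The equality-predicate estimate being reviewed here can be sketched end to end. This is a hedged reconstruction, not the catalyst code: `HistogramBin`/`Histogram` are simplified stand-ins, `equalitySelectivity` is an illustrative name, and, following the reviewer's observation, the literal's occupation is computed as `getOccupationBins` over the zero-width range [datum, datum]:

```scala
case class HistogramBin(lo: Double, hi: Double, ndv: Long)
case class Histogram(height: Double, bins: Array[HistogramBin])

def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = 0
  while ((i < bins.length) && (value > bins(i).hi)) i += 1
  i
}

def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = bins.length - 1
  while ((i >= 0) && (value < bins(i).lo)) i -= 1
  i
}

def getOccupation(higherValue: Double, lowerValue: Double, bin: HistogramBin): Double = {
  if (bin.hi == bin.lo) 1.0
  else if (higherValue == lowerValue) 1.0 / bin.ndv.toDouble
  else math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
}

def getOccupationBins(higherId: Int, lowerId: Int, higherEnd: Double,
    lowerEnd: Double, histogram: Histogram): Double = {
  if (lowerId == higherId) {
    getOccupation(higherEnd, lowerEnd, histogram.bins(lowerId))
  } else {
    val lowerBin = histogram.bins(lowerId)
    val higherBin = histogram.bins(higherId)
    getOccupation(lowerBin.hi, lowerEnd, lowerBin) +
      getOccupation(higherEnd, higherBin.lo, higherBin) +
      (higherId - lowerId - 1)
  }
}

// Selectivity of `column = datum` within the column's current valid range
// [min, max]: bins occupied by the literal over bins covered by the range.
def equalitySelectivity(datum: Double, min: Double, max: Double, hist: Histogram): Double = {
  val bins = hist.bins
  val validRangeBins = getOccupationBins(
    findLastBinForValue(max, bins), findFirstBinForValue(min, bins), max, min, hist)
  val occupiedBins = getOccupationBins(
    findLastBinForValue(datum, bins), findFirstBinForValue(datum, bins), datum, datum, hist)
  occupiedBins / validRangeBins
}

// 5.0 is a skewed value spanning three bins: it occupies 0.2 + 1.0 + 0.25 = 1.45
// of the 3.0 bins covered by [1, 10].
val hist = Histogram(4.0, Array(
  HistogramBin(1.0, 5.0, 5), HistogramBin(5.0, 5.0, 1), HistogramBin(5.0, 10.0, 4)))
println(equalitySelectivity(5.0, 1.0, 10.0, hist))
```

A plain 1/ndv estimate would give 0.1 here; the histogram-aware estimate is roughly 0.48, reflecting the skew toward the value 5.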
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281223
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
+  /**
+   * Returns a percentage of a bin holding values for column value in the range of
+   * [lowerValue, higherValue]
+   *
+   * @param higherValue a given upper bound value of a specified column value range
+   * @param lowerValue a given lower bound value of a specified column value range
+   * @param bin a single histogram bin
+   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
--- End diff --
redundant
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155964396
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
@@ -114,4 +114,171 @@ object EstimationUtils {
     }
   }

+  /**
+   * Returns the number of the first bin into which a column values falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the first bin into which a column values falls.
+   */
+  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
+    var i = 0
+    while ((i < bins.length) && (value > bins(i).hi)) {
+      i +=1
+    }
+    i
+  }
+
+  /**
+   * Returns the number of the last bin into which a column values falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the last bin into which a column values falls.
+   */
+  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
+    var i = bins.length - 1
+    while ((i >= 0) && (value < bins(i).lo)) {
+      i -=1
+    }
+    i
+  }
+
+  /**
+   * Returns a percentage of a bin holding values for column value in the range of
+   * [lowerValue, higherValue]
+   *
+   * @param binId a given bin id in a specified histogram
+   * @param higherValue a given upper bound value of a specified column value range
+   * @param lowerValue a given lower bound value of a specified column value range
+   * @param histogram a numeric equi-height histogram
+   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
+   */
+  private def getOccupation(
+      binId: Int,
+      higherValue: Double,
+      lowerValue: Double,
+      histogram: Histogram): Double = {
+    val curBin = histogram.bins(binId)
+    if (curBin.hi == curBin.lo) {
+      // the entire bin is covered in the range
+      1.0
+    } else if (higherValue == lowerValue) {
+      // set percentage to 1/NDV
+      1.0 / curBin.ndv.toDouble
+    } else {
+      // Use proration since the range falls inside this bin.
+      math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0)
+    }
+  }
+
+  /**
+   * Returns the number of bins for column values in [lowerValue, higherValue].
--- End diff --
OK. Done.
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155963930
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
   test("cbool > false") {
     validateEstimatedStats(
       Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)),
-      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true),
+      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true),
--- End diff --
Agreed with wzhfy. Today's logic is: for the two conditions (column > x) and (column >= x), we set the min value to x; we do not distinguish the two cases. This is because we do not know the exact next value larger than x when x is of a continuous data type like double. We could add special handling for discrete data types such as boolean or integer, but, as wzhfy said, it does not deserve the complexity.
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155963740
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -471,37 +508,47 @@ case class FilterEstimation(plan: Filter) extends Logging {
       percent = 1.0
     } else {
       // This is the partial overlap case:
-      // Without advanced statistics like histogram, we assume uniform data distribution.
-      // We just prorate the adjusted range over the initial range to compute filter selectivity.
-      assert(max > min)
-      percent = op match {
-        case _: LessThan =>
-          if (numericLiteral == max) {
-            // If the literal value is right on the boundary, we can minus the part of the
-            // boundary value (1/ndv).
-            1.0 - 1.0 / ndv
-          } else {
-            (numericLiteral - min) / (max - min)
-          }
-        case _: LessThanOrEqual =>
-          if (numericLiteral == min) {
-            // The boundary value is the only satisfying value.
-            1.0 / ndv
-          } else {
-            (numericLiteral - min) / (max - min)
-          }
-        case _: GreaterThan =>
-          if (numericLiteral == min) {
-            1.0 - 1.0 / ndv
-          } else {
-            (max - numericLiteral) / (max - min)
-          }
-        case _: GreaterThanOrEqual =>
-          if (numericLiteral == max) {
-            1.0 / ndv
-          } else {
-            (max - numericLiteral) / (max - min)
-          }
+
+      if (colStat.histogram.isEmpty) {
--- End diff --
We cannot move the if statement to an upper level. This is because, for the partial-overlap case, we need to update the [min, max] range for a given column. For the no-overlap and complete-overlap cases, we do not need to do so. I think the current code is modular for this reason.
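The pre-histogram branch being replaced in the hunk above reduces to a small proration rule. This hypothetical sketch (names like `CmpOp` and `prorate` are illustrative, not Spark's) shows the uniform-distribution assumption with the 1/ndv boundary adjustments:

```scala
// Uniform-distribution proration over the valid range [min, max], with
// 1/ndv adjustments when the literal sits exactly on a range boundary.
sealed trait CmpOp
case object Lt extends CmpOp
case object Le extends CmpOp
case object Gt extends CmpOp
case object Ge extends CmpOp

def prorate(op: CmpOp, lit: Double, min: Double, max: Double, ndv: Long): Double = {
  assert(max > min)
  op match {
    case Lt => if (lit == max) 1.0 - 1.0 / ndv else (lit - min) / (max - min)
    case Le => if (lit == min) 1.0 / ndv else (lit - min) / (max - min) // boundary is the only match
    case Gt => if (lit == min) 1.0 - 1.0 / ndv else (max - lit) / (max - min)
    case Ge => if (lit == max) 1.0 / ndv else (max - lit) / (max - min)
  }
}

println(prorate(Lt, 5.0, 0.0, 10.0, 10)) // 0.5
println(prorate(Ge, 10.0, 0.0, 10.0, 10)) // 0.1
```

The histogram branch added by the patch replaces exactly this uniformity assumption with per-bin occupation, which is what improves estimates for skewed data.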
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155963622
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
+    if (colStat.histogram.isEmpty) {
+      // returns 1/ndv if there is no histogram
+      Some(1.0 / BigDecimal(ndv))
+    } else {
+      // We compute filter selectivity using Histogram information.
--- End diff --
OK, will do.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155692778
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
   test("cbool > false") {
     validateEstimatedStats(
       Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)),
-      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true),
+      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true),
--- End diff --
That may need a special code path for the boolean type, but IMHO I don't think it deserves the complexity.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155691788
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -529,6 +570,56 @@ case class FilterEstimation(plan: Filter) extends Logging {
     Some(percent)
   }

+  /**
+   * Returns the selectivity percentage for binary condition in the column's
+   * current valid range [min, max]
+   *
+   * @param op a binary comparison operator
+   * @param histogram a numeric equi-height histogram
+   * @param max the upper bound of the current valid range for a given column
+   * @param min the lower bound of the current valid range for a given column
+   * @param datumNumber the numeric value of a literal
+   * @return the selectivity percentage for a condition in the current range.
+   */
+  def computePercentByEquiHeightHgm(
+      op: BinaryComparison,
+      histogram: Histogram,
+      max: Double,
+      min: Double,
+      datumNumber: Double): Double = {
+    // find bins where column's current min and max locate. Note that a column's [min, max]
+    // range may change due to another condition applied earlier.
+    val minBinId = EstimationUtils.findFirstBinForValue(min, histogram.bins)
+    val maxBinId = EstimationUtils.findLastBinForValue(max, histogram.bins)
+    assert(minBinId <= maxBinId)
+
+    // compute how many bins the column's current valid range [min, max] occupies.
+    // Note that a column's [min, max] range may vary after we apply some filter conditions.
+    val minToMaxLength = EstimationUtils.getOccupationBins(maxBinId, minBinId, max,
--- End diff --
Personally I prefer to have this method unit-tested, because it's the core part of filter estimation. We can do this in a follow-up anyway.
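The body of `computePercentByEquiHeightHgm` is truncated in the quote, so the following is only a guess at its overall shape for the `LessThan` case: selectivity as bins covered by [min, lit] over bins covered by the valid range [min, max]. All classes and the `lessThanSelectivity` name are simplified, hypothetical stand-ins:

```scala
case class HistogramBin(lo: Double, hi: Double, ndv: Long)
case class Histogram(height: Double, bins: Array[HistogramBin])

def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = 0
  while ((i < bins.length) && (value > bins(i).hi)) i += 1
  i
}

def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = bins.length - 1
  while ((i >= 0) && (value < bins(i).lo)) i -= 1
  i
}

def getOccupation(higherValue: Double, lowerValue: Double, bin: HistogramBin): Double = {
  if (bin.hi == bin.lo) 1.0
  else if (higherValue == lowerValue) 1.0 / bin.ndv.toDouble
  else math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
}

def getOccupationBins(higherId: Int, lowerId: Int, higherEnd: Double,
    lowerEnd: Double, histogram: Histogram): Double = {
  if (lowerId == higherId) {
    getOccupation(higherEnd, lowerEnd, histogram.bins(lowerId))
  } else {
    val lowerBin = histogram.bins(lowerId)
    val higherBin = histogram.bins(higherId)
    getOccupation(lowerBin.hi, lowerEnd, lowerBin) +
      getOccupation(higherEnd, higherBin.lo, higherBin) +
      (higherId - lowerId - 1)
  }
}

// Hypothetical `col < lit` selectivity: bins covered by [min, lit] over
// bins covered by the column's current valid range [min, max].
def lessThanSelectivity(lit: Double, min: Double, max: Double, hist: Histogram): Double = {
  val bins = hist.bins
  val minIdx = findFirstBinForValue(min, bins)
  val maxIdx = findLastBinForValue(max, bins)
  val validRangeBins = getOccupationBins(maxIdx, minIdx, max, min, hist)
  val coveredBins = getOccupationBins(findLastBinForValue(lit, bins), minIdx, lit, min, hist)
  coveredBins / validRangeBins
}

// With two uniform bins, `col < 5` over [0, 20] covers a quarter of the range.
val hist = Histogram(10.0, Array(HistogramBin(0.0, 10.0, 5), HistogramBin(10.0, 20.0, 5)))
println(lessThanSelectivity(5.0, 0.0, 20.0, hist)) // 0.25
```

On a uniform histogram this matches the old proration formula exactly; the two estimates diverge only when bin widths are unequal, i.e. when the data is skewed.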
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155690722
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
+    } else {
+      // We compute filter selectivity using Histogram information.
+      // Here we traverse histogram bins to locate the range of bins the literal value falls
+      // into. For a skewed distribution, a literal value can occupy multiple bins.
+      val hgmBins = colStat.histogram.get.bins
+      val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
--- End diff --
yes, I'll refactor this part.
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155683828
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
@@ -114,4 +114,144 @@ object EstimationUtils {
+  /**
+   * Returns the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
+   * The column value distribution is saved in an equi-height histogram.
+   *
+   * @param higherId id of the high end bin holding the high end value of a column range
+   * @param lowerId id of the low end bin holding the low end value of a column range
+   * @param higherEnd a given upper bound value of a specified column value range
+   * @param lowerEnd a given lower bound value of a specified column value range
+   * @param histogram a numeric equi-height histogram
+   * @return the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
+   */
+  def getOccupationNdv(
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155571394 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase { test("cbool > false") { validateEstimatedStats( Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)), - Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true), + Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true), --- End diff -- Logically this is wrong, although it's not a big deal to break this case. Is there any way we can fix it?
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155566119 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -471,37 +508,47 @@ case class FilterEstimation(plan: Filter) extends Logging { percent = 1.0 } else { // This is the partial overlap case: - // Without advanced statistics like histogram, we assume uniform data distribution. - // We just prorate the adjusted range over the initial range to compute filter selectivity. - assert(max > min) - percent = op match { -case _: LessThan => - if (numericLiteral == max) { -// If the literal value is right on the boundary, we can minus the part of the -// boundary value (1/ndv). -1.0 - 1.0 / ndv - } else { -(numericLiteral - min) / (max - min) - } -case _: LessThanOrEqual => - if (numericLiteral == min) { -// The boundary value is the only satisfying value. -1.0 / ndv - } else { -(numericLiteral - min) / (max - min) - } -case _: GreaterThan => - if (numericLiteral == min) { -1.0 - 1.0 / ndv - } else { -(max - numericLiteral) / (max - min) - } -case _: GreaterThanOrEqual => - if (numericLiteral == max) { -1.0 / ndv - } else { -(max - numericLiteral) / (max - min) - } + + if (colStat.histogram.isEmpty) { --- End diff -- yea please do it
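The partial-overlap branch quoted in the diff above assumes a uniform distribution when no histogram is available: prorate the literal over [min, max], except that a literal sitting exactly on a range boundary contributes (or excludes) 1/ndv. A hedged Python sketch of that formula; the function name and signature are mine, not Spark's:

```python
def comparison_selectivity(op, lit, col_min, col_max, ndv):
    """Selectivity of `col op lit` under a uniform-distribution assumption,
    for a literal in the partial-overlap case (col_min < col_max)."""
    assert col_min < col_max
    width = col_max - col_min
    if op == "<":
        # literal on the upper boundary: everything except the boundary value
        return 1.0 - 1.0 / ndv if lit == col_max else (lit - col_min) / width
    if op == "<=":
        # literal on the lower boundary: only the boundary value qualifies
        return 1.0 / ndv if lit == col_min else (lit - col_min) / width
    if op == ">":
        return 1.0 - 1.0 / ndv if lit == col_min else (col_max - lit) / width
    if op == ">=":
        return 1.0 / ndv if lit == col_max else (col_max - lit) / width
    raise ValueError(op)
```

For example, with a column ranging over [0, 10] and ndv = 5, `col < 5` is estimated at 0.5, while `col < 10` (literal on the boundary) is estimated at 1 - 1/5 = 0.8.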
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155564302 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + if (colStat.histogram.isEmpty) { +// returns 1/ndv if there is no histogram +Some(1.0 / BigDecimal(ndv)) + } else { +// We compute filter selectivity using Histogram information. +// Here we traverse histogram bins to locate the range of bins the literal values falls +// into. For skewed distribution, a literal value can occupy multiple bins. +val hgmBins = colStat.histogram.get.bins +val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble +var lowerId, higherId = -1 +for (i <- hgmBins.indices) { + // if datum > upperBound, just traverse to next bin + if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i + if (higherId < 0) { +if ((datum < hgmBins(i).hi || i == hgmBins.length - 1) || + ((datum == hgmBins(i).hi) && (datum < hgmBins(i + 1).hi))) { + higherId = i +} + } --- End diff -- how about ``` var lowerId = -1 var highIdFound = false var i = 0 while (i < hgmBins.length || highIdFound) { if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i if (datum >= hgmBins(i).lo) highIdFound = true } val highId = i ``` ---
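As written, the while-loop sketch in the comment above never advances `i` and inverts the exit condition. A working version of the intended scan, sketched in Python with illustrative names (clamped to valid indices, mirroring findFirstBinForValue/findLastBinForValue from the EstimationUtils diff):

```python
from collections import namedtuple

# Stand-in for Spark's HistogramBin (lo/hi bounds, distinct-value count).
HistogramBin = namedtuple("HistogramBin", ["lo", "hi", "ndv"])

def find_bin_range(datum, bins):
    """Return (lower_id, higher_id): the first bin whose hi bound reaches
    `datum` and the last bin whose lo bound does not exceed it. With an
    equi-height histogram, a skewed value can repeat as a bin boundary,
    so the two ids may differ."""
    lower_id = 0
    while lower_id < len(bins) - 1 and datum > bins[lower_id].hi:
        lower_id += 1
    higher_id = len(bins) - 1
    while higher_id > 0 and datum < bins[higher_id].lo:
        higher_id -= 1
    return lower_id, higher_id
```

If the value 5 is a boundary shared by several bins, say (0, 5), (5, 5), (5, 10), the result is (0, 2): the skewed literal occupies all three bins, which is exactly the case the comment in the quoted diff calls out.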
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155562073 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + if (colStat.histogram.isEmpty) { +// returns 1/ndv if there is no histogram +Some(1.0 / BigDecimal(ndv)) + } else { +// We compute filter selectivity using Histogram information. +// Here we traverse histogram bins to locate the range of bins the literal values falls +// into. For skewed distribution, a literal value can occupy multiple bins. +val hgmBins = colStat.histogram.get.bins +val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble +var lowerId, higherId = -1 +for (i <- hgmBins.indices) { --- End diff -- ditto, while loop here.
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155561210 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + if (colStat.histogram.isEmpty) { +// returns 1/ndv if there is no histogram +Some(1.0 / BigDecimal(ndv)) + } else { +// We compute filter selectivity using Histogram information. +// Here we traverse histogram bins to locate the range of bins the literal values falls +// into. For skewed distribution, a literal value can occupy multiple bins. +val hgmBins = colStat.histogram.get.bins +val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble --- End diff -- cc @wzhfy , would you refactor this part to always use Double for CBO computing?
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155560861 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155560523 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155560228 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155560015 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r19543 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + if (colStat.histogram.isEmpty) { +// returns 1/ndv if there is no histogram +Some(1.0 / BigDecimal(ndv)) + } else { +// We compute filter selectivity using Histogram information. --- End diff -- better to move these to a new method.
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r18939 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r18275 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,144 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column value falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the id of the first bin into which a column value falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i += 1 +} +i + } + + /** + * Returns the number of the last bin into which a column value falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the id of the last bin into which a column value falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -= 1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param bin a single histogram bin + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + higherValue: Double, + lowerValue: Double, + bin: HistogramBin): Double = { +assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi) +if (bin.hi == bin.lo) { + // the entire bin is covered in the range + 1.0 +} else if (higherValue == lowerValue) { + // set percentage to 1/NDV + 1.0 / bin.ndv.toDouble +} else { + // Use proration since the range falls inside this bin. + math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0) +} + } + + /** + * Returns the number of bins for column values in [lowerValue, higherValue]. + * This is an overloaded method. The column value distribution is saved in an + * equi-height histogram. + * + * @param higherId id of the high end bin holding the high end value of a column range + * @param lowerId id of the low end bin holding the low end value of a column range + * @param higherEnd a given upper bound value of a specified column value range + * @param lowerEnd a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the number of bins for column values in [lowerEnd, higherEnd]. + */ + def getOccupationBins( + higherId: Int, + lowerId: Int, + higherEnd: Double, + lowerEnd: Double, + histogram: Histogram): Double = { +if (lowerId == higherId) { + val curBin = histogram.bins(lowerId) + getOccupation(higherEnd, lowerEnd, curBin) +} else { + // compute how much lowerEnd/higherEnd occupies its bin + val lowerCurBin = histogram.bins(lowerId) + val lowerPart = getOccupation(lowerCurBin.hi, lowerEnd, lowerCurBin) + + val higherCurBin = histogram.bins(higherId) + val higherPart = getOccupation(higherEnd, higherCurBin.lo, higherCurBin) + + // the total length is lowerPart + higherPart + bins between them + lowerPart + higherPart + higherId - lowerId - 1 +} + } + + /** + * Returns the number of distinct values, ndv, for column values in [lowerEnd, higherEnd]. 
+ * The column value distribution is saved in an equi-height histogram. + * + * @param higherId id of the high end bin holding the high end value of a column range + * @param lowerId id of the low end bin holding the low end value of a column range + * @param higherEnd a given upper bound value of a specified column value range + * @param lowerEnd a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the number of distinct values, ndv, for column values in [lowerEnd, higherEnd]. + */ + def getOccupationNdv(
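The `getOccupationBins` arithmetic quoted above (the tail of the low bin, plus the head of the high bin, plus the full bins between them) can be checked with a small standalone sketch. `HistogramBin` and `Histogram` here are simplified stand-ins for the Catalyst classes, and the bin values are made up for illustration:

```scala
// Simplified stand-ins for Catalyst's histogram classes (assumed shapes).
case class HistogramBin(lo: Double, hi: Double, ndv: Long)
case class Histogram(height: Double, bins: Array[HistogramBin])

// Per-bin coverage, mirroring the private getOccupation in the diff.
def getOccupation(higherValue: Double, lowerValue: Double, bin: HistogramBin): Double =
  if (bin.hi == bin.lo) 1.0                                   // single-point bin
  else if (higherValue == lowerValue) 1.0 / bin.ndv.toDouble  // point predicate: 1/NDV
  else math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)

// Number of (possibly fractional) bins covered by [lowerEnd, higherEnd].
def getOccupationBins(
    higherId: Int, lowerId: Int,
    higherEnd: Double, lowerEnd: Double,
    histogram: Histogram): Double =
  if (lowerId == higherId) {
    getOccupation(higherEnd, lowerEnd, histogram.bins(lowerId))
  } else {
    val lowerBin = histogram.bins(lowerId)
    val lowerPart = getOccupation(lowerBin.hi, lowerEnd, lowerBin)     // tail of low bin
    val higherBin = histogram.bins(higherId)
    val higherPart = getOccupation(higherEnd, higherBin.lo, higherBin) // head of high bin
    lowerPart + higherPart + higherId - lowerId - 1                    // plus full bins between
  }

val hist = Histogram(100.0,
  Array(HistogramBin(0, 10, 5), HistogramBin(10, 20, 5), HistogramBin(20, 30, 5)))
// [5, 25] covers half of bin 0, all of bin 1, and half of bin 2.
assert(getOccupationBins(2, 0, 25.0, 5.0, hist) == 2.0)
// [12, 14] stays inside bin 1: 2/10 of that bin.
assert(math.abs(getOccupationBins(1, 1, 14.0, 12.0, hist) - 0.2) < 1e-12)
```

Note the fractional result for the single-bin case, which is why the return type is `Double` rather than an integer count.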
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r17934 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,144 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column value falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the id of the first bin into which a column value falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i += 1 +} +i + } + + /** + * Returns the number of the last bin into which a column value falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the id of the last bin into which a column value falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -= 1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param bin a single histogram bin + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + higherValue: Double, + lowerValue: Double, + bin: HistogramBin): Double = { +assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi) +if (bin.hi == bin.lo) { + // the entire bin is covered in the range + 1.0 +} else if (higherValue == lowerValue) { + // set percentage to 1/NDV + 1.0 / bin.ndv.toDouble +} else { + // Use proration since the range falls inside this bin. + math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0) +} + } + + /** + * Returns the number of bins for column values in [lowerValue, higherValue]. + * This is an overloaded method. The column value distribution is saved in an + * equi-height histogram. + * + * @param higherId id of the high end bin holding the high end value of a column range + * @param lowerId id of the low end bin holding the low end value of a column range + * @param higherEnd a given upper bound value of a specified column value range + * @param lowerEnd a given lower bound value of a specified column value range --- End diff -- do we have an assumption between `higherEnd`/`lowerEnd` and `higherId`/`lowerId`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
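On the question above: the relationship the code appears to rely on (an assumption, not stated in the diff) is that `lowerId`/`higherId` come from `findFirstBinForValue(lowerEnd, ...)` and `findLastBinForValue(higherEnd, ...)`, so each end lies inside its bin's `[lo, hi]` range and `lowerId <= higherId`. A minimal sketch, with `HistogramBin` as a simplified stand-in:

```scala
// Simplified stand-in for Catalyst's HistogramBin (assumed shape).
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = 0
  while (i < bins.length && value > bins(i).hi) i += 1
  i
}

def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = bins.length - 1
  while (i >= 0 && value < bins(i).lo) i -= 1
  i
}

val bins = Array(HistogramBin(0, 10, 5), HistogramBin(10, 20, 5), HistogramBin(20, 30, 5))
val (lowerEnd, higherEnd) = (12.0, 25.0)
val lowerId = findFirstBinForValue(lowerEnd, bins)
val higherId = findLastBinForValue(higherEnd, bins)

// The assumed invariant: each end falls inside its own bin's range.
assert(bins(lowerId).lo <= lowerEnd && lowerEnd <= bins(lowerId).hi)
assert(bins(higherId).lo <= higherEnd && higherEnd <= bins(higherId).hi)
assert(lowerId <= higherId)
```

Under that invariant, the `assert` inside `getOccupation` holds for each partial bin the caller passes in.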
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r17144 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (curBin.hi == curBin.lo) { + // the entire bin is covered in the range + 1.0 +} else if (higherValue == lowerValue) { + // set percentage to 1/NDV + 1.0 / curBin.ndv.toDouble +} else { + // Use proration since the range falls inside this bin. + math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0) +} + } + + /** + * Returns the number of bins for column values in [lowerValue, higherValue]. --- End diff -- i see. Let's explain this in the java doc. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155343962 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (curBin.hi == curBin.lo) { + // the entire bin is covered in the range + 1.0 +} else if (higherValue == lowerValue) { + // set percentage to 1/NDV + 1.0 / curBin.ndv.toDouble +} else { + // Use proration since the range falls inside this bin. + math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0) +} + } + + /** + * Returns the number of bins for column values in [lowerValue, higherValue]. --- End diff -- This is because we may return a percentage of a bin. For example, a predicate column=5 may return the number of bins 0.2 if the holding bin has 5 distinct values. Hence, we cannot return an integer type value. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155233875 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (curBin.hi == curBin.lo) { + // the entire bin is covered in the range + 1.0 +} else if (higherValue == lowerValue) { + // set percentage to 1/NDV + 1.0 / curBin.ndv.toDouble +} else { + // Use proration since the range falls inside this bin. + math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0) +} + } + + /** + * Returns the number of bins for column values in [lowerValue, higherValue]. --- End diff -- `Returns the number of bins...` why the return type is double? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155233734 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (curBin.hi == curBin.lo) { + // the entire bin is covered in the range + 1.0 +} else if (higherValue == lowerValue) { + // set percentage to 1/NDV + 1.0 / curBin.ndv.toDouble +} else { + // Use proration since the range falls inside this bin. + math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0) --- End diff -- shouldn't check the overlapping percentage? Do I miss some assumption for this method? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155233351 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (curBin.hi == curBin.lo) { + // the entire bin is covered in the range + 1.0 +} else if (higherValue == lowerValue) { + // set percentage to 1/NDV + 1.0 / curBin.ndv.toDouble --- End diff -- shouldn't we check the `lowerValue/higherValues` fits in the bin value range? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155233099 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (curBin.hi == curBin.lo) { + // the entire bin is covered in the range + 1.0 --- End diff -- I don't get it, shouldn't we check `lowerValue <= curBin.lo <= higherValue` here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155232397 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { --- End diff -- the method signature looks weird, shouldn't it be ``` private def getOccupation( higherValue: Double, lowerValue: Double, bin: HistogramBin) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
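A sketch of the suggested refactoring: passing the `HistogramBin` directly keeps `getOccupation` independent of the `Histogram` wrapper, while the id-based form only performs the lookup first. Class shapes here are assumptions for illustration:

```scala
// Simplified stand-ins for Catalyst's histogram classes (assumed shapes).
case class HistogramBin(lo: Double, hi: Double, ndv: Long)
case class Histogram(height: Double, bins: Array[HistogramBin])

// Suggested shape: the caller passes the bin, so no Histogram dependency.
def getOccupation(higherValue: Double, lowerValue: Double, bin: HistogramBin): Double =
  if (bin.hi == bin.lo) 1.0
  else if (higherValue == lowerValue) 1.0 / bin.ndv.toDouble
  else math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)

// Original shape: the method looks up the bin itself, then does the same work.
def getOccupationById(
    binId: Int, higherValue: Double, lowerValue: Double, histogram: Histogram): Double =
  getOccupation(higherValue, lowerValue, histogram.bins(binId))

val hist = Histogram(100.0, Array(HistogramBin(0, 10, 4)))
// Both forms agree; the bin-based one is the smaller unit to test.
assert(getOccupationById(0, 8.0, 3.0, hist) == getOccupation(8.0, 3.0, hist.bins(0)))
```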
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155231792 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of --- End diff -- `a percentage of a bin...` do you mean `percentage of bins`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155231547 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155231606 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 --- End diff -- a space after `-=` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155231523 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155231456 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155231340 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 --- End diff -- nit: a space after `+=` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155231264 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. --- End diff -- nit: `the id of the first bin into which the given value falls.` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155125157 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { --- End diff -- Good point. We can simplify the logic by iterating from bins.length-1 to 0. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155125029 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 --- End diff -- Good point. Actually while loop is better because it can exit early when the condition no longer qualifies. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154850738 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +for (i <- bins.indices) { + if (value > bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == bins(i).hi) && (i < bins.length - 1) && (value == bins(i + 1).lo)) { +// We assume the above 3 conditions will be evaluated from left to right sequentially. +// If the above 3 conditions are evaluated out-of-order, then out-of-bound error may happen. +// At that time, we should split the third condition into another if statement. 
+// increment binId since the value appears in this bin and next bin +binId += 1 + } +} +binId + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. + */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (binId == 0 && curBin.hi == curBin.lo) { + // the Min of the histogram occupies the whole first bin + 1.0 +} else if (binId == 0 && curBin.hi != curBin.lo) { + if (higherValue == lowerValue) { +// set percentage to 1/NDV +1.0 / curBin.ndv.toDouble + } else { +// Use proration since the range falls inside this bin. +(higherValue - lowerValue) / (curBin.hi - curBin.lo) + } +} else { + if (curBin.hi == curBin.lo) { +// the entire bin is covered in the range +1.0 + } else if (higherValue == lowerValue) { +// set percentage to 1/NDV +1.0 / curBin.ndv.toDouble + } else { +// Use proration since the range falls inside this bin. +math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0) + } +} + } + + /** + * Returns the number of bins for column values in [lowerValue, higherValue]. + * The column value distribution is saved in an equi-height histogram. + * + * @param higherEnd a given upper bound value of a specified column value range + * @param lowerEnd a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the selectivity percentage for column values in [lowerValue, higherValue]. 
+ */ + def getOccupationBins( + higherEnd: Double, + lowerEnd: Double, + histogram: Histogram): Double = { +// find bins where current min and max locate +val lowerBinId = findFirstBinForValue(lowerEnd, histogram.bins) +val higherBinId = findLastBinForValue(higherEnd, histogram.bins) +assert(lowerBinId <= higherBinId) + +// compute how much current [lowerEnd, higherEnd] range occupies the histogram in the +// number of bins +getOccupationBins(higherBinId, lowerBinId, higherEnd, lowerEnd, histogram) + } +
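The `getOccupation` refactoring discussed in the surrounding comments (pass the `HistogramBin` directly instead of `(binId, histogram)`, and fold away the duplicated `binId == 0` branches) can be sketched as follows. This is a hedged sketch of the reviewers' suggestion, not the final Spark implementation; the clamp via `math.min` is applied unconditionally here for simplicity.

```scala
// Assumed bin shape, as in the quoted diff.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

// Fraction of a single bin covered by [lowerValue, higherValue].
def getOccupation(bin: HistogramBin, higherValue: Double, lowerValue: Double): Double = {
  if (bin.hi == bin.lo) {
    // A single-point bin is either fully covered or not reached at all.
    1.0
  } else if (higherValue == lowerValue) {
    // Point predicate: assume the bin's values are uniformly distributed over its ndv.
    1.0 / bin.ndv.toDouble
  } else {
    // Prorate the range against the bin's width, clamped to 1.0.
    math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
  }
}
```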
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154848509 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +for (i <- bins.indices) { + if (value > bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == bins(i).hi) && (i < bins.length - 1) && (value == bins(i + 1).lo)) { +// We assume the above 3 conditions will be evaluated from left to right sequentially. +// If the above 3 conditions are evaluated out-of-order, then out-of-bound error may happen. +// At that time, we should split the third condition into another if statement. 
+// increment binId since the value appears in this bin and next bin +binId += 1 + } +} +binId + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. + */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (binId == 0 && curBin.hi == curBin.lo) { + // the Min of the histogram occupies the whole first bin + 1.0 +} else if (binId == 0 && curBin.hi != curBin.lo) { + if (higherValue == lowerValue) { +// set percentage to 1/NDV +1.0 / curBin.ndv.toDouble + } else { +// Use proration since the range falls inside this bin. +(higherValue - lowerValue) / (curBin.hi - curBin.lo) --- End diff -- why do we need to specialize it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154848428 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +for (i <- bins.indices) { + if (value > bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == bins(i).hi) && (i < bins.length - 1) && (value == bins(i + 1).lo)) { +// We assume the above 3 conditions will be evaluated from left to right sequentially. +// If the above 3 conditions are evaluated out-of-order, then out-of-bound error may happen. +// At that time, we should split the third condition into another if statement. 
+// increment binId since the value appears in this bin and next bin +binId += 1 + } +} +binId + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. + */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (binId == 0 && curBin.hi == curBin.lo) { + // the Min of the histogram occupies the whole first bin + 1.0 +} else if (binId == 0 && curBin.hi != curBin.lo) { + if (higherValue == lowerValue) { +// set percentage to 1/NDV +1.0 / curBin.ndv.toDouble + } else { +// Use proration since the range falls inside this bin. +(higherValue - lowerValue) / (curBin.hi - curBin.lo) --- End diff -- this is the only branch we need to specialize for `binId=0`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154848277 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +for (i <- bins.indices) { + if (value > bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == bins(i).hi) && (i < bins.length - 1) && (value == bins(i + 1).lo)) { +// We assume the above 3 conditions will be evaluated from left to right sequentially. +// If the above 3 conditions are evaluated out-of-order, then out-of-bound error may happen. +// At that time, we should split the third condition into another if statement. 
+// increment binId since the value appears in this bin and next bin +binId += 1 + } +} +binId + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. + */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { --- End diff -- instead of accepting a `binId` and `histogram`, can't we just ask the caller side to pass a `HistogramBin`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154848018 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { --- End diff -- Why is this method so different from `findFirstBinForValue`? It looks like we just need to reverse the iteration order, i.e. from `bins.length` to `0`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154847802 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 --- End diff -- this looks more like a while loop pattern, can we use while loop here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154847703 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 --- End diff -- shall we use binary search here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
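The binary-search variant the reviewer raises here would be a lower-bound search over the (sorted) bin upper bounds: find the first bin whose `hi` is at least the value. A sketch, under the same assumed `HistogramBin` shape as above:

```scala
// Assumed bin shape from the diff.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

// Index of the first bin with bins(i).hi >= value, i.e. the same answer as the
// linear scan "while value > bins(i).hi", in O(log n) instead of O(n).
// Returns bins.length when value exceeds every bin's upper bound.
def findFirstBinBinarySearch(value: Double, bins: Array[HistogramBin]): Int = {
  var lo = 0
  var hi = bins.length // search invariant: answer is in [lo, hi]
  while (lo < hi) {
    val mid = (lo + hi) / 2
    if (bins(mid).hi < value) lo = mid + 1 else hi = mid
  }
  lo
}
```

With the default of at most 254 bins, the linear scan is already cheap, which is presumably why a while loop with early exit was considered sufficient.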
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154270654 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +histogram.bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + + def findLastBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +for (i <- 0 until histogram.bins.length) { + if (value > histogram.bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) { +if (value == histogram.bins(i + 1).lo) { --- End diff -- just move this condition after the length check: ``` if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1) && (value == histogram.bins(i + 1).lo)) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154255068 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -578,6 +590,112 @@ class FilterEstimationSuite extends StatsEstimationTestBase { expectedRowCount = 5) } + // The following test cases have histogram information collected for the test column + test("Not(cintHgm < 3 AND null)") { +val condition = Not(And(LessThan(attrIntHgm, Literal(3)), Literal(null, IntegerType))) +validateEstimatedStats( + Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> colStatIntHgm.copy(distinctCount = 6)), + expectedRowCount = 9) + } + + test("cintHgm = 5") { +validateEstimatedStats( + Filter(EqualTo(attrIntHgm, Literal(5)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(5), max = Some(5), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 4) + } + + test("cintHgm = 0") { +// This is an out-of-range case since 0 is outside the range [min, max] +validateEstimatedStats( + Filter(EqualTo(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm < 3") { +validateEstimatedStats( + Filter(LessThan(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 2) + } + + test("cintHgm < 0") { +// This is a corner case since literal 0 is smaller than min. 
+validateEstimatedStats( + Filter(LessThan(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm <= 3") { +validateEstimatedStats( + Filter(LessThanOrEqual(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 2) + } + + test("cintHgm > 6") { +validateEstimatedStats( + Filter(GreaterThan(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 2, min = Some(6), max = Some(10), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 2) + } + + test("cintHgm > 10") { +// This is a corner case since max value is 10. +validateEstimatedStats( + Filter(GreaterThan(attrIntHgm, Literal(10)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm >= 6") { +validateEstimatedStats( + Filter(GreaterThanOrEqual(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 3, min = Some(6), max = Some(10), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 4) + } + + test("cintHgm IS NULL") { +validateEstimatedStats( + Filter(IsNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm IS NOT NULL") { +validateEstimatedStats( + Filter(IsNotNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 6, min = Some(1), max = Some(10), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 10) + } + + test("cintHgm > 3 AND cintHgm <= 6") { +val condition = And(GreaterThan(attrIntHgm, + Literal(3)), LessThanOrEqual(attrIntHgm, Literal(6))) +validateEstimatedStats( + Filter(condition, 
childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 5, min = Some(3), max = Some(6), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 8) + } + + test("cintHgm = 3 OR cintHgm = 6") { --- End diff -- We have added histogram test cases for skewed distribution. I will add more histogram test cases for non-skewed distribution. --- - To unsubscribe,
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154254419 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase { test("cbool > false") { validateEstimatedStats( Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)), - Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true), + Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true), --- End diff -- My earlier comment mentioned this test case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154254145 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -784,11 +879,16 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) { def outputColumnStats(rowsBeforeFilter: BigInt, rowsAfterFilter: BigInt) : AttributeMap[ColumnStat] = { val newColumnStats = originalMap.map { case (attr, oriColStat) => - // Update ndv based on the overall filter selectivity: scale down ndv if the number of rows - // decreases; otherwise keep it unchanged. - val newNdv = EstimationUtils.updateNdv(oldNumRows = rowsBeforeFilter, -newNumRows = rowsAfterFilter, oldNdv = oriColStat.distinctCount) val colStat = updatedMap.get(attr.exprId).map(_._2).getOrElse(oriColStat) + val newNdv = if (colStat.distinctCount > 1) { --- End diff -- The old code does not work well for a couple of new skewed-distribution tests. For example, test("cintHgm < 3") would fail. Because it still computes to find newNdv in updateNdv() method. But, in reality, we already scale it down to 1. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154252063 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -513,10 +560,9 @@ case class FilterEstimation(plan: Filter) extends Logging { op match { case _: GreaterThan | _: GreaterThanOrEqual => -// If new ndv is 1, then new max must be equal to new min. -newMin = if (newNdv == 1) newMax else newValue +newMin = newValue case _: LessThan | _: LessThanOrEqual => -newMax = if (newNdv == 1) newMin else newValue +newMax = newValue --- End diff -- Previously I coded that way because of a corner test case: test("cbool > false"). At that time, I set the newMin to newMax since newNdv = 1. However, this logic does not work well for the skewed distribution test case: test ("cintHgm < 3"). In this test, newMin=1 newMax=3. I think the revised code makes better sense. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
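The revised bound update described here reduces to: the literal simply becomes the new min (for `>` / `>=`) or the new max (for `<` / `<=`), with the old "collapse min to max when ndv == 1" rule dropped, since with a histogram (as in `cintHgm < 3`) min and max can legitimately differ even when the estimated ndv is 1. A minimal sketch with hypothetical operator names:

```scala
// Hypothetical operator tags standing in for Catalyst's comparison expressions.
sealed trait CmpOp
case object Gt extends CmpOp // covers GreaterThan / GreaterThanOrEqual
case object Lt extends CmpOp // covers LessThan / LessThanOrEqual

// Returns the updated (min, max) after applying `col op value` to a column
// whose current range is [oldMin, oldMax].
def updateBounds(op: CmpOp, oldMin: Double, oldMax: Double, value: Double): (Double, Double) =
  op match {
    case Gt => (value, oldMax) // value becomes the new lower bound
    case Lt => (oldMin, value) // value becomes the new upper bound
  }
```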
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154250197 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + // We compute filter selectivity using Histogram information + attr.dataType match { +case StringType | BinaryType => + Some(1.0 / BigDecimal(ndv)) + +case _ => + // returns 1/ndv if there is no histogram + if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv)) + + // We traverse histogram bins to locate the literal value + val hgmBins = colStat.histogram.get.bins + val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble + // find the interval where this datum locates + var lowerId, higherId = -1 + for (i <- hgmBins.indices) { +// if datum > upperBound, just move to next bin +if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i +if (higherId < 0) { + if ((datum < hgmBins(i).hi || i == hgmBins.length - 1) || +((datum == hgmBins(i).hi) && (datum < hgmBins(i + 1).hi))) { +higherId = i + } +} + } + assert(lowerId <= higherId) + val lowerBinNdv = hgmBins(lowerId).ndv + val higherBinNdv = hgmBins(higherId).ndv + // assume uniform distribution in each bin + val percent = if (lowerId == higherId) { +(1.0 / hgmBins.length) / math.max(lowerBinNdv, 1) + } else { +1.0 / hgmBins.length * (higherId - lowerId - 1) + + (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1) + + (1.0 / hgmBins.length) / math.max(higherBinNdv, 1) + } + Some(percent) --- End diff -- Good point. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
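The equality-selectivity formula quoted in this diff can be isolated into a small sketch: each bin holds `1 / numBins` of the rows, so a literal falling in a single bin contributes `(1 / numBins) / ndv` of the table, while a skewed literal spanning several bins gets the middle bins in full plus a `1 / ndv` share of each boundary bin. This sketch assumes uniform distribution within a bin, as the diff does.

```scala
// Assumed bin shape from the diff.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

// Selectivity of `col = literal` given the first and last bin ids the literal
// falls into (lowerId <= higherId, both valid indices into bins).
def equalitySelectivity(lowerId: Int, higherId: Int, bins: Array[HistogramBin]): Double = {
  val binShare = 1.0 / bins.length // each equi-height bin holds this row fraction
  if (lowerId == higherId) {
    binShare / math.max(bins(lowerId).ndv, 1L)
  } else {
    // Middle bins are covered entirely by the skewed value; the two boundary
    // bins contribute a 1/ndv share each.
    binShare * (higherId - lowerId - 1) +
      binShare / math.max(bins(lowerId).ndv, 1L) +
      binShare / math.max(bins(higherId).ndv, 1L)
  }
}
```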
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154249995 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +histogram.bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + + def findLastBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +for (i <- 0 until histogram.bins.length) { + if (value > histogram.bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) { +if (value == histogram.bins(i + 1).lo) { --- End diff -- No. I meant the upper bound for the array of bins in a histogram. The default length of the histogram bin array is 254. When i is equal to 253 (the last bin), then i+1 is 254 leading to out-of-bound error. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154248775 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +histogram.bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + + def findLastBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +for (i <- 0 until histogram.bins.length) { + if (value > histogram.bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) { +if (value == histogram.bins(i + 1).lo) { --- End diff -- By "out of bound", do you mean it exceeds 100 length limit? You can just switch new line after `&&` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154248457

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0

--- End diff --

If a user changes the data, statistics will be removed or re-collected (only size, currently); Spark already implements this.
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154225069

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    histogram.bins.foreach { bin =>
+      if (value > bin.hi) binId += 1
+    }
+    binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the last bin into which a column value falls.
+   */
+
+  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    for (i <- 0 until histogram.bins.length) {
+      if (value > histogram.bins(i).hi) {
+        // increment binId to point to next bin
+        binId += 1
+      }
+      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
+        if (value == histogram.bins(i + 1).lo) {

--- End diff --

I used two statements instead of one because, when i points to the last bin, the condition `value == histogram.bins(i + 1).lo` may go out of bounds. By separating the conditions into two statements, we can be sure that the out-of-bounds error will not happen.
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154223769

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0

--- End diff --

same comment as in my last reply.
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154223705

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0

--- End diff --

I hesitate to add an assert statement here, because an assert like this may crash the Spark system if a user does not refresh his data statistics promptly. In the real world, a user may load data, collect statistics, and then add more incremental data without collecting statistics immediately. He may issue a SQL query against the newly added data, such as "WHERE column=xxx", where xxx is a new value from the incremental load. After all, statistics are auxiliary; a query should still run even if the statistics are not up to date.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153975383

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -529,6 +575,55 @@ case class FilterEstimation(plan: Filter) extends Logging {
     Some(percent)
   }
 
+  /**
+   * Returns the selectivity percentage for a combined op-dataNumber in the column's
+   * current valid range [min, max]

--- End diff --

Returns the selectivity percentage for a binary condition given the column's current value range.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153973312

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -471,37 +508,47 @@ case class FilterEstimation(plan: Filter) extends Logging {
       percent = 1.0
     } else {
       // This is the partial overlap case:
-      // Without advanced statistics like histogram, we assume uniform data distribution.
-      // We just prorate the adjusted range over the initial range to compute filter selectivity.
-      assert(max > min)
-      percent = op match {
-        case _: LessThan =>
-          if (numericLiteral == max) {
-            // If the literal value is right on the boundary, we can minus the part of the
-            // boundary value (1/ndv).
-            1.0 - 1.0 / ndv
-          } else {
-            (numericLiteral - min) / (max - min)
-          }
-        case _: LessThanOrEqual =>
-          if (numericLiteral == min) {
-            // The boundary value is the only satisfying value.
-            1.0 / ndv
-          } else {
-            (numericLiteral - min) / (max - min)
-          }
-        case _: GreaterThan =>
-          if (numericLiteral == min) {
-            1.0 - 1.0 / ndv
-          } else {
-            (max - numericLiteral) / (max - min)
-          }
-        case _: GreaterThanOrEqual =>
-          if (numericLiteral == max) {
-            1.0 / ndv
-          } else {
-            (max - numericLiteral) / (max - min)
-          }
+
+      if (colStat.histogram.isEmpty) {

--- End diff --

move this `if` to the upper level, then we can reduce the code diff
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153974257

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---

@@ -578,6 +590,112 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
     expectedRowCount = 5)
   }
 
+  // The following test cases have histogram information collected for the test column
+  test("Not(cintHgm < 3 AND null)") {
+    val condition = Not(And(LessThan(attrIntHgm, Literal(3)), Literal(null, IntegerType)))
+    validateEstimatedStats(
+      Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> colStatIntHgm.copy(distinctCount = 6)),
+      expectedRowCount = 9)
+  }
+
+  test("cintHgm = 5") {
+    validateEstimatedStats(
+      Filter(EqualTo(attrIntHgm, Literal(5)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(5), max = Some(5),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 4)
+  }
+
+  test("cintHgm = 0") {
+    // This is an out-of-range case since 0 is outside the range [min, max]
+    validateEstimatedStats(
+      Filter(EqualTo(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm < 3") {
+    validateEstimatedStats(
+      Filter(LessThan(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 2)
+  }
+
+  test("cintHgm < 0") {
+    // This is a corner case since literal 0 is smaller than min.
+    validateEstimatedStats(
+      Filter(LessThan(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm <= 3") {
+    validateEstimatedStats(
+      Filter(LessThanOrEqual(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 2)
+  }
+
+  test("cintHgm > 6") {
+    validateEstimatedStats(
+      Filter(GreaterThan(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 2, min = Some(6), max = Some(10),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 2)
+  }
+
+  test("cintHgm > 10") {
+    // This is a corner case since max value is 10.
+    validateEstimatedStats(
+      Filter(GreaterThan(attrIntHgm, Literal(10)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm >= 6") {
+    validateEstimatedStats(
+      Filter(GreaterThanOrEqual(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 3, min = Some(6), max = Some(10),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 4)
+  }
+
+  test("cintHgm IS NULL") {
+    validateEstimatedStats(
+      Filter(IsNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm IS NOT NULL") {
+    validateEstimatedStats(
+      Filter(IsNotNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 6, min = Some(1), max = Some(10),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 10)
+  }
+
+  test("cintHgm > 3 AND cintHgm <= 6") {
+    val condition = And(GreaterThan(attrIntHgm,
+      Literal(3)), LessThanOrEqual(attrIntHgm, Literal(6)))
+    validateEstimatedStats(
+      Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 5, min = Some(3), max = Some(6),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 8)
+  }
+
+  test("cintHgm = 3 OR cintHgm = 6") {

--- End diff --

I think we don't need test cases for combination conditions like AND, OR, NOT, because histogram doesn't affect estimation logic for them. Instead, we need to test more cases for histogram, e.g. =, >=, >, <=, <, and for
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153972336

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
         colStatsMap.update(attr, newStats)
       }
 
-      Some(1.0 / BigDecimal(ndv))
-    } else {
+      // We compute filter selectivity using Histogram information
+      attr.dataType match {
+        case StringType | BinaryType =>
+          Some(1.0 / BigDecimal(ndv))
+
+        case _ =>
+          // returns 1/ndv if there is no histogram
+          if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv))
+
+          // We traverse histogram bins to locate the literal value
+          val hgmBins = colStat.histogram.get.bins
+          val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
+          // find the interval where this datum locates

--- End diff --

we can remove this comment, it's explained above
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153973342

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -471,37 +508,47 @@ case class FilterEstimation(plan: Filter) extends Logging {
       percent = 1.0
     } else {

--- End diff --

`else if (colStat.histogram.isEmpty)`
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153976879

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -529,6 +575,55 @@ case class FilterEstimation(plan: Filter) extends Logging {
     Some(percent)
   }
 
+  /**
+   * Returns the selectivity percentage for a combined op-dataNumber in the column's
+   * current valid range [min, max]
+   *
+   * @param op a binary comparison operator
+   * @param histogram a numeric equi-height histogram
+   * @param max the upper bound of the current valid range for a given column
+   * @param min the lower bound of the current valid range for a given column
+   * @param datumNumber the numeric value of a literal
+   * @return the selectivity percentage for a condition in the current range.
+   */
+
+  def computePercentForNumericEquiHeightHgm(

--- End diff --

`computePercentByEquiHeightHgm` or just `computePercentByHistogram`
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153972295

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
         colStatsMap.update(attr, newStats)
       }
 
-      Some(1.0 / BigDecimal(ndv))
-    } else {
+      // We compute filter selectivity using Histogram information
+      attr.dataType match {
+        case StringType | BinaryType =>
+          Some(1.0 / BigDecimal(ndv))
+
+        case _ =>
+          // returns 1/ndv if there is no histogram
+          if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv))
+
+          // We traverse histogram bins to locate the literal value

--- End diff --

This comment is not accurate; here we want to get the bins occupied by the literal value, because if the value is skewed, it can occupy multiple bins.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153977915

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    histogram.bins.foreach { bin =>
+      if (value > bin.hi) binId += 1
+    }
+    binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the last bin into which a column value falls.
+   */
+
+  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    for (i <- 0 until histogram.bins.length) {
+      if (value > histogram.bins(i).hi) {
+        // increment binId to point to next bin
+        binId += 1
+      }
+      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
+        if (value == histogram.bins(i + 1).lo) {

--- End diff --

merge the two `if`s:

if ((value == histogram.bins(i).hi) && (value == histogram.bins(i + 1).lo) &&
    (i < histogram.bins.length - 1))
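A side note on merging these conditions: `&&` in Scala short-circuits left to right, so the index bound check has to appear before the `bins(i + 1)` access, which is exactly the out-of-bounds concern raised earlier in this thread. A small sketch (simplified `HistogramBin`, hypothetical helper name) showing the safe ordering:

```scala
// Simplified stand-in for Spark's HistogramBin.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

// True when `value` sits exactly on the boundary between bin i and bin i + 1.
// The bound check comes first, so bins(i + 1) is only evaluated when i + 1
// is a valid index; && short-circuits and never touches it for the last bin.
def spansIntoNextBin(i: Int, value: Double, bins: Array[HistogramBin]): Boolean = {
  i < bins.length - 1 && value == bins(i).hi && value == bins(i + 1).lo
}
```

With this ordering the two nested `if`s can be merged into one condition without reintroducing the out-of-bounds access.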
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153977791

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    histogram.bins.foreach { bin =>
+      if (value > bin.hi) binId += 1
+    }
+    binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the last bin into which a column value falls.
+   */
+
+  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0

--- End diff --

add assert here too
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153978787

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+

--- End diff --

could you remove the empty line between the method comment and its definition? same for other methods here.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153976643

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {

--- End diff --

we can just pass the bin array as a parameter, this can simplify the code. same for other methods.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153972847

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
         colStatsMap.update(attr, newStats)
       }
 
-      Some(1.0 / BigDecimal(ndv))
-    } else {
+      // We compute filter selectivity using Histogram information
+      attr.dataType match {
+        case StringType | BinaryType =>
+          Some(1.0 / BigDecimal(ndv))
+
+        case _ =>
+          // returns 1/ndv if there is no histogram
+          if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv))
+
+          // We traverse histogram bins to locate the literal value
+          val hgmBins = colStat.histogram.get.bins
+          val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
+          // find the interval where this datum locates
+          var lowerId, higherId = -1
+          for (i <- hgmBins.indices) {
+            // if datum > upperBound, just move to next bin
+            if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i
+            if (higherId < 0) {
+              if ((datum < hgmBins(i).hi || i == hgmBins.length - 1) ||
+                  ((datum == hgmBins(i).hi) && (datum < hgmBins(i + 1).hi))) {
+                higherId = i
+              }
+            }
+          }
+          assert(lowerId <= higherId)
+          val lowerBinNdv = hgmBins(lowerId).ndv
+          val higherBinNdv = hgmBins(higherId).ndv
+          // assume uniform distribution in each bin
+          val percent = if (lowerId == higherId) {
+            (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1)
+          } else {
+            1.0 / hgmBins.length * (higherId - lowerId - 1) +
+              (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1) +
+              (1.0 / hgmBins.length) / math.max(higherBinNdv, 1)
+          }
+          Some(percent)

--- End diff --

How about simplifying the above logic as:
```
val occupiedBins = if (lowerId == higherId) {
  1.0 / lowerBinNdv
} else {
  (higherId - lowerId - 1) + 1.0 / lowerBinNdv + 1.0 / higherBinNdv
}
Some(occupiedBins / hgmBins.length)
```
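The simplification suggested above can be sketched as a standalone function. The `equalitySelectivity` name and the simplified `HistogramBin` are illustrative only (the PR inlines this logic in `FilterEstimation`); each bin of an equi-height histogram holds 1/`bins.length` of the rows, and values inside a bin are assumed uniformly distributed over its ndv:

```scala
// Simplified stand-in for Spark's HistogramBin.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

// Selectivity of `column = c` when c occupies bins lowerId..higherId.
// Interior bins are fully occupied by c (skew case); the two boundary
// bins each contribute 1/ndv of a bin.
def equalitySelectivity(lowerId: Int, higherId: Int, bins: Array[HistogramBin]): Double = {
  val lowerBinNdv = math.max(bins(lowerId).ndv.toDouble, 1)
  val higherBinNdv = math.max(bins(higherId).ndv.toDouble, 1)
  val occupiedBins =
    if (lowerId == higherId) 1.0 / lowerBinNdv
    else (higherId - lowerId - 1) + 1.0 / lowerBinNdv + 1.0 / higherBinNdv
  occupiedBins / bins.length
}
```

For a value confined to one bin with ndv 2 in a 4-bin histogram this gives (1/2)/4 = 0.125, matching the `percent` computed by the longer form in the quoted diff.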
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153972517

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
         colStatsMap.update(attr, newStats)
       }
 
-      Some(1.0 / BigDecimal(ndv))
-    } else {
+      // We compute filter selectivity using Histogram information
+      attr.dataType match {
+        case StringType | BinaryType =>
+          Some(1.0 / BigDecimal(ndv))
+
+        case _ =>
+          // returns 1/ndv if there is no histogram
+          if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv))
+
+          // We traverse histogram bins to locate the literal value
+          val hgmBins = colStat.histogram.get.bins
+          val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
+          // find the interval where this datum locates
+          var lowerId, higherId = -1
+          for (i <- hgmBins.indices) {
+            // if datum > upperBound, just move to next bin

--- End diff --

please remove the comment, it does not match the logic at the next line (there's no "move" logic)
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153979575

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    histogram.bins.foreach { bin =>
+      if (value > bin.hi) binId += 1
+    }
+    binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the last bin into which a column value falls.
+   */
+
+  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    for (i <- 0 until histogram.bins.length) {
+      if (value > histogram.bins(i).hi) {
+        // increment binId to point to next bin
+        binId += 1
+      }
+      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
+        if (value == histogram.bins(i + 1).lo) {
+          // increment binId since the value appears in this bin and the next bin
+          binId += 1
+        }
+      }
+    }
+    binId
+  }
+
+  /**
+   * Returns a percentage of a bin holding values for column value in the range of
+   * [lowerValue, higherValue]
+   *
+   * @param binId a given bin id in a specified histogram
+   * @param higherValue a given upper bound value of a specified column value range
+   * @param lowerValue a given lower bound value of a specified column value range
+   * @param histogram a numeric equi-height histogram
+   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
+   */
+
+  private def getOccupation(
+      binId: Int,
+      higherValue: Double,
+      lowerValue: Double,
+      histogram: Histogram): Double = {
+    val curBin = histogram.bins(binId)
+    if (binId == 0 && curBin.hi == curBin.lo) {
+      // the Min of the histogram occupies the whole first bin
+      1.0
+    } else if (binId == 0 && curBin.hi != curBin.lo) {
+      if (higherValue == lowerValue) {
+        // in the case curBin.binNdv == 0, current bin is occupied by one value, which
+        // is included in the previous bin
+        1.0 / math.max(curBin.ndv.toDouble, 1)
+      } else {
+        (higherValue - lowerValue) / (curBin.hi - curBin.lo)
+      }
+    } else {
+      if (curBin.hi == curBin.lo) {
+        // the entire bin is covered in the range
+        1.0
+      } else if (higherValue == lowerValue) {
+        // the literal value falls in this bin
+        1.0 / math.max(curBin.ndv.toDouble, 1)
+      } else {
+        // Use proration since the range falls inside this bin.
+        math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0)
+      }
+    }
+  }
+
+  /**
+   * Returns the number of bins for column values in [lowerValue, higherValue].
+   * The column value distribution is saved in an equi-height histogram.
+   *
+   * @param higherEnd a given upper bound value of a specified column value range
+   * @param lowerEnd a given lower bound value of a specified column value range
+   * @param histogram a numeric equi-height histogram
+   * @return the selectivity percentage for column values in [lowerValue, higherValue].
+   */
+
+  def getOccupationBins(
+      higherEnd: Double,
+      lowerEnd: Double,
+      histogram: Histogram): Double = {
+    // find bins where current min and max locate
+    val minBinId = findFirstBinForValue(lowerEnd, histogram)
+    val maxBinId = findLastBinForValue(higherEnd, histogram)
+    assert(minBinId <= maxBinId)
+
+    // compute how much current [min, max] occupy the histogram, in the number of bins
+    getOccupationBins(maxBinId, minBinId, higherEnd, lowerEnd, histogram)
+  }
+
+  /**
+   * Returns the number of bins for column values in [lowerValue, higherValue].
+   * This is an overloaded method. The column value distribution is saved in an
+   * equi-height histogram.
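The core arithmetic of the `getOccupation` helper quoted above, reduced to a standalone sketch (simplified `HistogramBin`; the special-casing of bin 0 in the quoted diff is omitted here for brevity):

```scala
// Simplified stand-in for Spark's HistogramBin.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

// Fraction of a single bin covered by [lowerValue, higherValue], assuming
// values are uniformly distributed inside the bin.
def getOccupation(bin: HistogramBin, higherValue: Double, lowerValue: Double): Double = {
  if (bin.hi == bin.lo) {
    // single-point bin: any range touching it covers it entirely
    1.0
  } else if (higherValue == lowerValue) {
    // a single literal inside the bin takes 1/ndv of the bin
    1.0 / math.max(bin.ndv.toDouble, 1)
  } else {
    // prorate the range over the bin's width, capped at the whole bin
    math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
  }
}
```

`getOccupationBins` then sums these per-bin fractions between the first and last bin ids to express a value range as a (possibly fractional) number of occupied bins.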
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153979438

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    histogram.bins.foreach { bin =>
+      if (value > bin.hi) binId += 1
+    }
+    binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the last bin into which a column value falls.
+   */
+
+  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    for (i <- 0 until histogram.bins.length) {
+      if (value > histogram.bins(i).hi) {
+        // increment binId to point to next bin
+        binId += 1
+      }
+      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
+        if (value == histogram.bins(i + 1).lo) {
+          // increment binId since the value appears in this bin and the next bin
+          binId += 1
+        }
+      }
+    }
+    binId
+  }
+
+  /**
+   * Returns a percentage of a bin holding values for column value in the range of
+   * [lowerValue, higherValue]
+   *
+   * @param binId a given bin id in a specified histogram
+   * @param higherValue a given upper bound value of a specified column value range
+   * @param lowerValue a given lower bound value of a specified column value range
+   * @param histogram a numeric equi-height histogram
+   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
+   */
+
+  private def getOccupation(
+      binId: Int,
+      higherValue: Double,
+      lowerValue: Double,
+      histogram: Histogram): Double = {
+    val curBin = histogram.bins(binId)
+    if (binId == 0 && curBin.hi == curBin.lo) {
+      // the Min of the histogram occupies the whole first bin
+      1.0
+    } else if (binId == 0 && curBin.hi != curBin.lo) {
+      if (higherValue == lowerValue) {
+        // in the case curBin.binNdv == 0, current bin is occupied by one value, which

--- End diff --

binNdv will never be zero
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153976662 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +histogram.bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + + def findLastBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +for (i <- 0 until histogram.bins.length) { --- End diff -- i <- bins.indices --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153970902 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + // We compute filter selectivity using Histogram information --- End diff -- move this comment to where the histogram computation really starts
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153975698 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -784,11 +879,16 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) { def outputColumnStats(rowsBeforeFilter: BigInt, rowsAfterFilter: BigInt) : AttributeMap[ColumnStat] = { val newColumnStats = originalMap.map { case (attr, oriColStat) => - // Update ndv based on the overall filter selectivity: scale down ndv if the number of rows - // decreases; otherwise keep it unchanged. - val newNdv = EstimationUtils.updateNdv(oldNumRows = rowsBeforeFilter, -newNumRows = rowsAfterFilter, oldNdv = oriColStat.distinctCount) val colStat = updatedMap.get(attr.exprId).map(_._2).getOrElse(oriColStat) + val newNdv = if (colStat.distinctCount > 1) { --- End diff -- no need to add extra check here, in `EstimationUtils.updateNdv` we already check this case. We can just revert the change here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153973781 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -578,6 +590,112 @@ class FilterEstimationSuite extends StatsEstimationTestBase { expectedRowCount = 5) } + // The following test cases have histogram information collected for the test column + test("Not(cintHgm < 3 AND null)") { +val condition = Not(And(LessThan(attrIntHgm, Literal(3)), Literal(null, IntegerType))) +validateEstimatedStats( + Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> colStatIntHgm.copy(distinctCount = 6)), + expectedRowCount = 9) + } + + test("cintHgm = 5") { +validateEstimatedStats( + Filter(EqualTo(attrIntHgm, Literal(5)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(5), max = Some(5), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 4) + } + + test("cintHgm = 0") { +// This is an out-of-range case since 0 is outside the range [min, max] +validateEstimatedStats( + Filter(EqualTo(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm < 3") { +validateEstimatedStats( + Filter(LessThan(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 2) + } + + test("cintHgm < 0") { +// This is a corner case since literal 0 is smaller than min. 
+validateEstimatedStats( + Filter(LessThan(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm <= 3") { +validateEstimatedStats( + Filter(LessThanOrEqual(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 2) + } + + test("cintHgm > 6") { +validateEstimatedStats( + Filter(GreaterThan(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 2, min = Some(6), max = Some(10), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 2) + } + + test("cintHgm > 10") { +// This is a corner case since max value is 10. +validateEstimatedStats( + Filter(GreaterThan(attrIntHgm, Literal(10)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm >= 6") { +validateEstimatedStats( + Filter(GreaterThanOrEqual(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 3, min = Some(6), max = Some(10), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 4) + } + + test("cintHgm IS NULL") { +validateEstimatedStats( + Filter(IsNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm IS NOT NULL") { --- End diff -- histogram does not affect estimation logic for `IS NULL` and `IS NOT NULL` filter conditions, we can remove these two test cases. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153977299 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 --- End diff -- for robustness, add `assert(bins.head.lo <= value && bins.last.hi >= value)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153977529 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +histogram.bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + + def findLastBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +for (i <- 0 until histogram.bins.length) { + if (value > histogram.bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) { +if (value == histogram.bins(i + 1).lo) { + // increment binId since the value appears into this bin and next bin --- End diff -- appears in both this bin and the next bin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153974940 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -471,37 +508,47 @@ case class FilterEstimation(plan: Filter) extends Logging { percent = 1.0 } else { // This is the partial overlap case: - // Without advanced statistics like histogram, we assume uniform data distribution. - // We just prorate the adjusted range over the initial range to compute filter selectivity. - assert(max > min) - percent = op match { -case _: LessThan => - if (numericLiteral == max) { -// If the literal value is right on the boundary, we can minus the part of the -// boundary value (1/ndv). -1.0 - 1.0 / ndv - } else { -(numericLiteral - min) / (max - min) - } -case _: LessThanOrEqual => - if (numericLiteral == min) { -// The boundary value is the only satisfying value. -1.0 / ndv - } else { -(numericLiteral - min) / (max - min) - } -case _: GreaterThan => - if (numericLiteral == min) { -1.0 - 1.0 / ndv - } else { -(max - numericLiteral) / (max - min) - } -case _: GreaterThanOrEqual => - if (numericLiteral == max) { -1.0 / ndv - } else { -(max - numericLiteral) / (max - min) - } + + if (colStat.histogram.isEmpty) { +// Without advanced statistics like histogram, we assume uniform data distribution. +// We just prorate the adjusted range over the initial range to compute filter selectivity. +assert(max > min) +percent = op match { + case _: LessThan => +if (numericLiteral == max) { + // If the literal value is right on the boundary, we can minus the part of the + // boundary value (1/ndv). + 1.0 - 1.0 / ndv +} else { + (numericLiteral - min) / (max - min) +} + case _: LessThanOrEqual => +if (numericLiteral == min) { + // The boundary value is the only satisfying value. 
+ 1.0 / ndv +} else { + (numericLiteral - min) / (max - min) +} + case _: GreaterThan => +if (numericLiteral == min) { + 1.0 - 1.0 / ndv +} else { + (max - numericLiteral) / (max - min) +} + case _: GreaterThanOrEqual => +if (numericLiteral == max) { + 1.0 / ndv +} else { + (max - numericLiteral) / (max - min) +} +} + } else { +val numericHistogram = colStat.histogram.get +val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble +val maxDouble = EstimationUtils.toDecimal(colStat.max.get, literal.dataType).toDouble +val minDouble = EstimationUtils.toDecimal(colStat.min.get, literal.dataType).toDouble --- End diff -- `max, min` is good enough I think --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
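The no-histogram branch in the diff above prorates the literal over the `[min, max]` range under a uniform-distribution assumption, with `1/ndv` adjustments at the boundaries. A minimal Python sketch of that selectivity formula (illustrative only; the real code works on Catalyst expressions, not strings):

```python
def range_selectivity(op, lit, mn, mx, ndv):
    """Estimate P(col op lit) assuming uniform data in [mn, mx].

    op is one of "<", "<=", ">", ">=": a stand-in for the Catalyst
    comparison operators in the partial-overlap case (mn < lit < mx,
    or lit on a boundary).
    """
    assert mx > mn
    if op == "<":
        # If lit is exactly max, subtract the boundary value's share (1/ndv).
        return 1.0 - 1.0 / ndv if lit == mx else (lit - mn) / (mx - mn)
    if op == "<=":
        # If lit is exactly min, the boundary value is the only match.
        return 1.0 / ndv if lit == mn else (lit - mn) / (mx - mn)
    if op == ">":
        return 1.0 - 1.0 / ndv if lit == mn else (mx - lit) / (mx - mn)
    if op == ">=":
        return 1.0 / ndv if lit == mx else (mx - lit) / (mx - mn)
    raise ValueError(op)
```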
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153973474 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -513,10 +560,9 @@ case class FilterEstimation(plan: Filter) extends Logging { op match { case _: GreaterThan | _: GreaterThanOrEqual => -// If new ndv is 1, then new max must be equal to new min. -newMin = if (newNdv == 1) newMax else newValue +newMin = newValue case _: LessThan | _: LessThanOrEqual => -newMax = if (newNdv == 1) newMin else newValue +newMax = newValue --- End diff -- why were these two lines changed?
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153973544 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase { test("cbool > false") { validateEstimatedStats( Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)), - Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true), + Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true), --- End diff -- why did the result change?
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153972982 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + // We compute filter selectivity using Histogram information + attr.dataType match { +case StringType | BinaryType => + Some(1.0 / BigDecimal(ndv)) + +case _ => + // returns 1/ndv if there is no histogram + if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv)) + + // We traverse histogram bins to locate the literal value + val hgmBins = colStat.histogram.get.bins + val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble + // find the interval where this datum locates + var lowerId, higherId = -1 + for (i <- hgmBins.indices) { +// if datum > upperBound, just move to next bin +if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i +if (higherId < 0) { + if ((datum < hgmBins(i).hi || i == hgmBins.length - 1) || +((datum == hgmBins(i).hi) && (datum < hgmBins(i + 1).hi))) { +higherId = i + } +} + } + assert(lowerId <= higherId) + val lowerBinNdv = hgmBins(lowerId).ndv + val higherBinNdv = hgmBins(higherId).ndv + // assume uniform distribution in each bin + val percent = if (lowerId == higherId) { +(1.0 / hgmBins.length) / math.max(lowerBinNdv, 1) + } else { +1.0 / hgmBins.length * (higherId - lowerId - 1) + + (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1) + + (1.0 / hgmBins.length) / math.max(higherBinNdv, 1) --- End diff -- bin's ndv will never be less than 1, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
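The equality-selectivity computation in the diff above amounts to the following Python sketch. It assumes an equi-height histogram (each bin covers `1/numBins` of the rows) and a uniform distribution within each bin; bins are modeled as hypothetical `(lo, hi, ndv)` triples rather than Spark's `HistogramBin`. As the reviewer notes, a bin's ndv is never below 1, so the `max(..., 1)` guards from the diff are kept only defensively:

```python
def equal_selectivity(value, bins):
    # bins: sorted, contiguous (lo, hi, ndv) triples of an equi-height
    # histogram; each bin holds height = 1/len(bins) of the rows.
    height = 1.0 / len(bins)
    first = next(i for i, (lo, hi, ndv) in enumerate(bins) if value <= hi)
    last = max(i for i, (lo, hi, ndv) in enumerate(bins) if value >= lo)
    if first == last:
        # value lies inside one bin: its share of that bin is 1/ndv.
        return height / max(bins[first][2], 1)
    # value sits on shared bin boundaries: full middle bins plus a
    # 1/ndv share of each of the two edge bins.
    return (height * (last - first - 1)
            + height / max(bins[first][2], 1)
            + height / max(bins[last][2], 1))

bins = [(1.0, 3.0, 3), (3.0, 6.0, 3), (6.0, 10.0, 4)]
```

For example, with the three bins above, a literal of 5.0 falls inside the middle bin only, giving `(1/3)/3 = 1/9`, while the boundary value 3.0 draws a `1/ndv` share from both of the first two bins.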
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153979157 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +histogram.bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the last bin into which a column values falls. 
+ */ + + def findLastBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +for (i <- 0 until histogram.bins.length) { + if (value > histogram.bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) { +if (value == histogram.bins(i + 1).lo) { + // increment binId since the value appears into this bin and next bin + binId += 1 +} + } +} +binId + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. + */ + + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (binId == 0 && curBin.hi == curBin.lo) { + // the Min of the histogram occupies the whole first bin + 1.0 +} else if (binId == 0 && curBin.hi != curBin.lo) { + if (higherValue == lowerValue) { +// in the case curBin.binNdv == 0, current bin is occupied by one value, which +// is included in the previous bin +1.0 / math.max(curBin.ndv.toDouble, 1) + } else { +(higherValue - lowerValue) / (curBin.hi - curBin.lo) + } +} else { + if (curBin.hi == curBin.lo) { +// the entire bin is covered in the range +1.0 + } else if (higherValue == lowerValue) { +// the literal value falls in this bin +1.0 / math.max(curBin.ndv.toDouble, 1) + } else { +// Use proration since the range falls inside this bin. +math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0) + } +} + } + + /** + * Returns the number of bins for column values in [lowerValue, higherValue]. 
+ * The column value distribution is saved in an equi-height histogram. + * + * @param higherEnd a given upper bound value of a specified column value range + * @param lowerEnd a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the selectivity percentage for column values in [lowerValue, higherValue]. + */ + + def getOccupationBins( + higherEnd: Double, + lowerEnd: Double, + histogram: Histogram): Double = { +// find bins where current min and max locate +val minBinId = findFirstBinForValue(lowerEnd, histogram) +val maxBinId = findLastBinForValue(higherEnd, histogram) --- End diff -- how about `lowerBinId, higherBinId`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
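`getOccupationBins` sums per-bin occupation over `[lowerEnd, higherEnd]`: a partial share of the two edge bins plus one full unit for each bin wholly inside the range. A simplified Python sketch of that accumulation (it omits the first-bin special cases debated in the review above, and the `(lo, hi, ndv)` triples are hypothetical stand-ins for `HistogramBin`):

```python
def get_occupation(bin_, hi_val, lo_val):
    # Fraction of one bin covered by [lo_val, hi_val], assuming a
    # uniform distribution inside the bin.
    lo, hi, ndv = bin_
    if hi == lo:
        return 1.0                      # single-value bin: fully covered
    if hi_val == lo_val:
        return 1.0 / max(ndv, 1)        # a single literal inside the bin
    return min((hi_val - lo_val) / (hi - lo), 1.0)

def get_occupation_bins(hi_end, lo_end, bins):
    # Locate the bins holding the range endpoints (cf. lowerBinId /
    # higherBinId in the review suggestion).
    lo_id = next(i for i, (lo, hi, ndv) in enumerate(bins) if lo_end <= hi)
    hi_id = max(i for i, (lo, hi, ndv) in enumerate(bins) if hi_end >= lo)
    if lo_id == hi_id:
        return get_occupation(bins[lo_id], hi_end, lo_end)
    total = get_occupation(bins[lo_id], bins[lo_id][1], lo_end)   # left edge
    total += get_occupation(bins[hi_id], hi_end, bins[hi_id][0])  # right edge
    total += hi_id - lo_id - 1                                    # full middle bins
    return total

bins = [(1.0, 3.0, 3), (3.0, 6.0, 3), (6.0, 10.0, 4)]
```

Dividing the returned bin count by the total number of bins then yields the range's selectivity, since each equi-height bin holds the same number of rows.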
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153970815 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + // We compute filter selectivity using Histogram information + attr.dataType match { --- End diff -- use `if (colStat.histogram.isEmpty)` to separate the logic of basic stats (`Some(1.0 / BigDecimal(ndv))`) and histogram computation.
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153902382 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin/bucket into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin/bucket into which a column values falls. + */ + + def findFirstBucketForValue(value: Double, histogram: Histogram): Int = { --- End diff -- We had bucket(s) and bin(s) used interchangeably. To avoid confusion, I will unify them to use only bin/bins. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153686107 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin/bucket into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin/bucket into which a column values falls. + */ + + def findFirstBucketForValue(value: Double, histogram: Histogram): Int = { --- End diff -- Shall we unify all names to `bin`/`bins` in code and comments? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r151979117 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -158,8 +196,8 @@ class FilterEstimationSuite extends StatsEstimationTestBase { val condition = Not(And(LessThan(attrInt, Literal(3)), Literal(null, IntegerType))) validateEstimatedStats( Filter(condition, childStatsTestPlan(Seq(attrInt), 10L)), - Seq(attrInt -> colStatInt.copy(distinctCount = 8)), - expectedRowCount = 8) + Seq(attrInt -> colStatInt.copy(distinctCount = 7)), --- End diff -- +1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r151890602 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -158,8 +196,8 @@ class FilterEstimationSuite extends StatsEstimationTestBase { val condition = Not(And(LessThan(attrInt, Literal(3)), Literal(null, IntegerType))) validateEstimatedStats( Filter(condition, childStatsTestPlan(Seq(attrInt), 10L)), - Seq(attrInt -> colStatInt.copy(distinctCount = 8)), - expectedRowCount = 8) + Seq(attrInt -> colStatInt.copy(distinctCount = 7)), --- End diff -- Shall we add new test cases for filter estimation based on histogram, instead of modifying existing test results? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org