[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19783 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281819
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
@@ -114,4 +114,99 @@ object EstimationUtils {
     }
   }

+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the id of the first bin into which a column value falls.
+   */
+  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
+    var i = 0
+    while ((i < bins.length) && (value > bins(i).hi)) {
+      i += 1
+    }
+    i
+  }
+
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the id of the last bin into which a column value falls.
+   */
+  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
+    var i = bins.length - 1
+    while ((i >= 0) && (value < bins(i).lo)) {
+      i -= 1
+    }
+    i
+  }
+
+  /**
+   * Returns a percentage of a bin holding values for column value in the range of
+   * [lowerValue, higherValue]
+   *
+   * @param higherValue a given upper bound value of a specified column value range
+   * @param lowerValue a given lower bound value of a specified column value range
+   * @param bin a single histogram bin
+   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
+   */
+  private def getOccupation(
+      higherValue: Double,
+      lowerValue: Double,
+      bin: HistogramBin): Double = {
+    assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
+    if (bin.hi == bin.lo) {
+      // the entire bin is covered in the range
+      1.0
+    } else if (higherValue == lowerValue) {
+      // set percentage to 1/NDV
+      1.0 / bin.ndv.toDouble
+    } else {
+      // Use proration since the range falls inside this bin.
+      math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
+    }
+  }
+
+  /**
+   * Returns the number of bins for column values in [lowerValue, higherValue].
+   * The column value distribution is saved in an equi-height histogram. The return value is a
+   * double because we may return a portion of a bin. For example, the predicate
+   * "column = 8" may return 0.2 bins if the bin holding the value has 5 distinct values.
+   *
+   * @param higherId id of the high end bin holding the high end value of a column range
+   * @param lowerId id of the low end bin holding the low end value of a column range
+   * @param higherEnd a given upper bound value of a specified column value range
+   * @param lowerEnd a given lower bound value of a specified column value range
+   * @param histogram a numeric equi-height histogram
+   * @return the number of bins for column values in [lowerEnd, higherEnd].
+   */
+  def getOccupationBins(
+      higherId: Int,
+      lowerId: Int,
+      higherEnd: Double,
+      lowerEnd: Double,
+      histogram: Histogram): Double = {
+    assert(lowerId <= higherId)
+
+    if (lowerId == higherId) {
+      val curBin = histogram.bins(lowerId)
+      getOccupation(higherEnd, lowerEnd, curBin)
+    } else {
+      // compute how much lowerEnd/higherEnd occupies its bin
+      val lowerCurBin = histogram.bins(lowerId)
+      val lowerPart = getOccupation(lowerCurBin.hi, lowerEnd, lowerCurBin)
--- End diff --
shall we assert that `lowerBin.lo <= lowerEnd`
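For readers following the review, the two bin-search helpers can be exercised standalone. This is a minimal sketch, not the catalyst code: `HistogramBin` here is a simplified stand-in for Spark's class, and the example histogram is invented to show why the first and last indexes can differ for a skewed value.

```scala
// Simplified stand-in for Spark's catalyst HistogramBin: lower/upper bounds
// plus the number of distinct values the bin holds.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

// Linear scan from the left: first bin whose upper bound reaches the value.
def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = 0
  while ((i < bins.length) && (value > bins(i).hi)) {
    i += 1
  }
  i
}

// Linear scan from the right: last bin whose lower bound does not exceed the value.
def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = bins.length - 1
  while ((i >= 0) && (value < bins(i).lo)) {
    i -= 1
  }
  i
}

// A skewed value (5.0 here) can span several bins of an equi-height
// histogram, so the first and last matching bin indexes differ.
val bins = Array(
  HistogramBin(1.0, 5.0, 5),
  HistogramBin(5.0, 5.0, 1),
  HistogramBin(5.0, 10.0, 4))

println(findFirstBinForValue(5.0, bins)) // 0
println(findLastBinForValue(5.0, bins))  // 2
```

The middle bin collapses to a single point, which is exactly the skewed-distribution case the patch is designed to handle.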
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281132
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the id of the first bin into which a column value falls.
--- End diff --
Seems this is redundant, shall we remove it?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281044
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
--- End diff --
nit: `number` -> `index`?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281155
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
+  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
+    var i = 0
+    while ((i < bins.length) && (value > bins(i).hi)) {
+      i += 1
+    }
+    i
+  }
+
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
--- End diff --
ditto
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281583
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
+  /**
+   * Returns the number of bins for column values in [lowerValue, higherValue].
+   * The column value distribution is saved in an equi-height histogram. The return value is a
+   * double because we may return a portion of a bin. For example, the predicate
+   * "column = 8" may return 0.2 bins if the bin holding the value has 5 distinct values.
+   *
+   * @param higherId id of the high end bin holding the high end value of a column range
--- End diff --
nit: `higherIndex`
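To make the occupation math concrete, here is a hedged standalone sketch of the two helpers from the quoted hunk. `Histogram` and `HistogramBin` are simplified stand-ins for the catalyst classes (the `height` field is carried along but unused here), and the sample histogram is invented:

```scala
case class HistogramBin(lo: Double, hi: Double, ndv: Long)
case class Histogram(height: Double, bins: Array[HistogramBin])

// Fraction of one bin covered by [lowerValue, higherValue]; a zero-width
// range (a point predicate) is estimated as 1/ndv of the bin.
def getOccupation(higherValue: Double, lowerValue: Double, bin: HistogramBin): Double = {
  assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi)
  if (bin.hi == bin.lo) 1.0
  else if (higherValue == lowerValue) 1.0 / bin.ndv.toDouble
  else math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
}

// Possibly-fractional count of bins covered by [lowerEnd, higherEnd]:
// partial coverage of the two boundary bins plus the full bins in between.
def getOccupationBins(higherId: Int, lowerId: Int, higherEnd: Double,
    lowerEnd: Double, histogram: Histogram): Double = {
  assert(lowerId <= higherId)
  if (lowerId == higherId) {
    getOccupation(higherEnd, lowerEnd, histogram.bins(lowerId))
  } else {
    val lowerBin = histogram.bins(lowerId)
    val higherBin = histogram.bins(higherId)
    getOccupation(lowerBin.hi, lowerEnd, lowerBin) +
      getOccupation(higherEnd, higherBin.lo, higherBin) +
      (higherId - lowerId - 1)
  }
}

val hist = Histogram(5.0, Array(
  HistogramBin(0.0, 10.0, 5), HistogramBin(10.0, 20.0, 5), HistogramBin(20.0, 30.0, 5)))
// [5, 25] covers half of bin 0, all of bin 1 and half of bin 2.
println(getOccupationBins(2, 0, 25.0, 5.0, hist)) // 2.0
// "column = 8" with ndv = 5 in its bin occupies 0.2 bins, matching the scaladoc example.
println(getOccupationBins(0, 0, 8.0, 8.0, hist))
```

The second call shows why the return type is a double rather than an integer bin count.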
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156282033
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -332,8 +332,44 @@ case class FilterEstimation(plan: Filter) extends Logging {
       colStatsMap.update(attr, newStats)
     }
+      // We compute filter selectivity using Histogram information.
+      val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
+      val histogram = colStat.histogram.get
+      val hgmBins = histogram.bins
+
+      // find bins where column's current min and max locate. Note that a column's [min, max]
+      // range may change due to another condition applied earlier.
+      val min = EstimationUtils.toDecimal(colStat.min.get, literal.dataType).toDouble
+      val max = EstimationUtils.toDecimal(colStat.max.get, literal.dataType).toDouble
+      val minBinId = EstimationUtils.findFirstBinForValue(min, hgmBins)
--- End diff --
nit: `minBinIndex`
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281162
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the id of the last bin into which a column value falls.
--- End diff --
ditto
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281918
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging {
       colStatsMap.update(attr, newStats)
     }
-    Some(1.0 / BigDecimal(ndv))
-  } else {
+    if (colStat.histogram.isEmpty) {
+      // returns 1/ndv if there is no histogram
+      Some(1.0 / BigDecimal(ndv))
+    } else {
+      // We compute filter selectivity using Histogram information.
--- End diff --
did you create a new method?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156282461
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -332,8 +332,44 @@ case class FilterEstimation(plan: Filter) extends Logging {
       colStatsMap.update(attr, newStats)
     }
-    Some(1.0 / BigDecimal(ndv))
-  } else {
+    if (colStat.histogram.isEmpty) {
+      // returns 1/ndv if there is no histogram
+      Some(1.0 / BigDecimal(ndv))
+    } else {
+      // We compute filter selectivity using Histogram information.
+      val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
+      val histogram = colStat.histogram.get
+      val hgmBins = histogram.bins
+
+      // find bins where column's current min and max locate. Note that a column's [min, max]
+      // range may change due to another condition applied earlier.
+      val min = EstimationUtils.toDecimal(colStat.min.get, literal.dataType).toDouble
+      val max = EstimationUtils.toDecimal(colStat.max.get, literal.dataType).toDouble
+      val minBinId = EstimationUtils.findFirstBinForValue(min, hgmBins)
+      val maxBinId = EstimationUtils.findLastBinForValue(max, hgmBins)
+
+      // compute how many bins the column's current valid range [min, max] occupies.
+      // Note that a column's [min, max] range may vary after we apply some filter conditions.
+      val validRangeBins = EstimationUtils.getOccupationBins(maxBinId, minBinId, max,
+        min, histogram)
+
+      val lowerBinId = EstimationUtils.findFirstBinForValue(datum, hgmBins)
+      val higherBinId = EstimationUtils.findLastBinForValue(datum, hgmBins)
+      assert(lowerBinId <= higherBinId)
+      val lowerBinNdv = hgmBins(lowerBinId).ndv
+      val higherBinNdv = hgmBins(higherBinId).ndv
+      // assume uniform distribution in each bin
+      val occupiedBins = if (lowerBinId == higherBinId) {
--- End diff --
is this just `EstimationUtils.getOccupationBins(higherBinId, lowerBinId, datum, datum, histogram)`?
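The equality-predicate estimate being reviewed here can be sketched end to end. This is a hedged reconstruction, not the catalyst code: `HistogramBin`/`Histogram` are simplified stand-ins, `equalitySelectivity` is an illustrative name, and, following the reviewer's observation, the literal's occupation is computed as `getOccupationBins` over the zero-width range [datum, datum]:

```scala
case class HistogramBin(lo: Double, hi: Double, ndv: Long)
case class Histogram(height: Double, bins: Array[HistogramBin])

def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = 0
  while ((i < bins.length) && (value > bins(i).hi)) i += 1
  i
}

def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = bins.length - 1
  while ((i >= 0) && (value < bins(i).lo)) i -= 1
  i
}

def getOccupation(higherValue: Double, lowerValue: Double, bin: HistogramBin): Double = {
  if (bin.hi == bin.lo) 1.0
  else if (higherValue == lowerValue) 1.0 / bin.ndv.toDouble
  else math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
}

def getOccupationBins(higherId: Int, lowerId: Int, higherEnd: Double,
    lowerEnd: Double, histogram: Histogram): Double = {
  if (lowerId == higherId) {
    getOccupation(higherEnd, lowerEnd, histogram.bins(lowerId))
  } else {
    val lowerBin = histogram.bins(lowerId)
    val higherBin = histogram.bins(higherId)
    getOccupation(lowerBin.hi, lowerEnd, lowerBin) +
      getOccupation(higherEnd, higherBin.lo, higherBin) +
      (higherId - lowerId - 1)
  }
}

// Selectivity of `column = datum` within the column's current valid range
// [min, max]: bins occupied by the literal over bins covered by the range.
def equalitySelectivity(datum: Double, min: Double, max: Double, hist: Histogram): Double = {
  val bins = hist.bins
  val validRangeBins = getOccupationBins(
    findLastBinForValue(max, bins), findFirstBinForValue(min, bins), max, min, hist)
  val occupiedBins = getOccupationBins(
    findLastBinForValue(datum, bins), findFirstBinForValue(datum, bins), datum, datum, hist)
  occupiedBins / validRangeBins
}

// 5.0 is a skewed value spanning three bins: it occupies 0.2 + 1.0 + 0.25 = 1.45
// of the 3.0 bins covered by [1, 10].
val hist = Histogram(4.0, Array(
  HistogramBin(1.0, 5.0, 5), HistogramBin(5.0, 5.0, 1), HistogramBin(5.0, 10.0, 4)))
println(equalitySelectivity(5.0, 1.0, 10.0, hist))
```

A plain 1/ndv estimate would give 0.1 here; the histogram-aware estimate is roughly 0.48, reflecting the skew toward the value 5.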
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r156281223
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
+  /**
+   * Returns a percentage of a bin holding values for column value in the range of
+   * [lowerValue, higherValue]
+   *
+   * @param higherValue a given upper bound value of a specified column value range
+   * @param lowerValue a given lower bound value of a specified column value range
+   * @param bin a single histogram bin
+   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
--- End diff --
redundant
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155964396
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
@@ -114,4 +114,171 @@ object EstimationUtils {
     }
   }

+  /**
+   * Returns the number of the first bin into which a column values falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the first bin into which a column values falls.
+   */
+  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
+    var i = 0
+    while ((i < bins.length) && (value > bins(i).hi)) {
+      i +=1
+    }
+    i
+  }
+
+  /**
+   * Returns the number of the last bin into which a column values falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the last bin into which a column values falls.
+   */
+  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
+    var i = bins.length - 1
+    while ((i >= 0) && (value < bins(i).lo)) {
+      i -=1
+    }
+    i
+  }
+
+  /**
+   * Returns a percentage of a bin holding values for column value in the range of
+   * [lowerValue, higherValue]
+   *
+   * @param binId a given bin id in a specified histogram
+   * @param higherValue a given upper bound value of a specified column value range
+   * @param lowerValue a given lower bound value of a specified column value range
+   * @param histogram a numeric equi-height histogram
+   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
+   */
+  private def getOccupation(
+      binId: Int,
+      higherValue: Double,
+      lowerValue: Double,
+      histogram: Histogram): Double = {
+    val curBin = histogram.bins(binId)
+    if (curBin.hi == curBin.lo) {
+      // the entire bin is covered in the range
+      1.0
+    } else if (higherValue == lowerValue) {
+      // set percentage to 1/NDV
+      1.0 / curBin.ndv.toDouble
+    } else {
+      // Use proration since the range falls inside this bin.
+      math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0)
+    }
+  }
+
+  /**
+   * Returns the number of bins for column values in [lowerValue, higherValue].
--- End diff --
OK. Done.
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155963930
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
   test("cbool > false") {
     validateEstimatedStats(
       Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)),
-      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true),
+      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true),
--- End diff --
Agreed with wzhfy. Today's logic is: for the two conditions (column > x) and (column >= x), we set the min value to x; we do not distinguish the two cases. This is because we do not know the exact next value larger than x when x is of a continuous data type like double. We could add special handling for discrete data types such as boolean or integer, but, as wzhfy said, it does not deserve the complexity.
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155963740
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -471,37 +508,47 @@ case class FilterEstimation(plan: Filter) extends Logging {
       percent = 1.0
     } else {
       // This is the partial overlap case:
-      // Without advanced statistics like histogram, we assume uniform data distribution.
-      // We just prorate the adjusted range over the initial range to compute filter selectivity.
-      assert(max > min)
-      percent = op match {
-        case _: LessThan =>
-          if (numericLiteral == max) {
-            // If the literal value is right on the boundary, we can minus the part of the
-            // boundary value (1/ndv).
-            1.0 - 1.0 / ndv
-          } else {
-            (numericLiteral - min) / (max - min)
-          }
-        case _: LessThanOrEqual =>
-          if (numericLiteral == min) {
-            // The boundary value is the only satisfying value.
-            1.0 / ndv
-          } else {
-            (numericLiteral - min) / (max - min)
-          }
-        case _: GreaterThan =>
-          if (numericLiteral == min) {
-            1.0 - 1.0 / ndv
-          } else {
-            (max - numericLiteral) / (max - min)
-          }
-        case _: GreaterThanOrEqual =>
-          if (numericLiteral == max) {
-            1.0 / ndv
-          } else {
-            (max - numericLiteral) / (max - min)
-          }
+
+      if (colStat.histogram.isEmpty) {
--- End diff --
We cannot move the if statement to an upper level. This is because, for the partial-overlap case, we need to update the [min, max] range for a given column. For the no-overlap and complete-overlap cases, we do not need to do so. I think the current code is modular for this reason.
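The pre-histogram branch being replaced in the hunk above reduces to a small proration rule. This hypothetical sketch (names like `CmpOp` and `prorate` are illustrative, not Spark's) shows the uniform-distribution assumption with the 1/ndv boundary adjustments:

```scala
// Uniform-distribution proration over the valid range [min, max], with
// 1/ndv adjustments when the literal sits exactly on a range boundary.
sealed trait CmpOp
case object Lt extends CmpOp
case object Le extends CmpOp
case object Gt extends CmpOp
case object Ge extends CmpOp

def prorate(op: CmpOp, lit: Double, min: Double, max: Double, ndv: Long): Double = {
  assert(max > min)
  op match {
    case Lt => if (lit == max) 1.0 - 1.0 / ndv else (lit - min) / (max - min)
    case Le => if (lit == min) 1.0 / ndv else (lit - min) / (max - min) // boundary is the only match
    case Gt => if (lit == min) 1.0 - 1.0 / ndv else (max - lit) / (max - min)
    case Ge => if (lit == max) 1.0 / ndv else (max - lit) / (max - min)
  }
}

println(prorate(Lt, 5.0, 0.0, 10.0, 10)) // 0.5
println(prorate(Ge, 10.0, 0.0, 10.0, 10)) // 0.1
```

The histogram branch added by the patch replaces exactly this uniformity assumption with per-bin occupation, which is what improves estimates for skewed data.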
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155963622
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
+    if (colStat.histogram.isEmpty) {
+      // returns 1/ndv if there is no histogram
+      Some(1.0 / BigDecimal(ndv))
+    } else {
+      // We compute filter selectivity using Histogram information.
--- End diff --
OK, will do.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155692778
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---
@@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
   test("cbool > false") {
     validateEstimatedStats(
       Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)),
-      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true),
+      Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true),
--- End diff --
That may need a special code path for the boolean type, but IMHO I don't think it deserves the complexity.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155691788
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -529,6 +570,56 @@ case class FilterEstimation(plan: Filter) extends Logging {
     Some(percent)
   }

+  /**
+   * Returns the selectivity percentage for binary condition in the column's
+   * current valid range [min, max]
+   *
+   * @param op a binary comparison operator
+   * @param histogram a numeric equi-height histogram
+   * @param max the upper bound of the current valid range for a given column
+   * @param min the lower bound of the current valid range for a given column
+   * @param datumNumber the numeric value of a literal
+   * @return the selectivity percentage for a condition in the current range.
+   */
+  def computePercentByEquiHeightHgm(
+      op: BinaryComparison,
+      histogram: Histogram,
+      max: Double,
+      min: Double,
+      datumNumber: Double): Double = {
+    // find bins where column's current min and max locate. Note that a column's [min, max]
+    // range may change due to another condition applied earlier.
+    val minBinId = EstimationUtils.findFirstBinForValue(min, histogram.bins)
+    val maxBinId = EstimationUtils.findLastBinForValue(max, histogram.bins)
+    assert(minBinId <= maxBinId)
+
+    // compute how many bins the column's current valid range [min, max] occupies.
+    // Note that a column's [min, max] range may vary after we apply some filter conditions.
+    val minToMaxLength = EstimationUtils.getOccupationBins(maxBinId, minBinId, max,
--- End diff --
Personally I prefer to have this method unit-tested, because it's the core part of filter estimation. We can do this in a follow-up anyway.
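The body of `computePercentByEquiHeightHgm` is truncated in the quote, so the following is only a guess at its overall shape for the `LessThan` case: selectivity as bins covered by [min, lit] over bins covered by the valid range [min, max]. All classes and the `lessThanSelectivity` name are simplified, hypothetical stand-ins:

```scala
case class HistogramBin(lo: Double, hi: Double, ndv: Long)
case class Histogram(height: Double, bins: Array[HistogramBin])

def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = 0
  while ((i < bins.length) && (value > bins(i).hi)) i += 1
  i
}

def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = bins.length - 1
  while ((i >= 0) && (value < bins(i).lo)) i -= 1
  i
}

def getOccupation(higherValue: Double, lowerValue: Double, bin: HistogramBin): Double = {
  if (bin.hi == bin.lo) 1.0
  else if (higherValue == lowerValue) 1.0 / bin.ndv.toDouble
  else math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
}

def getOccupationBins(higherId: Int, lowerId: Int, higherEnd: Double,
    lowerEnd: Double, histogram: Histogram): Double = {
  if (lowerId == higherId) {
    getOccupation(higherEnd, lowerEnd, histogram.bins(lowerId))
  } else {
    val lowerBin = histogram.bins(lowerId)
    val higherBin = histogram.bins(higherId)
    getOccupation(lowerBin.hi, lowerEnd, lowerBin) +
      getOccupation(higherEnd, higherBin.lo, higherBin) +
      (higherId - lowerId - 1)
  }
}

// Hypothetical `col < lit` selectivity: bins covered by [min, lit] over
// bins covered by the column's current valid range [min, max].
def lessThanSelectivity(lit: Double, min: Double, max: Double, hist: Histogram): Double = {
  val bins = hist.bins
  val minIdx = findFirstBinForValue(min, bins)
  val maxIdx = findLastBinForValue(max, bins)
  val validRangeBins = getOccupationBins(maxIdx, minIdx, max, min, hist)
  val coveredBins = getOccupationBins(findLastBinForValue(lit, bins), minIdx, lit, min, hist)
  coveredBins / validRangeBins
}

// With two uniform bins, `col < 5` over [0, 20] covers a quarter of the range.
val hist = Histogram(10.0, Array(HistogramBin(0.0, 10.0, 5), HistogramBin(10.0, 20.0, 5)))
println(lessThanSelectivity(5.0, 0.0, 20.0, hist)) // 0.25
```

On a uniform histogram this matches the old proration formula exactly; the two estimates diverge only when bin widths are unequal, i.e. when the data is skewed.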
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155690722
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
+    } else {
+      // We compute filter selectivity using Histogram information.
+      // Here we traverse histogram bins to locate the range of bins the literal value falls
+      // into. For a skewed distribution, a literal value can occupy multiple bins.
+      val hgmBins = colStat.histogram.get.bins
+      val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
--- End diff --
yes, I'll refactor this part.
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155683828
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
@@ -114,4 +114,144 @@ object EstimationUtils {
+  /**
+   * Returns the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
+   * The column value distribution is saved in an equi-height histogram.
+   *
+   * @param higherId id of the high end bin holding the high end value of a column range
+   * @param lowerId id of the low end bin holding the low end value of a column range
+   * @param higherEnd a given upper bound value of a specified column value range
+   * @param lowerEnd a given lower bound value of a specified column value range
+   * @param histogram a numeric equi-height histogram
+   * @return the number of distinct values, ndv, for column values in [lowerEnd, higherEnd].
+   */
+  def getOccupationNdv(
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155571394 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase { test("cbool > false") { validateEstimatedStats( Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)), - Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true), + Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true), --- End diff -- Logically this is wrong, although it's not a big deal to break this case. Is there any way we can fix it?
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155566119 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -471,37 +508,47 @@ case class FilterEstimation(plan: Filter) extends Logging { percent = 1.0 } else { // This is the partial overlap case: - // Without advanced statistics like histogram, we assume uniform data distribution. - // We just prorate the adjusted range over the initial range to compute filter selectivity. - assert(max > min) - percent = op match { -case _: LessThan => - if (numericLiteral == max) { -// If the literal value is right on the boundary, we can minus the part of the -// boundary value (1/ndv). -1.0 - 1.0 / ndv - } else { -(numericLiteral - min) / (max - min) - } -case _: LessThanOrEqual => - if (numericLiteral == min) { -// The boundary value is the only satisfying value. -1.0 / ndv - } else { -(numericLiteral - min) / (max - min) - } -case _: GreaterThan => - if (numericLiteral == min) { -1.0 - 1.0 / ndv - } else { -(max - numericLiteral) / (max - min) - } -case _: GreaterThanOrEqual => - if (numericLiteral == max) { -1.0 / ndv - } else { -(max - numericLiteral) / (max - min) - } + + if (colStat.histogram.isEmpty) { --- End diff -- yea please do it
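The partial-overlap branch quoted in the diff above assumes a uniform distribution when no histogram is available: prorate the literal over [min, max], except that a literal sitting exactly on a range boundary contributes (or excludes) 1/ndv. A hedged Python sketch of that formula; the function name and signature are mine, not Spark's:

```python
def comparison_selectivity(op, lit, col_min, col_max, ndv):
    """Selectivity of `col op lit` under a uniform-distribution assumption,
    for a literal in the partial-overlap case (col_min < col_max)."""
    assert col_min < col_max
    width = col_max - col_min
    if op == "<":
        # literal on the upper boundary: everything except the boundary value
        return 1.0 - 1.0 / ndv if lit == col_max else (lit - col_min) / width
    if op == "<=":
        # literal on the lower boundary: only the boundary value qualifies
        return 1.0 / ndv if lit == col_min else (lit - col_min) / width
    if op == ">":
        return 1.0 - 1.0 / ndv if lit == col_min else (col_max - lit) / width
    if op == ">=":
        return 1.0 / ndv if lit == col_max else (col_max - lit) / width
    raise ValueError(op)
```

For example, with a column ranging over [0, 10] and ndv = 5, `col < 5` is estimated at 0.5, while `col < 10` (literal on the boundary) is estimated at 1 - 1/5 = 0.8.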
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155564302 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + if (colStat.histogram.isEmpty) { +// returns 1/ndv if there is no histogram +Some(1.0 / BigDecimal(ndv)) + } else { +// We compute filter selectivity using Histogram information. +// Here we traverse histogram bins to locate the range of bins the literal values falls +// into. For skewed distribution, a literal value can occupy multiple bins. +val hgmBins = colStat.histogram.get.bins +val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble +var lowerId, higherId = -1 +for (i <- hgmBins.indices) { + // if datum > upperBound, just traverse to next bin + if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i + if (higherId < 0) { +if ((datum < hgmBins(i).hi || i == hgmBins.length - 1) || + ((datum == hgmBins(i).hi) && (datum < hgmBins(i + 1).hi))) { + higherId = i +} + } --- End diff -- how about ``` var lowerId = -1 var highIdFound = false var i = 0 while (i < hgmBins.length || highIdFound) { if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i if (datum >= hgmBins(i).lo) highIdFound = true } val highId = i ``` ---
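As written, the while-loop sketch in the comment above never advances `i` and inverts the exit condition. A working version of the intended scan, sketched in Python with illustrative names (clamped to valid indices, mirroring findFirstBinForValue/findLastBinForValue from the EstimationUtils diff):

```python
from collections import namedtuple

# Stand-in for Spark's HistogramBin (lo/hi bounds, distinct-value count).
HistogramBin = namedtuple("HistogramBin", ["lo", "hi", "ndv"])

def find_bin_range(datum, bins):
    """Return (lower_id, higher_id): the first bin whose hi bound reaches
    `datum` and the last bin whose lo bound does not exceed it. With an
    equi-height histogram, a skewed value can repeat as a bin boundary,
    so the two ids may differ."""
    lower_id = 0
    while lower_id < len(bins) - 1 and datum > bins[lower_id].hi:
        lower_id += 1
    higher_id = len(bins) - 1
    while higher_id > 0 and datum < bins[higher_id].lo:
        higher_id -= 1
    return lower_id, higher_id
```

If the value 5 is a boundary shared by several bins, say (0, 5), (5, 5), (5, 10), the result is (0, 2): the skewed literal occupies all three bins, which is exactly the case the comment in the quoted diff calls out.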
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155562073 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + if (colStat.histogram.isEmpty) { +// returns 1/ndv if there is no histogram +Some(1.0 / BigDecimal(ndv)) + } else { +// We compute filter selectivity using Histogram information. +// Here we traverse histogram bins to locate the range of bins the literal values falls +// into. For skewed distribution, a literal value can occupy multiple bins. +val hgmBins = colStat.histogram.get.bins +val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble +var lowerId, higherId = -1 +for (i <- hgmBins.indices) { --- End diff -- ditto, while loop here.
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155561210 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + if (colStat.histogram.isEmpty) { +// returns 1/ndv if there is no histogram +Some(1.0 / BigDecimal(ndv)) + } else { +// We compute filter selectivity using Histogram information. +// Here we traverse histogram bins to locate the range of bins the literal values falls +// into. For skewed distribution, a literal value can occupy multiple bins. +val hgmBins = colStat.histogram.get.bins +val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble --- End diff -- cc @wzhfy , would you refactor this part to always use Double for CBO computing?
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155560861 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155560523 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155560228 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155560015 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r19543 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,41 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + if (colStat.histogram.isEmpty) { +// returns 1/ndv if there is no histogram +Some(1.0 / BigDecimal(ndv)) + } else { +// We compute filter selectivity using Histogram information. --- End diff -- better to move these to a new method.
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r18939 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r18275 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,144 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column value falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the id of the first bin into which a column value falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i += 1 +} +i + } + + /** + * Returns the number of the last bin into which a column value falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the id of the last bin into which a column value falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -= 1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param bin a single histogram bin + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + higherValue: Double, + lowerValue: Double, + bin: HistogramBin): Double = { +assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi) +if (bin.hi == bin.lo) { + // the entire bin is covered in the range + 1.0 +} else if (higherValue == lowerValue) { + // set percentage to 1/NDV + 1.0 / bin.ndv.toDouble +} else { + // Use proration since the range falls inside this bin. + math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0) +} + } + + /** + * Returns the number of bins for column values in [lowerValue, higherValue]. + * This is an overloaded method. The column value distribution is saved in an + * equi-height histogram. + * + * @param higherId id of the high end bin holding the high end value of a column range + * @param lowerId id of the low end bin holding the low end value of a column range + * @param higherEnd a given upper bound value of a specified column value range + * @param lowerEnd a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the number of bins for column values in [lowerEnd, higherEnd]. + */ + def getOccupationBins( + higherId: Int, + lowerId: Int, + higherEnd: Double, + lowerEnd: Double, + histogram: Histogram): Double = { +if (lowerId == higherId) { + val curBin = histogram.bins(lowerId) + getOccupation(higherEnd, lowerEnd, curBin) +} else { + // compute how much lowerEnd/higherEnd occupies its bin + val lowerCurBin = histogram.bins(lowerId) + val lowerPart = getOccupation(lowerCurBin.hi, lowerEnd, lowerCurBin) + + val higherCurBin = histogram.bins(higherId) + val higherPart = getOccupation(higherEnd, higherCurBin.lo, higherCurBin) + + // the total length is lowerPart + higherPart + bins between them + lowerPart + higherPart + higherId - lowerId - 1 +} + } + + /** + * Returns the number of distinct values, ndv, for column values in [lowerEnd, higherEnd]. 
+ * The column value distribution is saved in an equi-height histogram. + * + * @param higherId id of the high end bin holding the high end value of a column range + * @param lowerId id of the low end bin holding the low end value of a column range + * @param higherEnd a given upper bound value of a specified column value range + * @param lowerEnd a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the number of distinct values, ndv, for column values in [lowerEnd, higherEnd]. + */ + def getOccupationNdv(
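The `getOccupationBins` arithmetic quoted above (the tail of the low bin, plus the head of the high bin, plus the full bins between them) can be checked with a small standalone sketch. `HistogramBin` and `Histogram` here are simplified stand-ins for the Catalyst classes, and the bin values are made up for illustration:

```scala
// Simplified stand-ins for Catalyst's histogram classes (assumed shapes).
case class HistogramBin(lo: Double, hi: Double, ndv: Long)
case class Histogram(height: Double, bins: Array[HistogramBin])

// Per-bin coverage, mirroring the private getOccupation in the diff.
def getOccupation(higherValue: Double, lowerValue: Double, bin: HistogramBin): Double =
  if (bin.hi == bin.lo) 1.0                                   // single-point bin
  else if (higherValue == lowerValue) 1.0 / bin.ndv.toDouble  // point predicate: 1/NDV
  else math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)

// Number of (possibly fractional) bins covered by [lowerEnd, higherEnd].
def getOccupationBins(
    higherId: Int, lowerId: Int,
    higherEnd: Double, lowerEnd: Double,
    histogram: Histogram): Double =
  if (lowerId == higherId) {
    getOccupation(higherEnd, lowerEnd, histogram.bins(lowerId))
  } else {
    val lowerBin = histogram.bins(lowerId)
    val lowerPart = getOccupation(lowerBin.hi, lowerEnd, lowerBin)     // tail of low bin
    val higherBin = histogram.bins(higherId)
    val higherPart = getOccupation(higherEnd, higherBin.lo, higherBin) // head of high bin
    lowerPart + higherPart + higherId - lowerId - 1                    // plus full bins between
  }

val hist = Histogram(100.0,
  Array(HistogramBin(0, 10, 5), HistogramBin(10, 20, 5), HistogramBin(20, 30, 5)))
// [5, 25] covers half of bin 0, all of bin 1, and half of bin 2.
assert(getOccupationBins(2, 0, 25.0, 5.0, hist) == 2.0)
// [12, 14] stays inside bin 1: 2/10 of that bin.
assert(math.abs(getOccupationBins(1, 1, 14.0, 12.0, hist) - 0.2) < 1e-12)
```

Note the fractional result for the single-bin case, which is why the return type is `Double` rather than an integer count.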
[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r17934 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,144 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column value falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the id of the first bin into which a column value falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i += 1 +} +i + } + + /** + * Returns the number of the last bin into which a column value falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the id of the last bin into which a column value falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -= 1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param bin a single histogram bin + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + higherValue: Double, + lowerValue: Double, + bin: HistogramBin): Double = { +assert(bin.lo <= lowerValue && lowerValue <= higherValue && higherValue <= bin.hi) +if (bin.hi == bin.lo) { + // the entire bin is covered in the range + 1.0 +} else if (higherValue == lowerValue) { + // set percentage to 1/NDV + 1.0 / bin.ndv.toDouble +} else { + // Use proration since the range falls inside this bin. + math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0) +} + } + + /** + * Returns the number of bins for column values in [lowerValue, higherValue]. + * This is an overloaded method. The column value distribution is saved in an + * equi-height histogram. + * + * @param higherId id of the high end bin holding the high end value of a column range + * @param lowerId id of the low end bin holding the low end value of a column range + * @param higherEnd a given upper bound value of a specified column value range + * @param lowerEnd a given lower bound value of a specified column value range --- End diff -- do we have an assumption between `higherEnd`/`lowerEnd` and `higherId`/`lowerId`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
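On the question above: the relationship the code appears to rely on (an assumption, not stated in the diff) is that `lowerId`/`higherId` come from `findFirstBinForValue(lowerEnd, ...)` and `findLastBinForValue(higherEnd, ...)`, so each end lies inside its bin's `[lo, hi]` range and `lowerId <= higherId`. A minimal sketch, with `HistogramBin` as a simplified stand-in:

```scala
// Simplified stand-in for Catalyst's HistogramBin (assumed shape).
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = 0
  while (i < bins.length && value > bins(i).hi) i += 1
  i
}

def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  var i = bins.length - 1
  while (i >= 0 && value < bins(i).lo) i -= 1
  i
}

val bins = Array(HistogramBin(0, 10, 5), HistogramBin(10, 20, 5), HistogramBin(20, 30, 5))
val (lowerEnd, higherEnd) = (12.0, 25.0)
val lowerId = findFirstBinForValue(lowerEnd, bins)
val higherId = findLastBinForValue(higherEnd, bins)

// The assumed invariant: each end falls inside its own bin's range.
assert(bins(lowerId).lo <= lowerEnd && lowerEnd <= bins(lowerId).hi)
assert(bins(higherId).lo <= higherEnd && higherEnd <= bins(higherId).hi)
assert(lowerId <= higherId)
```

Under that invariant, the `assert` inside `getOccupation` holds for each partial bin the caller passes in.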
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r17144 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (curBin.hi == curBin.lo) { + // the entire bin is covered in the range + 1.0 +} else if (higherValue == lowerValue) { + // set percentage to 1/NDV + 1.0 / curBin.ndv.toDouble +} else { + // Use proration since the range falls inside this bin. + math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0) +} + } + + /** + * Returns the number of bins for column values in [lowerValue, higherValue]. --- End diff -- i see. Let's explain this in the java doc. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155343962 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (curBin.hi == curBin.lo) { + // the entire bin is covered in the range + 1.0 +} else if (higherValue == lowerValue) { + // set percentage to 1/NDV + 1.0 / curBin.ndv.toDouble +} else { + // Use proration since the range falls inside this bin. + math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0) +} + } + + /** + * Returns the number of bins for column values in [lowerValue, higherValue]. --- End diff -- This is because we may return a percentage of a bin. For example, a predicate column=5 may return the number of bins 0.2 if the holding bin has 5 distinct values. Hence, we cannot return an integer type value. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155233875 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (curBin.hi == curBin.lo) { + // the entire bin is covered in the range + 1.0 +} else if (higherValue == lowerValue) { + // set percentage to 1/NDV + 1.0 / curBin.ndv.toDouble +} else { + // Use proration since the range falls inside this bin. + math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0) +} + } + + /** + * Returns the number of bins for column values in [lowerValue, higherValue]. --- End diff -- `Returns the number of bins...` why the return type is double? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155233734 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (curBin.hi == curBin.lo) { + // the entire bin is covered in the range + 1.0 +} else if (higherValue == lowerValue) { + // set percentage to 1/NDV + 1.0 / curBin.ndv.toDouble +} else { + // Use proration since the range falls inside this bin. + math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0) --- End diff -- shouldn't check the overlapping percentage? Do I miss some assumption for this method? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155233351 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (curBin.hi == curBin.lo) { + // the entire bin is covered in the range + 1.0 +} else if (higherValue == lowerValue) { + // set percentage to 1/NDV + 1.0 / curBin.ndv.toDouble --- End diff -- shouldn't we check the `lowerValue/higherValues` fits in the bin value range? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155233099 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (curBin.hi == curBin.lo) { + // the entire bin is covered in the range + 1.0 --- End diff -- I don't get it, shouldn't we check `lowerValue <= curBin.lo <= higherValue` here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155232397 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. 
+ */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { --- End diff -- the method signature looks weird, shouldn't it be ``` private def getOccupation( higherValue: Double, lowerValue: Double, bin: HistogramBin) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
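A sketch of the suggested refactoring: passing the `HistogramBin` directly keeps `getOccupation` independent of the `Histogram` wrapper, while the id-based form only performs the lookup first. Class shapes here are assumptions for illustration:

```scala
// Simplified stand-ins for Catalyst's histogram classes (assumed shapes).
case class HistogramBin(lo: Double, hi: Double, ndv: Long)
case class Histogram(height: Double, bins: Array[HistogramBin])

// Suggested shape: the caller passes the bin, so no Histogram dependency.
def getOccupation(higherValue: Double, lowerValue: Double, bin: HistogramBin): Double =
  if (bin.hi == bin.lo) 1.0
  else if (higherValue == lowerValue) 1.0 / bin.ndv.toDouble
  else math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)

// Original shape: the method looks up the bin itself, then does the same work.
def getOccupationById(
    binId: Int, higherValue: Double, lowerValue: Double, histogram: Histogram): Double =
  getOccupation(higherValue, lowerValue, histogram.bins(binId))

val hist = Histogram(100.0, Array(HistogramBin(0, 10, 4)))
// Both forms agree; the bin-based one is the smaller unit to test.
assert(getOccupationById(0, 8.0, 3.0, hist) == getOccupation(8.0, 3.0, hist.bins(0)))
```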
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155231792 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 +} +i + } + + /** + * Returns a percentage of a bin holding values for column value in the range of --- End diff -- `a percentage of a bin...` do you mean `percentage of bins`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155231547 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155231606 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = bins.length - 1 +while ((i >= 0) && (value < bins(i).lo)) { + i -=1 --- End diff -- a space after `-=` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155231523 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 +} +i + } + + /** + * Returns the number of the last bin into which a column values falls for a specified --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155231456 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155231340 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var i = 0 +while ((i < bins.length) && (value > bins(i).hi)) { + i +=1 --- End diff -- nit: a space after `+=` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155231264 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,171 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. --- End diff -- nit: `the id of the first bin into which the given value falls.` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155125157 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { --- End diff -- Good point. We can simplify the logic by iterating from bins.length-1 to 0. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r155125029 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 --- End diff -- Good point. Actually while loop is better because it can exit early when the condition no longer qualifies. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154850738 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +for (i <- bins.indices) { + if (value > bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == bins(i).hi) && (i < bins.length - 1) && (value == bins(i + 1).lo)) { +// We assume the above 3 conditions will be evaluated from left to right sequentially. +// If the above 3 conditions are evaluated out-of-order, then out-of-bound error may happen. +// At that time, we should split the third condition into another if statement. 
+// increment binId since the value appears in this bin and next bin +binId += 1 + } +} +binId + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. + */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (binId == 0 && curBin.hi == curBin.lo) { + // the Min of the histogram occupies the whole first bin + 1.0 +} else if (binId == 0 && curBin.hi != curBin.lo) { + if (higherValue == lowerValue) { +// set percentage to 1/NDV +1.0 / curBin.ndv.toDouble + } else { +// Use proration since the range falls inside this bin. +(higherValue - lowerValue) / (curBin.hi - curBin.lo) + } +} else { + if (curBin.hi == curBin.lo) { +// the entire bin is covered in the range +1.0 + } else if (higherValue == lowerValue) { +// set percentage to 1/NDV +1.0 / curBin.ndv.toDouble + } else { +// Use proration since the range falls inside this bin. +math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0) + } +} + } + + /** + * Returns the number of bins for column values in [lowerValue, higherValue]. + * The column value distribution is saved in an equi-height histogram. + * + * @param higherEnd a given upper bound value of a specified column value range + * @param lowerEnd a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the selectivity percentage for column values in [lowerValue, higherValue]. 
+ */ + def getOccupationBins( + higherEnd: Double, + lowerEnd: Double, + histogram: Histogram): Double = { +// find bins where current min and max locate +val lowerBinId = findFirstBinForValue(lowerEnd, histogram.bins) +val higherBinId = findLastBinForValue(higherEnd, histogram.bins) +assert(lowerBinId <= higherBinId) + +// compute how much current [lowerEnd, higherEnd] range occupies the histogram in the +// number of bins +getOccupationBins(higherBinId, lowerBinId, higherEnd, lowerEnd, histogram) + } +
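The `getOccupation` refactoring discussed in the surrounding comments (pass the `HistogramBin` directly instead of `(binId, histogram)`, and fold away the duplicated `binId == 0` branches) can be sketched as follows. This is a hedged sketch of the reviewers' suggestion, not the final Spark implementation; the clamp via `math.min` is applied unconditionally here for simplicity.

```scala
// Assumed bin shape, as in the quoted diff.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

// Fraction of a single bin covered by [lowerValue, higherValue].
def getOccupation(bin: HistogramBin, higherValue: Double, lowerValue: Double): Double = {
  if (bin.hi == bin.lo) {
    // A single-point bin is either fully covered or not reached at all.
    1.0
  } else if (higherValue == lowerValue) {
    // Point predicate: assume the bin's values are uniformly distributed over its ndv.
    1.0 / bin.ndv.toDouble
  } else {
    // Prorate the range against the bin's width, clamped to 1.0.
    math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
  }
}
```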
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154848509 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +for (i <- bins.indices) { + if (value > bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == bins(i).hi) && (i < bins.length - 1) && (value == bins(i + 1).lo)) { +// We assume the above 3 conditions will be evaluated from left to right sequentially. +// If the above 3 conditions are evaluated out-of-order, then out-of-bound error may happen. +// At that time, we should split the third condition into another if statement. 
+// increment binId since the value appears in this bin and next bin +binId += 1 + } +} +binId + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. + */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (binId == 0 && curBin.hi == curBin.lo) { + // the Min of the histogram occupies the whole first bin + 1.0 +} else if (binId == 0 && curBin.hi != curBin.lo) { + if (higherValue == lowerValue) { +// set percentage to 1/NDV +1.0 / curBin.ndv.toDouble + } else { +// Use proration since the range falls inside this bin. +(higherValue - lowerValue) / (curBin.hi - curBin.lo) --- End diff -- why do we need to specialize it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154848428 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +for (i <- bins.indices) { + if (value > bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == bins(i).hi) && (i < bins.length - 1) && (value == bins(i + 1).lo)) { +// We assume the above 3 conditions will be evaluated from left to right sequentially. +// If the above 3 conditions are evaluated out-of-order, then out-of-bound error may happen. +// At that time, we should split the third condition into another if statement. 
+// increment binId since the value appears in this bin and next bin +binId += 1 + } +} +binId + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. + */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (binId == 0 && curBin.hi == curBin.lo) { + // the Min of the histogram occupies the whole first bin + 1.0 +} else if (binId == 0 && curBin.hi != curBin.lo) { + if (higherValue == lowerValue) { +// set percentage to 1/NDV +1.0 / curBin.ndv.toDouble + } else { +// Use proration since the range falls inside this bin. +(higherValue - lowerValue) / (curBin.hi - curBin.lo) --- End diff -- this is the only branch we need to specialize for `binId=0`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154848277 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +for (i <- bins.indices) { + if (value > bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == bins(i).hi) && (i < bins.length - 1) && (value == bins(i + 1).lo)) { +// We assume the above 3 conditions will be evaluated from left to right sequentially. +// If the above 3 conditions are evaluated out-of-order, then out-of-bound error may happen. +// At that time, we should split the third condition into another if statement. 
+// increment binId since the value appears in this bin and next bin +binId += 1 + } +} +binId + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. + */ + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { --- End diff -- instead of accepting a `binId` and `histogram`, can't we just ask the caller side to pass a `HistogramBin`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154848018 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { --- End diff -- Why is this method so different from `findFirstBinForValue`? It looks like we just need to reverse the iteration order, i.e. from `bins.length` to `0`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154847802 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 --- End diff -- this looks more like a while loop pattern, can we use while loop here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154847703 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,194 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param bins an array of bins for a given numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { +var binId = 0 +bins.foreach { bin => + if (value > bin.hi) binId += 1 --- End diff -- shall we use binary search here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
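The binary-search variant the reviewer raises here would be a lower-bound search over the (sorted) bin upper bounds: find the first bin whose `hi` is at least the value. A sketch, under the same assumed `HistogramBin` shape as above:

```scala
// Assumed bin shape from the diff.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

// Index of the first bin with bins(i).hi >= value, i.e. the same answer as the
// linear scan "while value > bins(i).hi", in O(log n) instead of O(n).
// Returns bins.length when value exceeds every bin's upper bound.
def findFirstBinBinarySearch(value: Double, bins: Array[HistogramBin]): Int = {
  var lo = 0
  var hi = bins.length // search invariant: answer is in [lo, hi]
  while (lo < hi) {
    val mid = (lo + hi) / 2
    if (bins(mid).hi < value) lo = mid + 1 else hi = mid
  }
  lo
}
```

With the default of at most 254 bins, the linear scan is already cheap, which is presumably why a while loop with early exit was considered sufficient.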
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154270654 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +histogram.bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + + def findLastBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +for (i <- 0 until histogram.bins.length) { + if (value > histogram.bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) { +if (value == histogram.bins(i + 1).lo) { --- End diff -- just move this condition after the length check: ``` if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1) && (value == histogram.bins(i + 1).lo)) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154255068 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -578,6 +590,112 @@ class FilterEstimationSuite extends StatsEstimationTestBase { expectedRowCount = 5) } + // The following test cases have histogram information collected for the test column + test("Not(cintHgm < 3 AND null)") { +val condition = Not(And(LessThan(attrIntHgm, Literal(3)), Literal(null, IntegerType))) +validateEstimatedStats( + Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> colStatIntHgm.copy(distinctCount = 6)), + expectedRowCount = 9) + } + + test("cintHgm = 5") { +validateEstimatedStats( + Filter(EqualTo(attrIntHgm, Literal(5)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(5), max = Some(5), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 4) + } + + test("cintHgm = 0") { +// This is an out-of-range case since 0 is outside the range [min, max] +validateEstimatedStats( + Filter(EqualTo(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm < 3") { +validateEstimatedStats( + Filter(LessThan(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 2) + } + + test("cintHgm < 0") { +// This is a corner case since literal 0 is smaller than min. 
+validateEstimatedStats( + Filter(LessThan(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm <= 3") { +validateEstimatedStats( + Filter(LessThanOrEqual(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 2) + } + + test("cintHgm > 6") { +validateEstimatedStats( + Filter(GreaterThan(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 2, min = Some(6), max = Some(10), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 2) + } + + test("cintHgm > 10") { +// This is a corner case since max value is 10. +validateEstimatedStats( + Filter(GreaterThan(attrIntHgm, Literal(10)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm >= 6") { +validateEstimatedStats( + Filter(GreaterThanOrEqual(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 3, min = Some(6), max = Some(10), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 4) + } + + test("cintHgm IS NULL") { +validateEstimatedStats( + Filter(IsNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm IS NOT NULL") { +validateEstimatedStats( + Filter(IsNotNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 6, min = Some(1), max = Some(10), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 10) + } + + test("cintHgm > 3 AND cintHgm <= 6") { +val condition = And(GreaterThan(attrIntHgm, + Literal(3)), LessThanOrEqual(attrIntHgm, Literal(6))) +validateEstimatedStats( + Filter(condition, 
childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 5, min = Some(3), max = Some(6), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 8) + } + + test("cintHgm = 3 OR cintHgm = 6") { --- End diff -- We have added histogram test cases for skewed distribution. I will add more histogram test cases for non-skewed distribution. --- - To unsubscribe,
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154254419 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase { test("cbool > false") { validateEstimatedStats( Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)), - Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true), + Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true), --- End diff -- My earlier comment mentioned this test case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154254145 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -784,11 +879,16 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) { def outputColumnStats(rowsBeforeFilter: BigInt, rowsAfterFilter: BigInt) : AttributeMap[ColumnStat] = { val newColumnStats = originalMap.map { case (attr, oriColStat) => - // Update ndv based on the overall filter selectivity: scale down ndv if the number of rows - // decreases; otherwise keep it unchanged. - val newNdv = EstimationUtils.updateNdv(oldNumRows = rowsBeforeFilter, -newNumRows = rowsAfterFilter, oldNdv = oriColStat.distinctCount) val colStat = updatedMap.get(attr.exprId).map(_._2).getOrElse(oriColStat) + val newNdv = if (colStat.distinctCount > 1) { --- End diff -- The old code does not work well for a couple of new skewed-distribution tests. For example, test("cintHgm < 3") would fail. Because it still computes to find newNdv in updateNdv() method. But, in reality, we already scale it down to 1. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154252063 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -513,10 +560,9 @@ case class FilterEstimation(plan: Filter) extends Logging { op match { case _: GreaterThan | _: GreaterThanOrEqual => -// If new ndv is 1, then new max must be equal to new min. -newMin = if (newNdv == 1) newMax else newValue +newMin = newValue case _: LessThan | _: LessThanOrEqual => -newMax = if (newNdv == 1) newMin else newValue +newMax = newValue --- End diff -- Previously I coded that way because of a corner test case: test("cbool > false"). At that time, I set the newMin to newMax since newNdv = 1. However, this logic does not work well for the skewed distribution test case: test ("cintHgm < 3"). In this test, newMin=1 newMax=3. I think the revised code makes better sense. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
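The revised bound update described here reduces to: the literal simply becomes the new min (for `>` / `>=`) or the new max (for `<` / `<=`), with the old "collapse min to max when ndv == 1" rule dropped, since with a histogram (as in `cintHgm < 3`) min and max can legitimately differ even when the estimated ndv is 1. A minimal sketch with hypothetical operator names:

```scala
// Hypothetical operator tags standing in for Catalyst's comparison expressions.
sealed trait CmpOp
case object Gt extends CmpOp // covers GreaterThan / GreaterThanOrEqual
case object Lt extends CmpOp // covers LessThan / LessThanOrEqual

// Returns the updated (min, max) after applying `col op value` to a column
// whose current range is [oldMin, oldMax].
def updateBounds(op: CmpOp, oldMin: Double, oldMax: Double, value: Double): (Double, Double) =
  op match {
    case Gt => (value, oldMax) // value becomes the new lower bound
    case Lt => (oldMin, value) // value becomes the new upper bound
  }
```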
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154250197 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + // We compute filter selectivity using Histogram information + attr.dataType match { +case StringType | BinaryType => + Some(1.0 / BigDecimal(ndv)) + +case _ => + // returns 1/ndv if there is no histogram + if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv)) + + // We traverse histogram bins to locate the literal value + val hgmBins = colStat.histogram.get.bins + val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble + // find the interval where this datum locates + var lowerId, higherId = -1 + for (i <- hgmBins.indices) { +// if datum > upperBound, just move to next bin +if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i +if (higherId < 0) { + if ((datum < hgmBins(i).hi || i == hgmBins.length - 1) || +((datum == hgmBins(i).hi) && (datum < hgmBins(i + 1).hi))) { +higherId = i + } +} + } + assert(lowerId <= higherId) + val lowerBinNdv = hgmBins(lowerId).ndv + val higherBinNdv = hgmBins(higherId).ndv + // assume uniform distribution in each bin + val percent = if (lowerId == higherId) { +(1.0 / hgmBins.length) / math.max(lowerBinNdv, 1) + } else { +1.0 / hgmBins.length * (higherId - lowerId - 1) + + (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1) + + (1.0 / hgmBins.length) / math.max(higherBinNdv, 1) + } + Some(percent) --- End diff -- Good point. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
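The equality-selectivity formula quoted in this diff can be isolated into a small sketch: each bin holds `1 / numBins` of the rows, so a literal falling in a single bin contributes `(1 / numBins) / ndv` of the table, while a skewed literal spanning several bins gets the middle bins in full plus a `1 / ndv` share of each boundary bin. This sketch assumes uniform distribution within a bin, as the diff does.

```scala
// Assumed bin shape from the diff.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

// Selectivity of `col = literal` given the first and last bin ids the literal
// falls into (lowerId <= higherId, both valid indices into bins).
def equalitySelectivity(lowerId: Int, higherId: Int, bins: Array[HistogramBin]): Double = {
  val binShare = 1.0 / bins.length // each equi-height bin holds this row fraction
  if (lowerId == higherId) {
    binShare / math.max(bins(lowerId).ndv, 1L)
  } else {
    // Middle bins are covered entirely by the skewed value; the two boundary
    // bins contribute a 1/ndv share each.
    binShare * (higherId - lowerId - 1) +
      binShare / math.max(bins(lowerId).ndv, 1L) +
      binShare / math.max(bins(higherId).ndv, 1L)
  }
}
```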
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154249995 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +histogram.bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + + def findLastBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +for (i <- 0 until histogram.bins.length) { + if (value > histogram.bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) { +if (value == histogram.bins(i + 1).lo) { --- End diff -- No. I meant the upper bound for the array of bins in a histogram. The default length of the histogram bin array is 254. When i is equal to 253 (the last bin), then i+1 is 254 leading to out-of-bound error. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154248775 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +histogram.bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + + def findLastBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +for (i <- 0 until histogram.bins.length) { + if (value > histogram.bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) { +if (value == histogram.bins(i + 1).lo) { --- End diff -- By "out of bound", do you mean it exceeds 100 length limit? You can just switch new line after `&&` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154248457

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0

--- End diff --

If a user changes the data, statistics will be removed or re-collected (only size, currently); Spark already implements this.
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154225069

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    histogram.bins.foreach { bin =>
+      if (value > bin.hi) binId += 1
+    }
+    binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the last bin into which a column value falls.
+   */
+
+  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    for (i <- 0 until histogram.bins.length) {
+      if (value > histogram.bins(i).hi) {
+        // increment binId to point to next bin
+        binId += 1
+      }
+      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
+        if (value == histogram.bins(i + 1).lo) {

--- End diff --

I used two statements instead of one because, when i points to the last bin, the condition `value == histogram.bins(i + 1).lo` may go out of bounds. By separating the conditions into two statements, we can be sure that the out-of-bounds error will not happen.
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154223769

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0

--- End diff --

same comment as in my last reply.
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r154223705

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0

--- End diff --

I hesitate to add an assert statement here, because an assert like this may crash the Spark system if a user does not refresh his data statistics promptly. In the real world, a user may load data, collect statistics, and then add more incremental data without collecting statistics immediately. He may issue a SQL query against the newly added data, such as "WHERE column=xxx", where xxx is a new value from the incremental load. After all, statistics are auxiliary; a query should still run even if the statistics are not up to date.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153975383

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -529,6 +575,55 @@ case class FilterEstimation(plan: Filter) extends Logging {
     Some(percent)
   }
 
+  /**
+   * Returns the selectivity percentage for a combined op-dataNumber in the column's
+   * current valid range [min, max]

--- End diff --

Returns the selectivity percentage for a binary condition given the column's current value range.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153973312

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -471,37 +508,47 @@ case class FilterEstimation(plan: Filter) extends Logging {
       percent = 1.0
     } else {
       // This is the partial overlap case:
-      // Without advanced statistics like histogram, we assume uniform data distribution.
-      // We just prorate the adjusted range over the initial range to compute filter selectivity.
-      assert(max > min)
-      percent = op match {
-        case _: LessThan =>
-          if (numericLiteral == max) {
-            // If the literal value is right on the boundary, we can minus the part of the
-            // boundary value (1/ndv).
-            1.0 - 1.0 / ndv
-          } else {
-            (numericLiteral - min) / (max - min)
-          }
-        case _: LessThanOrEqual =>
-          if (numericLiteral == min) {
-            // The boundary value is the only satisfying value.
-            1.0 / ndv
-          } else {
-            (numericLiteral - min) / (max - min)
-          }
-        case _: GreaterThan =>
-          if (numericLiteral == min) {
-            1.0 - 1.0 / ndv
-          } else {
-            (max - numericLiteral) / (max - min)
-          }
-        case _: GreaterThanOrEqual =>
-          if (numericLiteral == max) {
-            1.0 / ndv
-          } else {
-            (max - numericLiteral) / (max - min)
-          }
+
+      if (colStat.histogram.isEmpty) {

--- End diff --

move this `if` to the upper level, then we can reduce the code diff
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153974257

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala ---

@@ -578,6 +590,112 @@ class FilterEstimationSuite extends StatsEstimationTestBase {
     expectedRowCount = 5)
   }
 
+  // The following test cases have histogram information collected for the test column
+  test("Not(cintHgm < 3 AND null)") {
+    val condition = Not(And(LessThan(attrIntHgm, Literal(3)), Literal(null, IntegerType)))
+    validateEstimatedStats(
+      Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> colStatIntHgm.copy(distinctCount = 6)),
+      expectedRowCount = 9)
+  }
+
+  test("cintHgm = 5") {
+    validateEstimatedStats(
+      Filter(EqualTo(attrIntHgm, Literal(5)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(5), max = Some(5),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 4)
+  }
+
+  test("cintHgm = 0") {
+    // This is an out-of-range case since 0 is outside the range [min, max]
+    validateEstimatedStats(
+      Filter(EqualTo(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm < 3") {
+    validateEstimatedStats(
+      Filter(LessThan(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 2)
+  }
+
+  test("cintHgm < 0") {
+    // This is a corner case since literal 0 is smaller than min.
+    validateEstimatedStats(
+      Filter(LessThan(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm <= 3") {
+    validateEstimatedStats(
+      Filter(LessThanOrEqual(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 2)
+  }
+
+  test("cintHgm > 6") {
+    validateEstimatedStats(
+      Filter(GreaterThan(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 2, min = Some(6), max = Some(10),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 2)
+  }
+
+  test("cintHgm > 10") {
+    // This is a corner case since max value is 10.
+    validateEstimatedStats(
+      Filter(GreaterThan(attrIntHgm, Literal(10)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm >= 6") {
+    validateEstimatedStats(
+      Filter(GreaterThanOrEqual(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 3, min = Some(6), max = Some(10),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 4)
+  }
+
+  test("cintHgm IS NULL") {
+    validateEstimatedStats(
+      Filter(IsNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Nil,
+      expectedRowCount = 0)
+  }
+
+  test("cintHgm IS NOT NULL") {
+    validateEstimatedStats(
+      Filter(IsNotNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 6, min = Some(1), max = Some(10),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 10)
+  }
+
+  test("cintHgm > 3 AND cintHgm <= 6") {
+    val condition = And(GreaterThan(attrIntHgm,
+      Literal(3)), LessThanOrEqual(attrIntHgm, Literal(6)))
+    validateEstimatedStats(
+      Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)),
+      Seq(attrIntHgm -> ColumnStat(distinctCount = 5, min = Some(3), max = Some(6),
+        nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))),
+      expectedRowCount = 8)
+  }
+
+  test("cintHgm = 3 OR cintHgm = 6") {

--- End diff --

I think we don't need test cases for combination conditions like AND, OR, NOT, because histogram doesn't affect estimation logic for them. Instead, we need to test more cases for histogram, e.g. =, >=, >, <=, <, and for
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153972336

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
         colStatsMap.update(attr, newStats)
       }
 
-      Some(1.0 / BigDecimal(ndv))
-    } else {
+      // We compute filter selectivity using Histogram information
+      attr.dataType match {
+        case StringType | BinaryType =>
+          Some(1.0 / BigDecimal(ndv))
+
+        case _ =>
+          // returns 1/ndv if there is no histogram
+          if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv))
+
+          // We traverse histogram bins to locate the literal value
+          val hgmBins = colStat.histogram.get.bins
+          val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
+          // find the interval where this datum locates

--- End diff --

we can remove this comment, it's explained above
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153973342

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -471,37 +508,47 @@ case class FilterEstimation(plan: Filter) extends Logging {
       percent = 1.0
     } else {

--- End diff --

`else if (colStat.histogram.isEmpty)`
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153976879

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -529,6 +575,55 @@ case class FilterEstimation(plan: Filter) extends Logging {
     Some(percent)
   }
 
+  /**
+   * Returns the selectivity percentage for a combined op-dataNumber in the column's
+   * current valid range [min, max]
+   *
+   * @param op a binary comparison operator
+   * @param histogram a numeric equi-height histogram
+   * @param max the upper bound of the current valid range for a given column
+   * @param min the lower bound of the current valid range for a given column
+   * @param datumNumber the numeric value of a literal
+   * @return the selectivity percentage for a condition in the current range.
+   */
+
+  def computePercentForNumericEquiHeightHgm(

--- End diff --

`computePercentByEquiHeightHgm` or just `computePercentByHistogram`
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153972295

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
         colStatsMap.update(attr, newStats)
       }
 
-      Some(1.0 / BigDecimal(ndv))
-    } else {
+      // We compute filter selectivity using Histogram information
+      attr.dataType match {
+        case StringType | BinaryType =>
+          Some(1.0 / BigDecimal(ndv))
+
+        case _ =>
+          // returns 1/ndv if there is no histogram
+          if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv))
+
+          // We traverse histogram bins to locate the literal value

--- End diff --

This comment is not accurate; here we want to get the bins occupied by the literal value, because if the value is skewed, it can occupy multiple bins.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153977915

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    histogram.bins.foreach { bin =>
+      if (value > bin.hi) binId += 1
+    }
+    binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the last bin into which a column value falls.
+   */
+
+  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    for (i <- 0 until histogram.bins.length) {
+      if (value > histogram.bins(i).hi) {
+        // increment binId to point to next bin
+        binId += 1
+      }
+      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
+        if (value == histogram.bins(i + 1).lo) {

--- End diff --

merge the two `if`s:

if ((value == histogram.bins(i).hi) && (value == histogram.bins(i + 1).lo) &&
    (i < histogram.bins.length - 1))
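A side note on merging these conditions: `&&` in Scala short-circuits left to right, so the index bound check has to appear before the `bins(i + 1)` access, which is exactly the out-of-bounds concern raised earlier in this thread. A small sketch (simplified `HistogramBin`, hypothetical helper name) showing the safe ordering:

```scala
// Simplified stand-in for Spark's HistogramBin.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

// True when `value` sits exactly on the boundary between bin i and bin i + 1.
// The bound check comes first, so bins(i + 1) is only evaluated when i + 1
// is a valid index; && short-circuits and never touches it for the last bin.
def spansIntoNextBin(i: Int, value: Double, bins: Array[HistogramBin]): Boolean = {
  i < bins.length - 1 && value == bins(i).hi && value == bins(i + 1).lo
}
```

With this ordering the two nested `if`s can be merged into one condition without reintroducing the out-of-bounds access.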
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153977791

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    histogram.bins.foreach { bin =>
+      if (value > bin.hi) binId += 1
+    }
+    binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the last bin into which a column value falls.
+   */
+
+  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0

--- End diff --

add assert here too
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153978787

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+

--- End diff --

could you remove the empty line between the method comment and its definition? same for other methods here.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153976643

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {

--- End diff --

we can just pass the bin array as a parameter, this can simplify the code. same for other methods.
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153972847

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
         colStatsMap.update(attr, newStats)
       }
 
-      Some(1.0 / BigDecimal(ndv))
-    } else {
+      // We compute filter selectivity using Histogram information
+      attr.dataType match {
+        case StringType | BinaryType =>
+          Some(1.0 / BigDecimal(ndv))
+
+        case _ =>
+          // returns 1/ndv if there is no histogram
+          if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv))
+
+          // We traverse histogram bins to locate the literal value
+          val hgmBins = colStat.histogram.get.bins
+          val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
+          // find the interval where this datum locates
+          var lowerId, higherId = -1
+          for (i <- hgmBins.indices) {
+            // if datum > upperBound, just move to next bin
+            if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i
+            if (higherId < 0) {
+              if ((datum < hgmBins(i).hi || i == hgmBins.length - 1) ||
+                  ((datum == hgmBins(i).hi) && (datum < hgmBins(i + 1).hi))) {
+                higherId = i
+              }
+            }
+          }
+          assert(lowerId <= higherId)
+          val lowerBinNdv = hgmBins(lowerId).ndv
+          val higherBinNdv = hgmBins(higherId).ndv
+          // assume uniform distribution in each bin
+          val percent = if (lowerId == higherId) {
+            (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1)
+          } else {
+            1.0 / hgmBins.length * (higherId - lowerId - 1) +
+              (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1) +
+              (1.0 / hgmBins.length) / math.max(higherBinNdv, 1)
+          }
+          Some(percent)

--- End diff --

How about simplifying the above logic as:
```
val occupiedBins = if (lowerId == higherId) {
  1.0 / lowerBinNdv
} else {
  (higherId - lowerId - 1) + 1.0 / lowerBinNdv + 1.0 / higherBinNdv
}
Some(occupiedBins / hgmBins.length)
```
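The simplification suggested above can be sketched as a standalone function. The `equalitySelectivity` name and the simplified `HistogramBin` are illustrative only (the PR inlines this logic in `FilterEstimation`); each bin of an equi-height histogram holds 1/`bins.length` of the rows, and values inside a bin are assumed uniformly distributed over its ndv:

```scala
// Simplified stand-in for Spark's HistogramBin.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

// Selectivity of `column = c` when c occupies bins lowerId..higherId.
// Interior bins are fully occupied by c (skew case); the two boundary
// bins each contribute 1/ndv of a bin.
def equalitySelectivity(lowerId: Int, higherId: Int, bins: Array[HistogramBin]): Double = {
  val lowerBinNdv = math.max(bins(lowerId).ndv.toDouble, 1)
  val higherBinNdv = math.max(bins(higherId).ndv.toDouble, 1)
  val occupiedBins =
    if (lowerId == higherId) 1.0 / lowerBinNdv
    else (higherId - lowerId - 1) + 1.0 / lowerBinNdv + 1.0 / higherBinNdv
  occupiedBins / bins.length
}
```

For a value confined to one bin with ndv 2 in a 4-bin histogram this gives (1/2)/4 = 0.125, matching the `percent` computed by the longer form in the quoted diff.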
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153972517

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---

@@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging {
         colStatsMap.update(attr, newStats)
       }
 
-      Some(1.0 / BigDecimal(ndv))
-    } else {
+      // We compute filter selectivity using Histogram information
+      attr.dataType match {
+        case StringType | BinaryType =>
+          Some(1.0 / BigDecimal(ndv))
+
+        case _ =>
+          // returns 1/ndv if there is no histogram
+          if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv))
+
+          // We traverse histogram bins to locate the literal value
+          val hgmBins = colStat.histogram.get.bins
+          val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble
+          // find the interval where this datum locates
+          var lowerId, higherId = -1
+          for (i <- hgmBins.indices) {
+            // if datum > upperBound, just move to next bin

--- End diff --

please remove the comment, it does not match the logic at the next line (there's no "move" logic)
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153979575

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    histogram.bins.foreach { bin =>
+      if (value > bin.hi) binId += 1
+    }
+    binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the last bin into which a column value falls.
+   */
+
+  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    for (i <- 0 until histogram.bins.length) {
+      if (value > histogram.bins(i).hi) {
+        // increment binId to point to next bin
+        binId += 1
+      }
+      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
+        if (value == histogram.bins(i + 1).lo) {
+          // increment binId since the value appears in this bin and the next bin
+          binId += 1
+        }
+      }
+    }
+    binId
+  }
+
+  /**
+   * Returns a percentage of a bin holding values for column value in the range of
+   * [lowerValue, higherValue]
+   *
+   * @param binId a given bin id in a specified histogram
+   * @param higherValue a given upper bound value of a specified column value range
+   * @param lowerValue a given lower bound value of a specified column value range
+   * @param histogram a numeric equi-height histogram
+   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
+   */
+
+  private def getOccupation(
+      binId: Int,
+      higherValue: Double,
+      lowerValue: Double,
+      histogram: Histogram): Double = {
+    val curBin = histogram.bins(binId)
+    if (binId == 0 && curBin.hi == curBin.lo) {
+      // the Min of the histogram occupies the whole first bin
+      1.0
+    } else if (binId == 0 && curBin.hi != curBin.lo) {
+      if (higherValue == lowerValue) {
+        // in the case curBin.binNdv == 0, current bin is occupied by one value, which
+        // is included in the previous bin
+        1.0 / math.max(curBin.ndv.toDouble, 1)
+      } else {
+        (higherValue - lowerValue) / (curBin.hi - curBin.lo)
+      }
+    } else {
+      if (curBin.hi == curBin.lo) {
+        // the entire bin is covered in the range
+        1.0
+      } else if (higherValue == lowerValue) {
+        // the literal value falls in this bin
+        1.0 / math.max(curBin.ndv.toDouble, 1)
+      } else {
+        // Use proration since the range falls inside this bin.
+        math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0)
+      }
+    }
+  }
+
+  /**
+   * Returns the number of bins for column values in [lowerValue, higherValue].
+   * The column value distribution is saved in an equi-height histogram.
+   *
+   * @param higherEnd a given upper bound value of a specified column value range
+   * @param lowerEnd a given lower bound value of a specified column value range
+   * @param histogram a numeric equi-height histogram
+   * @return the selectivity percentage for column values in [lowerValue, higherValue].
+   */
+
+  def getOccupationBins(
+      higherEnd: Double,
+      lowerEnd: Double,
+      histogram: Histogram): Double = {
+    // find bins where current min and max locate
+    val minBinId = findFirstBinForValue(lowerEnd, histogram)
+    val maxBinId = findLastBinForValue(higherEnd, histogram)
+    assert(minBinId <= maxBinId)
+
+    // compute how much current [min, max] occupy the histogram, in the number of bins
+    getOccupationBins(maxBinId, minBinId, higherEnd, lowerEnd, histogram)
+  }
+
+  /**
+   * Returns the number of bins for column values in [lowerValue, higherValue].
+   * This is an overloaded method. The column value distribution is saved in an
+   * equi-height histogram.
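The core arithmetic of the `getOccupation` helper quoted above, reduced to a standalone sketch (simplified `HistogramBin`; the special-casing of bin 0 in the quoted diff is omitted here for brevity):

```scala
// Simplified stand-in for Spark's HistogramBin.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

// Fraction of a single bin covered by [lowerValue, higherValue], assuming
// values are uniformly distributed inside the bin.
def getOccupation(bin: HistogramBin, higherValue: Double, lowerValue: Double): Double = {
  if (bin.hi == bin.lo) {
    // single-point bin: any range touching it covers it entirely
    1.0
  } else if (higherValue == lowerValue) {
    // a single literal inside the bin takes 1/ndv of the bin
    1.0 / math.max(bin.ndv.toDouble, 1)
  } else {
    // prorate the range over the bin's width, capped at the whole bin
    math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
  }
}
```

`getOccupationBins` then sums these per-bin fractions between the first and last bin ids to express a value range as a (possibly fractional) number of occupied bins.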
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153979438

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala ---

@@ -114,4 +114,197 @@ object EstimationUtils {
     }
   }
 
+  /**
+   * Returns the number of the first bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the first bin into which a column value falls.
+   */
+
+  def findFirstBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    histogram.bins.foreach { bin =>
+      if (value > bin.hi) binId += 1
+    }
+    binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column value falls for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param histogram a numeric equi-height histogram
+   * @return the number of the last bin into which a column value falls.
+   */
+
+  def findLastBinForValue(value: Double, histogram: Histogram): Int = {
+    var binId = 0
+    for (i <- 0 until histogram.bins.length) {
+      if (value > histogram.bins(i).hi) {
+        // increment binId to point to next bin
+        binId += 1
+      }
+      if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) {
+        if (value == histogram.bins(i + 1).lo) {
+          // increment binId since the value appears in this bin and the next bin
+          binId += 1
+        }
+      }
+    }
+    binId
+  }
+
+  /**
+   * Returns a percentage of a bin holding values for column value in the range of
+   * [lowerValue, higherValue]
+   *
+   * @param binId a given bin id in a specified histogram
+   * @param higherValue a given upper bound value of a specified column value range
+   * @param lowerValue a given lower bound value of a specified column value range
+   * @param histogram a numeric equi-height histogram
+   * @return the percentage of a single bin holding values in [lowerValue, higherValue].
+   */
+
+  private def getOccupation(
+      binId: Int,
+      higherValue: Double,
+      lowerValue: Double,
+      histogram: Histogram): Double = {
+    val curBin = histogram.bins(binId)
+    if (binId == 0 && curBin.hi == curBin.lo) {
+      // the Min of the histogram occupies the whole first bin
+      1.0
+    } else if (binId == 0 && curBin.hi != curBin.lo) {
+      if (higherValue == lowerValue) {
+        // in the case curBin.binNdv == 0, current bin is occupied by one value, which

--- End diff --

binNdv will never be zero
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153976662 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +histogram.bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + + def findLastBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +for (i <- 0 until histogram.bins.length) { --- End diff -- i <- bins.indices --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153970902 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + // We compute filter selectivity using Histogram information --- End diff -- move this comment to where the histogram computation really starts
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153975698 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -784,11 +879,16 @@ case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat]) { def outputColumnStats(rowsBeforeFilter: BigInt, rowsAfterFilter: BigInt) : AttributeMap[ColumnStat] = { val newColumnStats = originalMap.map { case (attr, oriColStat) => - // Update ndv based on the overall filter selectivity: scale down ndv if the number of rows - // decreases; otherwise keep it unchanged. - val newNdv = EstimationUtils.updateNdv(oldNumRows = rowsBeforeFilter, -newNumRows = rowsAfterFilter, oldNdv = oriColStat.distinctCount) val colStat = updatedMap.get(attr.exprId).map(_._2).getOrElse(oriColStat) + val newNdv = if (colStat.distinctCount > 1) { --- End diff -- no need to add extra check here, in `EstimationUtils.updateNdv` we already check this case. We can just revert the change here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153973781 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -578,6 +590,112 @@ class FilterEstimationSuite extends StatsEstimationTestBase { expectedRowCount = 5) } + // The following test cases have histogram information collected for the test column + test("Not(cintHgm < 3 AND null)") { +val condition = Not(And(LessThan(attrIntHgm, Literal(3)), Literal(null, IntegerType))) +validateEstimatedStats( + Filter(condition, childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> colStatIntHgm.copy(distinctCount = 6)), + expectedRowCount = 9) + } + + test("cintHgm = 5") { +validateEstimatedStats( + Filter(EqualTo(attrIntHgm, Literal(5)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(5), max = Some(5), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 4) + } + + test("cintHgm = 0") { +// This is an out-of-range case since 0 is outside the range [min, max] +validateEstimatedStats( + Filter(EqualTo(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm < 3") { +validateEstimatedStats( + Filter(LessThan(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 2) + } + + test("cintHgm < 0") { +// This is a corner case since literal 0 is smaller than min. 
+validateEstimatedStats( + Filter(LessThan(attrIntHgm, Literal(0)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm <= 3") { +validateEstimatedStats( + Filter(LessThanOrEqual(attrIntHgm, Literal(3)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 1, min = Some(1), max = Some(3), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 2) + } + + test("cintHgm > 6") { +validateEstimatedStats( + Filter(GreaterThan(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 2, min = Some(6), max = Some(10), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 2) + } + + test("cintHgm > 10") { +// This is a corner case since max value is 10. +validateEstimatedStats( + Filter(GreaterThan(attrIntHgm, Literal(10)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm >= 6") { +validateEstimatedStats( + Filter(GreaterThanOrEqual(attrIntHgm, Literal(6)), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Seq(attrIntHgm -> ColumnStat(distinctCount = 3, min = Some(6), max = Some(10), +nullCount = 0, avgLen = 4, maxLen = 4, histogram = Some(hgmInt))), + expectedRowCount = 4) + } + + test("cintHgm IS NULL") { +validateEstimatedStats( + Filter(IsNull(attrIntHgm), childStatsTestPlan(Seq(attrIntHgm), 10L)), + Nil, + expectedRowCount = 0) + } + + test("cintHgm IS NOT NULL") { --- End diff -- histogram does not affect estimation logic for `IS NULL` and `IS NOT NULL` filter conditions, we can remove these two test cases. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153977299 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 --- End diff -- for robustness, add `assert(bins.head.lo <= value && bins.last.hi >= value)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153977529 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +histogram.bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the last bin into which a column values falls. + */ + + def findLastBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +for (i <- 0 until histogram.bins.length) { + if (value > histogram.bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) { +if (value == histogram.bins(i + 1).lo) { + // increment binId since the value appears into this bin and next bin --- End diff -- appears in both this bin and the next bin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153974940 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -471,37 +508,47 @@ case class FilterEstimation(plan: Filter) extends Logging { percent = 1.0 } else { // This is the partial overlap case: - // Without advanced statistics like histogram, we assume uniform data distribution. - // We just prorate the adjusted range over the initial range to compute filter selectivity. - assert(max > min) - percent = op match { -case _: LessThan => - if (numericLiteral == max) { -// If the literal value is right on the boundary, we can minus the part of the -// boundary value (1/ndv). -1.0 - 1.0 / ndv - } else { -(numericLiteral - min) / (max - min) - } -case _: LessThanOrEqual => - if (numericLiteral == min) { -// The boundary value is the only satisfying value. -1.0 / ndv - } else { -(numericLiteral - min) / (max - min) - } -case _: GreaterThan => - if (numericLiteral == min) { -1.0 - 1.0 / ndv - } else { -(max - numericLiteral) / (max - min) - } -case _: GreaterThanOrEqual => - if (numericLiteral == max) { -1.0 / ndv - } else { -(max - numericLiteral) / (max - min) - } + + if (colStat.histogram.isEmpty) { +// Without advanced statistics like histogram, we assume uniform data distribution. +// We just prorate the adjusted range over the initial range to compute filter selectivity. +assert(max > min) +percent = op match { + case _: LessThan => +if (numericLiteral == max) { + // If the literal value is right on the boundary, we can minus the part of the + // boundary value (1/ndv). + 1.0 - 1.0 / ndv +} else { + (numericLiteral - min) / (max - min) +} + case _: LessThanOrEqual => +if (numericLiteral == min) { + // The boundary value is the only satisfying value. 
+ 1.0 / ndv +} else { + (numericLiteral - min) / (max - min) +} + case _: GreaterThan => +if (numericLiteral == min) { + 1.0 - 1.0 / ndv +} else { + (max - numericLiteral) / (max - min) +} + case _: GreaterThanOrEqual => +if (numericLiteral == max) { + 1.0 / ndv +} else { + (max - numericLiteral) / (max - min) +} +} + } else { +val numericHistogram = colStat.histogram.get +val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble +val maxDouble = EstimationUtils.toDecimal(colStat.max.get, literal.dataType).toDouble +val minDouble = EstimationUtils.toDecimal(colStat.min.get, literal.dataType).toDouble --- End diff -- `max, min` is good enough I think --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
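The no-histogram branch in the diff above prorates the literal over the `[min, max]` range under a uniform-distribution assumption, with `1/ndv` adjustments at the boundaries. A minimal Python sketch of that selectivity formula (illustrative only; the real code works on Catalyst expressions, not strings):

```python
def range_selectivity(op, lit, mn, mx, ndv):
    """Estimate P(col op lit) assuming uniform data in [mn, mx].

    op is one of "<", "<=", ">", ">=": a stand-in for the Catalyst
    comparison operators in the partial-overlap case (mn < lit < mx,
    or lit on a boundary).
    """
    assert mx > mn
    if op == "<":
        # If lit is exactly max, subtract the boundary value's share (1/ndv).
        return 1.0 - 1.0 / ndv if lit == mx else (lit - mn) / (mx - mn)
    if op == "<=":
        # If lit is exactly min, the boundary value is the only match.
        return 1.0 / ndv if lit == mn else (lit - mn) / (mx - mn)
    if op == ">":
        return 1.0 - 1.0 / ndv if lit == mn else (mx - lit) / (mx - mn)
    if op == ">=":
        return 1.0 / ndv if lit == mx else (mx - lit) / (mx - mn)
    raise ValueError(op)
```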
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153973474 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -513,10 +560,9 @@ case class FilterEstimation(plan: Filter) extends Logging { op match { case _: GreaterThan | _: GreaterThanOrEqual => -// If new ndv is 1, then new max must be equal to new min. -newMin = if (newNdv == 1) newMax else newValue +newMin = newValue case _: LessThan | _: LessThanOrEqual => -newMax = if (newNdv == 1) newMin else newValue +newMax = newValue --- End diff -- why were these two lines changed?
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153973544 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -359,7 +371,7 @@ class FilterEstimationSuite extends StatsEstimationTestBase { test("cbool > false") { validateEstimatedStats( Filter(GreaterThan(attrBool, Literal(false)), childStatsTestPlan(Seq(attrBool), 10L)), - Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(true), max = Some(true), + Seq(attrBool -> ColumnStat(distinctCount = 1, min = Some(false), max = Some(true), --- End diff -- why did the result change?
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153972982 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + // We compute filter selectivity using Histogram information + attr.dataType match { +case StringType | BinaryType => + Some(1.0 / BigDecimal(ndv)) + +case _ => + // returns 1/ndv if there is no histogram + if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv)) + + // We traverse histogram bins to locate the literal value + val hgmBins = colStat.histogram.get.bins + val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble + // find the interval where this datum locates + var lowerId, higherId = -1 + for (i <- hgmBins.indices) { +// if datum > upperBound, just move to next bin +if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i +if (higherId < 0) { + if ((datum < hgmBins(i).hi || i == hgmBins.length - 1) || +((datum == hgmBins(i).hi) && (datum < hgmBins(i + 1).hi))) { +higherId = i + } +} + } + assert(lowerId <= higherId) + val lowerBinNdv = hgmBins(lowerId).ndv + val higherBinNdv = hgmBins(higherId).ndv + // assume uniform distribution in each bin + val percent = if (lowerId == higherId) { +(1.0 / hgmBins.length) / math.max(lowerBinNdv, 1) + } else { +1.0 / hgmBins.length * (higherId - lowerId - 1) + + (1.0 / hgmBins.length) / math.max(lowerBinNdv, 1) + + (1.0 / hgmBins.length) / math.max(higherBinNdv, 1) --- End diff -- bin's ndv will never be less than 1, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
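The equality-selectivity computation in the diff above amounts to the following Python sketch. It assumes an equi-height histogram (each bin covers `1/numBins` of the rows) and a uniform distribution within each bin; bins are modeled as hypothetical `(lo, hi, ndv)` triples rather than Spark's `HistogramBin`. As the reviewer notes, a bin's ndv is never below 1, so the `max(..., 1)` guards from the diff are kept only defensively:

```python
def equal_selectivity(value, bins):
    # bins: sorted, contiguous (lo, hi, ndv) triples of an equi-height
    # histogram; each bin holds height = 1/len(bins) of the rows.
    height = 1.0 / len(bins)
    first = next(i for i, (lo, hi, ndv) in enumerate(bins) if value <= hi)
    last = max(i for i, (lo, hi, ndv) in enumerate(bins) if value >= lo)
    if first == last:
        # value lies inside one bin: its share of that bin is 1/ndv.
        return height / max(bins[first][2], 1)
    # value sits on shared bin boundaries: full middle bins plus a
    # 1/ndv share of each of the two edge bins.
    return (height * (last - first - 1)
            + height / max(bins[first][2], 1)
            + height / max(bins[last][2], 1))

bins = [(1.0, 3.0, 3), (3.0, 6.0, 3), (6.0, 10.0, 4)]
```

For example, with the three bins above, a literal of 5.0 falls inside the middle bin only, giving `(1/3)/3 = 1/9`, while the boundary value 3.0 draws a `1/ndv` share from both of the first two bins.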
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153979157 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin into which a column values falls. + */ + + def findFirstBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +histogram.bins.foreach { bin => + if (value > bin.hi) binId += 1 +} +binId + } + + /** + * Returns the number of the last bin into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the last bin into which a column values falls. 
+ */ + + def findLastBinForValue(value: Double, histogram: Histogram): Int = { +var binId = 0 +for (i <- 0 until histogram.bins.length) { + if (value > histogram.bins(i).hi) { +// increment binId to point to next bin +binId += 1 + } + if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) { +if (value == histogram.bins(i + 1).lo) { + // increment binId since the value appears into this bin and next bin + binId += 1 +} + } +} +binId + } + + /** + * Returns a percentage of a bin holding values for column value in the range of + * [lowerValue, higherValue] + * + * @param binId a given bin id in a specified histogram + * @param higherValue a given upper bound value of a specified column value range + * @param lowerValue a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the percentage of a single bin holding values in [lowerValue, higherValue]. + */ + + private def getOccupation( + binId: Int, + higherValue: Double, + lowerValue: Double, + histogram: Histogram): Double = { +val curBin = histogram.bins(binId) +if (binId == 0 && curBin.hi == curBin.lo) { + // the Min of the histogram occupies the whole first bin + 1.0 +} else if (binId == 0 && curBin.hi != curBin.lo) { + if (higherValue == lowerValue) { +// in the case curBin.binNdv == 0, current bin is occupied by one value, which +// is included in the previous bin +1.0 / math.max(curBin.ndv.toDouble, 1) + } else { +(higherValue - lowerValue) / (curBin.hi - curBin.lo) + } +} else { + if (curBin.hi == curBin.lo) { +// the entire bin is covered in the range +1.0 + } else if (higherValue == lowerValue) { +// the literal value falls in this bin +1.0 / math.max(curBin.ndv.toDouble, 1) + } else { +// Use proration since the range falls inside this bin. +math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0) + } +} + } + + /** + * Returns the number of bins for column values in [lowerValue, higherValue]. 
+ * The column value distribution is saved in an equi-height histogram. + * + * @param higherEnd a given upper bound value of a specified column value range + * @param lowerEnd a given lower bound value of a specified column value range + * @param histogram a numeric equi-height histogram + * @return the selectivity percentage for column values in [lowerValue, higherValue]. + */ + + def getOccupationBins( + higherEnd: Double, + lowerEnd: Double, + histogram: Histogram): Double = { +// find bins where current min and max locate +val minBinId = findFirstBinForValue(lowerEnd, histogram) +val maxBinId = findLastBinForValue(higherEnd, histogram) --- End diff -- how about `lowerBinId, higherBinId`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
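`getOccupationBins` sums per-bin occupation over `[lowerEnd, higherEnd]`: a partial share of the two edge bins plus one full unit for each bin wholly inside the range. A simplified Python sketch of that accumulation (it omits the first-bin special cases debated in the review above, and the `(lo, hi, ndv)` triples are hypothetical stand-ins for `HistogramBin`):

```python
def get_occupation(bin_, hi_val, lo_val):
    # Fraction of one bin covered by [lo_val, hi_val], assuming a
    # uniform distribution inside the bin.
    lo, hi, ndv = bin_
    if hi == lo:
        return 1.0                      # single-value bin: fully covered
    if hi_val == lo_val:
        return 1.0 / max(ndv, 1)        # a single literal inside the bin
    return min((hi_val - lo_val) / (hi - lo), 1.0)

def get_occupation_bins(hi_end, lo_end, bins):
    # Locate the bins holding the range endpoints (cf. lowerBinId /
    # higherBinId in the review suggestion).
    lo_id = next(i for i, (lo, hi, ndv) in enumerate(bins) if lo_end <= hi)
    hi_id = max(i for i, (lo, hi, ndv) in enumerate(bins) if hi_end >= lo)
    if lo_id == hi_id:
        return get_occupation(bins[lo_id], hi_end, lo_end)
    total = get_occupation(bins[lo_id], bins[lo_id][1], lo_end)   # left edge
    total += get_occupation(bins[hi_id], hi_end, bins[hi_id][0])  # right edge
    total += hi_id - lo_id - 1                                    # full middle bins
    return total

bins = [(1.0, 3.0, 3), (3.0, 6.0, 3), (6.0, 10.0, 4)]
```

Dividing the returned bin count by the total number of bins then yields the range's selectivity, since each equi-height bin holds the same number of rows.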
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153970815 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala --- @@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging { colStatsMap.update(attr, newStats) } - Some(1.0 / BigDecimal(ndv)) -} else { + // We compute filter selectivity using Histogram information + attr.dataType match { --- End diff -- use `if (colStat.histogram.isEmpty)` to separate the logic of basic stats (`Some(1.0 / BigDecimal(ndv))`) and histogram computation.
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153902382 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin/bucket into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin/bucket into which a column values falls. + */ + + def findFirstBucketForValue(value: Double, histogram: Histogram): Int = { --- End diff -- We had bucket(s) and bin(s) used interchangeably. To avoid confusion, I will unify them to use only bin/bins. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r153686107 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala --- @@ -114,4 +114,197 @@ object EstimationUtils { } } + /** + * Returns the number of the first bin/bucket into which a column values falls for a specified + * numeric equi-height histogram. + * + * @param value a literal value of a column + * @param histogram a numeric equi-height histogram + * @return the number of the first bin/bucket into which a column values falls. + */ + + def findFirstBucketForValue(value: Double, histogram: Histogram): Int = { --- End diff -- Shall we unify all names to `bin`/`bins` in code and comments? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r151979117 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -158,8 +196,8 @@ class FilterEstimationSuite extends StatsEstimationTestBase { val condition = Not(And(LessThan(attrInt, Literal(3)), Literal(null, IntegerType))) validateEstimatedStats( Filter(condition, childStatsTestPlan(Seq(attrInt), 10L)), - Seq(attrInt -> colStatInt.copy(distinctCount = 8)), - expectedRowCount = 8) + Seq(attrInt -> colStatInt.copy(distinctCount = 7)), --- End diff -- +1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19783#discussion_r151890602 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala --- @@ -158,8 +196,8 @@ class FilterEstimationSuite extends StatsEstimationTestBase { val condition = Not(And(LessThan(attrInt, Literal(3)), Literal(null, IntegerType))) validateEstimatedStats( Filter(condition, childStatsTestPlan(Seq(attrInt), 10L)), - Seq(attrInt -> colStatInt.copy(distinctCount = 8)), - expectedRowCount = 8) + Seq(attrInt -> colStatInt.copy(distinctCount = 7)), --- End diff -- Shall we add new test cases for filter estimation based on histogram, instead of modifying existing test results? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org