[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r634746436 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -789,6 +797,39 @@ case class Range( } } + private def computeHistogramStatistics() = { Review comment: > Trivial: you could put a return type here.Yes, I added the return type Yes, I updated with the return type. > Is there any concern with the overhead of computing the histogram? It only happens when enabled of course. I just wonder if we need to micro-optimize the loop below with something more performant if this is critical, but maybe it isn't. This is not a performance intensive path. This will execute during planning phase. Also the loop size is numBins, which is 254 by default. So I am not sure we need to micro-optimise the loop? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r633897836 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -789,6 +797,40 @@ case class Range( } } + private def computeHistogramStatistics() = { +val numBins = conf.histogramNumBins +val height = numElements.toDouble / numBins +val percentileArray = (0 to numBins).map(i => i * height).toArray + +val binArray = new Array[HistogramBin](numBins) +val lowerIndexInitial = percentileArray.head +val lowerBinValueInitial = getRangeValue(0) +percentileArray.tail.zipWithIndex + .foldLeft[(Double, Long)]((lowerIndexInitial, lowerBinValueInitial)) { +case ((lowerIndex, lowerBinValue), (upperIndex, binId)) => + // Integer index for upper and lower values in the bin. + val upperIndexPos = math.ceil(upperIndex).toInt - 1 + val lowerIndexPos = math.ceil(lowerIndex).toInt - 1 + + val upperBinValue = getRangeValue(math.max(upperIndexPos, 0)) + val ndv = math.max(upperIndexPos - lowerIndexPos, 1) + binArray(binId) = HistogramBin(lowerBinValue, upperBinValue, ndv) + // Update the lowerIndex and lowerBinValue with upper ones for the next iteration. + (upperIndex, upperBinValue) + } +Histogram(height, binArray) + } + + // Utility method to compute histogram + private def getRangeValue(index: Int): Long = { +assert(index >= 0, "index must be greater than and equal to 0") +if (step <= 0) { + start + (numElements.toLong - index - 1) * step Review comment: Yes. I added a comment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r633897836 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -789,6 +797,40 @@ case class Range( } } + private def computeHistogramStatistics() = { +val numBins = conf.histogramNumBins +val height = numElements.toDouble / numBins +val percentileArray = (0 to numBins).map(i => i * height).toArray + +val binArray = new Array[HistogramBin](numBins) +val lowerIndexInitial = percentileArray.head +val lowerBinValueInitial = getRangeValue(0) +percentileArray.tail.zipWithIndex + .foldLeft[(Double, Long)]((lowerIndexInitial, lowerBinValueInitial)) { +case ((lowerIndex, lowerBinValue), (upperIndex, binId)) => + // Integer index for upper and lower values in the bin. + val upperIndexPos = math.ceil(upperIndex).toInt - 1 + val lowerIndexPos = math.ceil(lowerIndex).toInt - 1 + + val upperBinValue = getRangeValue(math.max(upperIndexPos, 0)) + val ndv = math.max(upperIndexPos - lowerIndexPos, 1) + binArray(binId) = HistogramBin(lowerBinValue, upperBinValue, ndv) + // Update the lowerIndex and lowerBinValue with upper ones for the next iteration. + (upperIndex, upperBinValue) + } +Histogram(height, binArray) + } + + // Utility method to compute histogram + private def getRangeValue(index: Int): Long = { +assert(index >= 0, "index must be greater than and equal to 0") +if (step <= 0) { + start + (numElements.toLong - index - 1) * step Review comment: Yes. Sure I will add a comment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r633858095 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -789,6 +797,40 @@ case class Range( } } + private def computeHistogramStatistics() = { +val numBins = conf.histogramNumBins +val height = numElements.toDouble / numBins +val percentileArray = (0 to numBins).map(i => i * height).toArray + +val binArray = new Array[HistogramBin](numBins) +val lowerIndexInitial = percentileArray.head +val lowerBinValueInitial = getRangeValue(0) +percentileArray.tail.zipWithIndex + .foldLeft[(Double, Long)]((lowerIndexInitial, lowerBinValueInitial)) { +case ((lowerIndex, lowerBinValue), (upperIndex, binId)) => + // Integer index for upper and lower values in the bin. + val upperIndexPos = math.ceil(upperIndex).toInt - 1 + val lowerIndexPos = math.ceil(lowerIndex).toInt - 1 + + val upperBinValue = getRangeValue(math.max(upperIndexPos, 0)) + val ndv = math.max(upperIndexPos - lowerIndexPos, 1) + binArray(binId) = HistogramBin(lowerBinValue, upperBinValue, ndv) + // Update the lowerIndex and lowerBinValue with upper ones for the next iteration. + (upperIndex, upperBinValue) + } +Histogram(height, binArray) Review comment: Updated. Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r633857010 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -789,6 +797,40 @@ case class Range( } } + private def computeHistogramStatistics() = { +val numBins = conf.histogramNumBins +val height = numElements.toDouble / numBins +val percentileArray = (0 to numBins).map(i => i * height).toArray + +val binArray = new Array[HistogramBin](numBins) +val lowerIndexInitial = percentileArray.head +val lowerBinValueInitial = getRangeValue(0) +percentileArray.tail.zipWithIndex + .foldLeft[(Double, Long)]((lowerIndexInitial, lowerBinValueInitial)) { +case ((lowerIndex, lowerBinValue), (upperIndex, binId)) => + // Integer index for upper and lower values in the bin. + val upperIndexPos = math.ceil(upperIndex).toInt - 1 + val lowerIndexPos = math.ceil(lowerIndex).toInt - 1 + + val upperBinValue = getRangeValue(math.max(upperIndexPos, 0)) + val ndv = math.max(upperIndexPos - lowerIndexPos, 1) + binArray(binId) = HistogramBin(lowerBinValue, upperBinValue, ndv) + // Update the lowerIndex and lowerBinValue with upper ones for the next iteration. + (upperIndex, upperBinValue) + } +Histogram(height, binArray) + } + + // Utility method to compute histogram + private def getRangeValue(index: Int): Long = { +assert(index >= 0, "index must be greater than and equal to 0") +if (step <= 0) { + start + (numElements.toLong - index - 1) * step Review comment: >Also I assume step = 0 is an error actually, maybe it's caught earlier, but the result should be start always in that case anyway, so step < 0 as a condition? Yes, if step=0 both the conditions will return same start element. I have updated the condition to step < 0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r633855607 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -789,6 +797,40 @@ case class Range( } } + private def computeHistogramStatistics() = { +val numBins = conf.histogramNumBins +val height = numElements.toDouble / numBins +val percentileArray = (0 to numBins).map(i => i * height).toArray + +val binArray = new Array[HistogramBin](numBins) +val lowerIndexInitial = percentileArray.head +val lowerBinValueInitial = getRangeValue(0) +percentileArray.tail.zipWithIndex + .foldLeft[(Double, Long)]((lowerIndexInitial, lowerBinValueInitial)) { +case ((lowerIndex, lowerBinValue), (upperIndex, binId)) => + // Integer index for upper and lower values in the bin. + val upperIndexPos = math.ceil(upperIndex).toInt - 1 + val lowerIndexPos = math.ceil(lowerIndex).toInt - 1 + + val upperBinValue = getRangeValue(math.max(upperIndexPos, 0)) + val ndv = math.max(upperIndexPos - lowerIndexPos, 1) + binArray(binId) = HistogramBin(lowerBinValue, upperBinValue, ndv) + // Update the lowerIndex and lowerBinValue with upper ones for the next iteration. + (upperIndex, upperBinValue) + } +Histogram(height, binArray) + } + + // Utility method to compute histogram + private def getRangeValue(index: Int): Long = { +assert(index >= 0, "index must be greater than and equal to 0") +if (step <= 0) { + start + (numElements.toLong - index - 1) * step Review comment: >This tries to step backwards from the end, right? step is negative Yes, we reverse the range values if the step is negative to compute histogram statistics. >I'm probably missing something, but isn't this more like: start + (numElements.toLong - 1) * (-step) + index * step I am not sure this gives the right range values if the step is negative. For eg: start = 1 , step = -2, and numElements = 4, Range values = [1, -1, -3, -5] But for computing histogram we need values like this [-5, -3, -1, 1]. So if the index = 0, it should return -5. With the formula in the PR => 1 + (4 - 0 -1 )* -2= -5 With the formula you have given above => 1 + (4 -1) *-(-2) + 0 = 7 So I think the formula in the PR seems correct? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631566208 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala ## @@ -283,14 +326,17 @@ class BasicStatsEstimationSuite extends PlanTest with StatsEstimationTestBase { private def checkStats( plan: LogicalPlan, expectedStatsCboOn: Statistics, - expectedStatsCboOff: Statistics): Unit = { -withSQLConf(SQLConf.CBO_ENABLED.key -> "true") { + expectedStatsCboOff: Statistics, + extraConfigs: Map[String, String] = Map.empty): Unit = { + Review comment: Yes, removed the extra line -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631565143 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala ## @@ -283,14 +326,17 @@ class BasicStatsEstimationSuite extends PlanTest with StatsEstimationTestBase { private def checkStats( plan: LogicalPlan, expectedStatsCboOn: Statistics, - expectedStatsCboOff: Statistics): Unit = { -withSQLConf(SQLConf.CBO_ENABLED.key -> "true") { + expectedStatsCboOff: Statistics, + extraConfigs: Map[String, String] = Map.empty): Unit = { + Review comment: I am not sure I understand you here. Do we need to directly put the histogram configs inside this method? By default histogram is disabled and number of bins default value is 254. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631564790 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala ## @@ -77,12 +92,21 @@ class BasicStatsEstimationSuite extends PlanTest with StatsEstimationTestBase { max = Some(4), nullCount = Some(0), maxLen = Some(LongType.defaultSize), -avgLen = Some(LongType.defaultSize)) -checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = rangeStats) +avgLen = Some(LongType.defaultSize), +histogram = histogram) +val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true", + SQLConf.HISTOGRAM_NUM_BINS.key -> "3") +checkStats(range, expectedStatsCboOn = rangeStats, + expectedStatsCboOff = rangeStats, extraConfig) } test("range with negative step") { val range = Range(-10, -20, -2, None) +val histogramBins = new Array[HistogramBin](3) +histogramBins(0) = HistogramBin(-18.0, -16.0, 2) +histogramBins(1) = HistogramBin(-16.0, -12.0, 2) +histogramBins(2) = HistogramBin(-12.0, -10.0, 1) Review comment: Added assert to check if `range.numElements` and `ndv` are same ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala ## @@ -97,12 +121,24 @@ class BasicStatsEstimationSuite extends PlanTest with StatsEstimationTestBase { max = Some(-10), nullCount = Some(0), maxLen = Some(LongType.defaultSize), -avgLen = Some(LongType.defaultSize)) -checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = rangeStats) +avgLen = Some(LongType.defaultSize), +histogram = histogram) +val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true", + SQLConf.HISTOGRAM_NUM_BINS.key -> "3") +checkStats(range, expectedStatsCboOn = rangeStats, + expectedStatsCboOff = rangeStats, extraConfig) } test("range with negative step where end minus start not divisible by step") { + val range = Range(-10, -20, -3, None) + +val histogramBins = new Array[HistogramBin](3) +histogramBins(0) = HistogramBin(-19.0, -16.0, 2) +histogramBins(1) = HistogramBin(-16.0, -13.0, 1) +histogramBins(2) = HistogramBin(-13.0, -10.0, 1) Review comment: Updated -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631564612 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -789,6 +797,38 @@ case class Range( } } + private def computeHistogramStatistics() = { +val numBins = conf.histogramNumBins +val height = numElements.toDouble / numBins +val percentileArray = (0 to numBins).map(i => i * height).toArray + +val binArray = new Array[HistogramBin](numBins) +var lowerIndex = percentileArray.head +var lowerBinValue = getRangeValue(0) +percentileArray.tail.zipWithIndex.foreach { case (upperIndex, binId) => + // Integer index for upper and lower values in the bin. + val upperIndexPos = math.ceil(upperIndex).toInt - 1 + val lowerIndexPos = math.ceil(lowerIndex).toInt - 1 + + val upperBinValue = getRangeValue(math.max(upperIndexPos, 0)) + val ndv = math.max(upperIndexPos - lowerIndexPos, 1) + binArray(binId) = HistogramBin(lowerBinValue, upperBinValue, ndv) + + lowerBinValue = upperBinValue + lowerIndex = upperIndex +} +Histogram(height, binArray) + } + + // Utility method to compute histogram + private def getRangeValue(index: Int): Long = { Review comment: Done ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala ## @@ -97,12 +121,24 @@ class BasicStatsEstimationSuite extends PlanTest with StatsEstimationTestBase { max = Some(-10), nullCount = Some(0), maxLen = Some(LongType.defaultSize), -avgLen = Some(LongType.defaultSize)) -checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = rangeStats) +avgLen = Some(LongType.defaultSize), +histogram = histogram) +val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true", + SQLConf.HISTOGRAM_NUM_BINS.key -> "3") +checkStats(range, expectedStatsCboOn = rangeStats, + expectedStatsCboOff = rangeStats, extraConfig) } test("range with negative step where end minus start not divisible by step") { + Review comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631564557 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -789,6 +797,38 @@ case class Range( } } + private def computeHistogramStatistics() = { +val numBins = conf.histogramNumBins +val height = numElements.toDouble / numBins +val percentileArray = (0 to numBins).map(i => i * height).toArray + +val binArray = new Array[HistogramBin](numBins) +var lowerIndex = percentileArray.head +var lowerBinValue = getRangeValue(0) Review comment: Yes, updated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631536454 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -789,6 +793,36 @@ case class Range( } } + private def getHistogramStatistics = { +val numBins = conf.histogramNumBins +val height = numElements.toDouble / numBins +val percentileArray = (0 to numBins).map(i => i * height).toArray + +var binId = 0 +val binArray = new Array[HistogramBin](numBins) +var lowerIndex = percentileArray(0) +var lowerBinValue = getRangeValue(0) +while (binId < numBins) { + val upperIndex = percentileArray(binId + 1) + val upperBinValue = getRangeValue(math.max(math.ceil(upperIndex).toInt - 1, 0)) + val ndv = math.max(math.ceil(upperIndex).toInt - math.ceil(lowerIndex).toInt, 1) Review comment: Yes, updated -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631536413 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -774,13 +774,17 @@ case class Range( } else { (start + (numElements - 1) * step, start) } + + val histogram = getHistogramStatistics Review comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631536346 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -789,6 +793,36 @@ case class Range( } } + private def getHistogramStatistics = { +val numBins = conf.histogramNumBins +val height = numElements.toDouble / numBins +val percentileArray = (0 to numBins).map(i => i * height).toArray + +var binId = 0 Review comment: Yes, I updated this with `zipWithIndex` ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -789,6 +793,36 @@ case class Range( } } + private def getHistogramStatistics = { Review comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631385071 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -759,6 +759,10 @@ case class Range( } override def computeStats(): Statistics = { +if (!conf.cboEnabled) { + return Statistics(sizeInBytes = LongType.defaultSize * numElements) +} + Review comment: I removed histogram config check as well. Because histogram information is readily available. Also refactored the code and UTs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631385071 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -759,6 +759,10 @@ case class Range( } override def computeStats(): Statistics = { +if (!conf.cboEnabled) { + return Statistics(sizeInBytes = LongType.defaultSize * numElements) +} + Review comment: I removed histogram config check as well. Because histogram information is readily available. Also refactored the code and Its -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631162814 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -759,6 +759,10 @@ case class Range( } override def computeStats(): Statistics = { +if (!conf.cboEnabled) { + return Statistics(sizeInBytes = LongType.defaultSize * numElements) +} + Review comment: I removed the condition. I have added the condition earlier because, if cbo is not enabled, other operators are not returning column statistics. But yes, in this case we have all the statistics readily available. Do we also need to remove histogram check as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631162814 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -759,6 +759,10 @@ case class Range( } override def computeStats(): Statistics = { +if (!conf.cboEnabled) { + return Statistics(sizeInBytes = LongType.defaultSize * numElements) +} + Review comment: I removed the condition. I have added the condition earlier because, if cbo is not enabled, other operators are not returning other statistics. But yes, in this case we have all the statistics readily available. Do we also need to remove histogram check as well? ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -767,14 +771,44 @@ case class Range( } else { (start + (numElements - 1) * step, start) } - val colStat = ColumnStat( + var colStat = ColumnStat( Review comment: Done ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -767,14 +771,44 @@ case class Range( } else { (start + (numElements - 1) * step, start) } - val colStat = ColumnStat( + var colStat = ColumnStat( distinctCount = Some(numElements), max = Some(maxVal), min = Some(minVal), nullCount = Some(0), avgLen = Some(LongType.defaultSize), maxLen = Some(LongType.defaultSize)) + if (conf.histogramEnabled) { + Review comment: Removed ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -767,14 +771,44 @@ case class Range( } else { (start + (numElements - 1) * step, start) } - val colStat = ColumnStat( + var colStat = ColumnStat( distinctCount = Some(numElements), max = Some(maxVal), min = Some(minVal), nullCount = Some(0), avgLen = Some(LongType.defaultSize), maxLen = Some(LongType.defaultSize)) + if (conf.histogramEnabled) { + +def getRangeValue(index: Int): Long = { Review comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631162814 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -759,6 +759,10 @@ case class Range( } override def computeStats(): Statistics = { +if (!conf.cboEnabled) { + return Statistics(sizeInBytes = LongType.defaultSize * numElements) +} + Review comment: I removed the condition as we have all the statistics available for the operator. I have added the condition earlier because, if cbo is not enabled, other operators are not returning other statistics. Do we also need to remove histogram check as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org