[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-18 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r634746436



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -789,6 +797,39 @@ case class Range(
 }
   }
 
+  private def computeHistogramStatistics() = {

Review comment:
   > Trivial: you could put a return type here.Yes, I added the return type
   
   Yes, I updated with the return type.
   
   > Is there any concern with the overhead of computing the histogram? It only 
happens when enabled of course. I just wonder if we need to micro-optimize the 
loop below with something more performant if this is critical, but maybe it 
isn't.
   
   This is not a performance intensive path. This will execute during planning 
phase. Also the loop size is numBins, which is 254 by default. So I am not sure 
we need to micro-optimise the loop?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-17 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r633897836



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -789,6 +797,40 @@ case class Range(
 }
   }
 
+  private def computeHistogramStatistics() = {
+val numBins = conf.histogramNumBins
+val height = numElements.toDouble / numBins
+val percentileArray = (0 to numBins).map(i => i * height).toArray
+
+val binArray = new Array[HistogramBin](numBins)
+val lowerIndexInitial = percentileArray.head
+val lowerBinValueInitial = getRangeValue(0)
+percentileArray.tail.zipWithIndex
+  .foldLeft[(Double, Long)]((lowerIndexInitial, lowerBinValueInitial)) {
+case ((lowerIndex, lowerBinValue), (upperIndex, binId)) =>
+  // Integer index for upper and lower values in the bin.
+  val upperIndexPos = math.ceil(upperIndex).toInt - 1
+  val lowerIndexPos = math.ceil(lowerIndex).toInt - 1
+
+  val upperBinValue = getRangeValue(math.max(upperIndexPos, 0))
+  val ndv = math.max(upperIndexPos - lowerIndexPos, 1)
+  binArray(binId) = HistogramBin(lowerBinValue, upperBinValue, ndv)
+  // Update the lowerIndex and lowerBinValue with upper ones for the 
next iteration.
+  (upperIndex, upperBinValue)
+  }
+Histogram(height, binArray)
+  }
+
+  // Utility method to compute histogram
+  private def getRangeValue(index: Int): Long = {
+assert(index >= 0, "index must be greater than and equal to 0")
+if (step <= 0) {
+  start + (numElements.toLong - index - 1) * step

Review comment:
   Yes. I added a comment.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-17 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r633897836



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -789,6 +797,40 @@ case class Range(
 }
   }
 
+  private def computeHistogramStatistics() = {
+val numBins = conf.histogramNumBins
+val height = numElements.toDouble / numBins
+val percentileArray = (0 to numBins).map(i => i * height).toArray
+
+val binArray = new Array[HistogramBin](numBins)
+val lowerIndexInitial = percentileArray.head
+val lowerBinValueInitial = getRangeValue(0)
+percentileArray.tail.zipWithIndex
+  .foldLeft[(Double, Long)]((lowerIndexInitial, lowerBinValueInitial)) {
+case ((lowerIndex, lowerBinValue), (upperIndex, binId)) =>
+  // Integer index for upper and lower values in the bin.
+  val upperIndexPos = math.ceil(upperIndex).toInt - 1
+  val lowerIndexPos = math.ceil(lowerIndex).toInt - 1
+
+  val upperBinValue = getRangeValue(math.max(upperIndexPos, 0))
+  val ndv = math.max(upperIndexPos - lowerIndexPos, 1)
+  binArray(binId) = HistogramBin(lowerBinValue, upperBinValue, ndv)
+  // Update the lowerIndex and lowerBinValue with upper ones for the 
next iteration.
+  (upperIndex, upperBinValue)
+  }
+Histogram(height, binArray)
+  }
+
+  // Utility method to compute histogram
+  private def getRangeValue(index: Int): Long = {
+assert(index >= 0, "index must be greater than and equal to 0")
+if (step <= 0) {
+  start + (numElements.toLong - index - 1) * step

Review comment:
   Yes. Sure I will add a comment.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-17 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r633858095



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -789,6 +797,40 @@ case class Range(
 }
   }
 
+  private def computeHistogramStatistics() = {
+val numBins = conf.histogramNumBins
+val height = numElements.toDouble / numBins
+val percentileArray = (0 to numBins).map(i => i * height).toArray
+
+val binArray = new Array[HistogramBin](numBins)
+val lowerIndexInitial = percentileArray.head
+val lowerBinValueInitial = getRangeValue(0)
+percentileArray.tail.zipWithIndex
+  .foldLeft[(Double, Long)]((lowerIndexInitial, lowerBinValueInitial)) {
+case ((lowerIndex, lowerBinValue), (upperIndex, binId)) =>
+  // Integer index for upper and lower values in the bin.
+  val upperIndexPos = math.ceil(upperIndex).toInt - 1
+  val lowerIndexPos = math.ceil(lowerIndex).toInt - 1
+
+  val upperBinValue = getRangeValue(math.max(upperIndexPos, 0))
+  val ndv = math.max(upperIndexPos - lowerIndexPos, 1)
+  binArray(binId) = HistogramBin(lowerBinValue, upperBinValue, ndv)
+  // Update the lowerIndex and lowerBinValue with upper ones for the 
next iteration.
+  (upperIndex, upperBinValue)
+  }
+Histogram(height, binArray)

Review comment:
   Updated. Thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-17 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r633857010



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -789,6 +797,40 @@ case class Range(
 }
   }
 
+  private def computeHistogramStatistics() = {
+val numBins = conf.histogramNumBins
+val height = numElements.toDouble / numBins
+val percentileArray = (0 to numBins).map(i => i * height).toArray
+
+val binArray = new Array[HistogramBin](numBins)
+val lowerIndexInitial = percentileArray.head
+val lowerBinValueInitial = getRangeValue(0)
+percentileArray.tail.zipWithIndex
+  .foldLeft[(Double, Long)]((lowerIndexInitial, lowerBinValueInitial)) {
+case ((lowerIndex, lowerBinValue), (upperIndex, binId)) =>
+  // Integer index for upper and lower values in the bin.
+  val upperIndexPos = math.ceil(upperIndex).toInt - 1
+  val lowerIndexPos = math.ceil(lowerIndex).toInt - 1
+
+  val upperBinValue = getRangeValue(math.max(upperIndexPos, 0))
+  val ndv = math.max(upperIndexPos - lowerIndexPos, 1)
+  binArray(binId) = HistogramBin(lowerBinValue, upperBinValue, ndv)
+  // Update the lowerIndex and lowerBinValue with upper ones for the 
next iteration.
+  (upperIndex, upperBinValue)
+  }
+Histogram(height, binArray)
+  }
+
+  // Utility method to compute histogram
+  private def getRangeValue(index: Int): Long = {
+assert(index >= 0, "index must be greater than and equal to 0")
+if (step <= 0) {
+  start + (numElements.toLong - index - 1) * step

Review comment:
   >Also I assume step = 0 is an error actually, maybe it's caught earlier, 
but the result should be start always in that case anyway, so step < 0 as a 
condition?
   
   Yes, if step=0 both the conditions will return same start element. I have 
updated the condition to step < 0.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-17 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r633855607



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -789,6 +797,40 @@ case class Range(
 }
   }
 
+  private def computeHistogramStatistics() = {
+val numBins = conf.histogramNumBins
+val height = numElements.toDouble / numBins
+val percentileArray = (0 to numBins).map(i => i * height).toArray
+
+val binArray = new Array[HistogramBin](numBins)
+val lowerIndexInitial = percentileArray.head
+val lowerBinValueInitial = getRangeValue(0)
+percentileArray.tail.zipWithIndex
+  .foldLeft[(Double, Long)]((lowerIndexInitial, lowerBinValueInitial)) {
+case ((lowerIndex, lowerBinValue), (upperIndex, binId)) =>
+  // Integer index for upper and lower values in the bin.
+  val upperIndexPos = math.ceil(upperIndex).toInt - 1
+  val lowerIndexPos = math.ceil(lowerIndex).toInt - 1
+
+  val upperBinValue = getRangeValue(math.max(upperIndexPos, 0))
+  val ndv = math.max(upperIndexPos - lowerIndexPos, 1)
+  binArray(binId) = HistogramBin(lowerBinValue, upperBinValue, ndv)
+  // Update the lowerIndex and lowerBinValue with upper ones for the 
next iteration.
+  (upperIndex, upperBinValue)
+  }
+Histogram(height, binArray)
+  }
+
+  // Utility method to compute histogram
+  private def getRangeValue(index: Int): Long = {
+assert(index >= 0, "index must be greater than and equal to 0")
+if (step <= 0) {
+  start + (numElements.toLong - index - 1) * step

Review comment:
   >This tries to step backwards from the end, right? step is negative
   
   Yes, we reverse the range values if the step is negative to compute 
histogram statistics.
   
   >I'm probably missing something, but isn't this more like:
   start + (numElements.toLong - 1) * (-step) + index * step
   
   I am not sure this gives the right range values if the step is negative. 
   For eg: start = 1 , step = -2, and numElements = 4, Range values  = [1, -1, 
-3, -5]
   But for computing histogram we need values like this [-5, -3, -1, 1]. So if 
the index = 0, it should return -5.
   
   With the formula in the PR => 1 + (4 - 0 -1 )* -2= -5
With the formula you have given above  => 1 + (4 -1) *-(-2) + 0 = 7
   
   So I think the formula in the PR seems correct?
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631566208



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
##
@@ -283,14 +326,17 @@ class BasicStatsEstimationSuite extends PlanTest with 
StatsEstimationTestBase {
   private def checkStats(
   plan: LogicalPlan,
   expectedStatsCboOn: Statistics,
-  expectedStatsCboOff: Statistics): Unit = {
-withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+  expectedStatsCboOff: Statistics,
+  extraConfigs: Map[String, String] = Map.empty): Unit = {
+

Review comment:
   Yes, removed the extra line




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631565143



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
##
@@ -283,14 +326,17 @@ class BasicStatsEstimationSuite extends PlanTest with 
StatsEstimationTestBase {
   private def checkStats(
   plan: LogicalPlan,
   expectedStatsCboOn: Statistics,
-  expectedStatsCboOff: Statistics): Unit = {
-withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+  expectedStatsCboOff: Statistics,
+  extraConfigs: Map[String, String] = Map.empty): Unit = {
+

Review comment:
   I am not sure I understand you here. Do we need to directly put the 
histogram configs inside this method? By default histogram is disabled and 
number of bins default value is 254.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631564790



##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
##
@@ -77,12 +92,21 @@ class BasicStatsEstimationSuite extends PlanTest with 
StatsEstimationTestBase {
 max = Some(4),
 nullCount = Some(0),
 maxLen = Some(LongType.defaultSize),
-avgLen = Some(LongType.defaultSize))
-checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = 
rangeStats)
+avgLen = Some(LongType.defaultSize),
+histogram = histogram)
+val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true",
+  SQLConf.HISTOGRAM_NUM_BINS.key -> "3")
+checkStats(range, expectedStatsCboOn = rangeStats,
+  expectedStatsCboOff = rangeStats, extraConfig)
   }
 
   test("range with negative step") {
 val range = Range(-10, -20, -2, None)
+val histogramBins = new Array[HistogramBin](3)
+histogramBins(0) = HistogramBin(-18.0, -16.0, 2)
+histogramBins(1) = HistogramBin(-16.0, -12.0, 2)
+histogramBins(2) = HistogramBin(-12.0, -10.0, 1)

Review comment:
   Added assert to check if `range.numElements` and `ndv` are same

##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
##
@@ -97,12 +121,24 @@ class BasicStatsEstimationSuite extends PlanTest with 
StatsEstimationTestBase {
 max = Some(-10),
 nullCount = Some(0),
 maxLen = Some(LongType.defaultSize),
-avgLen = Some(LongType.defaultSize))
-checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = 
rangeStats)
+avgLen = Some(LongType.defaultSize),
+histogram = histogram)
+val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true",
+  SQLConf.HISTOGRAM_NUM_BINS.key -> "3")
+checkStats(range, expectedStatsCboOn = rangeStats,
+  expectedStatsCboOff = rangeStats, extraConfig)
   }
 
   test("range with negative step where end minus start not divisible by step") 
{
+
 val range = Range(-10, -20, -3, None)
+
+val histogramBins = new Array[HistogramBin](3)
+histogramBins(0) = HistogramBin(-19.0, -16.0, 2)
+histogramBins(1) = HistogramBin(-16.0, -13.0, 1)
+histogramBins(2) = HistogramBin(-13.0, -10.0, 1)

Review comment:
   Updated




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631564612



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -789,6 +797,38 @@ case class Range(
 }
   }
 
+  private def computeHistogramStatistics() = {
+val numBins = conf.histogramNumBins
+val height = numElements.toDouble / numBins
+val percentileArray = (0 to numBins).map(i => i * height).toArray
+
+val binArray = new Array[HistogramBin](numBins)
+var lowerIndex = percentileArray.head
+var lowerBinValue = getRangeValue(0)
+percentileArray.tail.zipWithIndex.foreach { case (upperIndex, binId) =>
+  // Integer index for upper and lower values in the bin.
+  val upperIndexPos = math.ceil(upperIndex).toInt - 1
+  val lowerIndexPos = math.ceil(lowerIndex).toInt - 1
+
+  val upperBinValue = getRangeValue(math.max(upperIndexPos, 0))
+  val ndv = math.max(upperIndexPos - lowerIndexPos, 1)
+  binArray(binId) = HistogramBin(lowerBinValue, upperBinValue, ndv)
+
+  lowerBinValue = upperBinValue
+  lowerIndex = upperIndex
+}
+Histogram(height, binArray)
+  }
+
+  // Utility method to compute histogram
+  private def getRangeValue(index: Int): Long = {

Review comment:
   Done

##
File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
##
@@ -97,12 +121,24 @@ class BasicStatsEstimationSuite extends PlanTest with 
StatsEstimationTestBase {
 max = Some(-10),
 nullCount = Some(0),
 maxLen = Some(LongType.defaultSize),
-avgLen = Some(LongType.defaultSize))
-checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = 
rangeStats)
+avgLen = Some(LongType.defaultSize),
+histogram = histogram)
+val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true",
+  SQLConf.HISTOGRAM_NUM_BINS.key -> "3")
+checkStats(range, expectedStatsCboOn = rangeStats,
+  expectedStatsCboOff = rangeStats, extraConfig)
   }
 
   test("range with negative step where end minus start not divisible by step") 
{
+

Review comment:
   Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631564557



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -789,6 +797,38 @@ case class Range(
 }
   }
 
+  private def computeHistogramStatistics() = {
+val numBins = conf.histogramNumBins
+val height = numElements.toDouble / numBins
+val percentileArray = (0 to numBins).map(i => i * height).toArray
+
+val binArray = new Array[HistogramBin](numBins)
+var lowerIndex = percentileArray.head
+var lowerBinValue = getRangeValue(0)

Review comment:
   Yes, updated.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631536454



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -789,6 +793,36 @@ case class Range(
 }
   }
 
+  private def getHistogramStatistics = {
+val numBins = conf.histogramNumBins
+val height = numElements.toDouble / numBins
+val percentileArray = (0 to numBins).map(i => i * height).toArray
+
+var binId = 0
+val binArray = new Array[HistogramBin](numBins)
+var lowerIndex = percentileArray(0)
+var lowerBinValue = getRangeValue(0)
+while (binId < numBins) {
+  val upperIndex = percentileArray(binId + 1)
+  val upperBinValue = getRangeValue(math.max(math.ceil(upperIndex).toInt - 
1, 0))
+  val ndv = math.max(math.ceil(upperIndex).toInt - 
math.ceil(lowerIndex).toInt, 1)

Review comment:
   Yes, updated




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631536413



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -774,13 +774,17 @@ case class Range(
   } else {
 (start + (numElements - 1) * step, start)
   }
+
+  val histogram = getHistogramStatistics

Review comment:
   Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631536346



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -789,6 +793,36 @@ case class Range(
 }
   }
 
+  private def getHistogramStatistics = {
+val numBins = conf.histogramNumBins
+val height = numElements.toDouble / numBins
+val percentileArray = (0 to numBins).map(i => i * height).toArray
+
+var binId = 0

Review comment:
   Yes, I updated this with `zipWithIndex`

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -789,6 +793,36 @@ case class Range(
 }
   }
 
+  private def getHistogramStatistics = {

Review comment:
   Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631385071



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -759,6 +759,10 @@ case class Range(
   }
 
   override def computeStats(): Statistics = {
+if (!conf.cboEnabled) {
+  return Statistics(sizeInBytes = LongType.defaultSize * numElements)
+}
+

Review comment:
   I removed histogram config check as well. Because histogram information 
is readily available. Also refactored the code and UTs




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631385071



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -759,6 +759,10 @@ case class Range(
   }
 
   override def computeStats(): Statistics = {
+if (!conf.cboEnabled) {
+  return Statistics(sizeInBytes = LongType.defaultSize * numElements)
+}
+

Review comment:
   I removed histogram config check as well. Because histogram information 
is readily available. Also refactored the code and Its




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631162814



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -759,6 +759,10 @@ case class Range(
   }
 
   override def computeStats(): Statistics = {
+if (!conf.cboEnabled) {
+  return Statistics(sizeInBytes = LongType.defaultSize * numElements)
+}
+

Review comment:
   I removed the condition. I have added the condition earlier because, if 
cbo is not enabled, other operators are not returning column statistics. But 
yes, in this case we have all the statistics readily available. Do we also need 
to remove histogram check as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631162814



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -759,6 +759,10 @@ case class Range(
   }
 
   override def computeStats(): Statistics = {
+if (!conf.cboEnabled) {
+  return Statistics(sizeInBytes = LongType.defaultSize * numElements)
+}
+

Review comment:
   I removed the condition. I have added the condition earlier because, if 
cbo is not enabled, other operators are not returning other statistics. But 
yes, in this case we have all the statistics readily available. Do we also need 
to remove histogram check as well?

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -767,14 +771,44 @@ case class Range(
   } else {
 (start + (numElements - 1) * step, start)
   }
-  val colStat = ColumnStat(
+  var colStat = ColumnStat(

Review comment:
   Done

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -767,14 +771,44 @@ case class Range(
   } else {
 (start + (numElements - 1) * step, start)
   }
-  val colStat = ColumnStat(
+  var colStat = ColumnStat(
 distinctCount = Some(numElements),
 max = Some(maxVal),
 min = Some(minVal),
 nullCount = Some(0),
 avgLen = Some(LongType.defaultSize),
 maxLen = Some(LongType.defaultSize))
 
+  if (conf.histogramEnabled) {
+

Review comment:
   Removed

##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -767,14 +771,44 @@ case class Range(
   } else {
 (start + (numElements - 1) * step, start)
   }
-  val colStat = ColumnStat(
+  var colStat = ColumnStat(
 distinctCount = Some(numElements),
 max = Some(maxVal),
 min = Some(minVal),
 nullCount = Some(0),
 avgLen = Some(LongType.defaultSize),
 maxLen = Some(LongType.defaultSize))
 
+  if (conf.histogramEnabled) {
+
+def getRangeValue(index: Int): Long = {

Review comment:
   Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation

2021-05-12 Thread GitBox


shahidki31 commented on a change in pull request #32498:
URL: https://github.com/apache/spark/pull/32498#discussion_r631162814



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -759,6 +759,10 @@ case class Range(
   }
 
   override def computeStats(): Statistics = {
+if (!conf.cboEnabled) {
+  return Statistics(sizeInBytes = LongType.defaultSize * numElements)
+}
+

Review comment:
   I removed the condition as we have all the statistics available for the 
operator. I have added the condition earlier because, if cbo is not enabled, 
other operators are not returning other statistics. Do we also need to remove 
histogram check as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org