[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/17065#discussion_r104098256

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -95,15 +84,16 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    * @param condition the compound logical expression
    * @param update a boolean flag to specify if we need to update ColumnStat of a column
    *               for subsequent conditions
-   * @return a double value to show the percentage of rows meeting a given condition.
+   * @return an optional double value to show the percentage of rows meeting a given condition.
    *         It returns None if the condition is not supported.
    */
   def calculateFilterSelectivity(condition: Expression, update: Boolean = true): Option[Double] = {
-
     condition match {
       case And(cond1, cond2) =>
-        (calculateFilterSelectivity(cond1, update), calculateFilterSelectivity(cond2, update))
-        match {
+        // For ease of debugging, we compute percent1 and percent2 in 2 statements.
+        val percent1 = calculateFilterSelectivity(cond1, update)
+        val percent2 = calculateFilterSelectivity(cond2, update)
+        (percent1, percent2) match {
           case (Some(p1), Some(p2)) => Some(p1 * p2)
           case (Some(p1), None) => Some(p1)
--- End diff --

@cloud-fan @ron8hu I'm a little confused about this: for a `Not` expression, an over-estimate of the inner condition always turns into an under-estimate, whether the `Not` is nested or not. So should we remove support for nested `Not` only, or for `Not` entirely?

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17065
[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17065#discussion_r103097107

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -95,15 +84,16 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    * @param condition the compound logical expression
    * @param update a boolean flag to specify if we need to update ColumnStat of a column
    *               for subsequent conditions
-   * @return a double value to show the percentage of rows meeting a given condition.
+   * @return an optional double value to show the percentage of rows meeting a given condition.
    *         It returns None if the condition is not supported.
    */
   def calculateFilterSelectivity(condition: Expression, update: Boolean = true): Option[Double] = {
-
     condition match {
       case And(cond1, cond2) =>
-        (calculateFilterSelectivity(cond1, update), calculateFilterSelectivity(cond2, update))
-        match {
+        // For ease of debugging, we compute percent1 and percent2 in 2 statements.
+        val percent1 = calculateFilterSelectivity(cond1, update)
+        val percent2 = calculateFilterSelectivity(cond2, update)
+        (percent1, percent2) match {
           case (Some(p1), Some(p2)) => Some(p1 * p2)
           case (Some(p1), None) => Some(p1)
--- End diff --

SGTM
[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/17065#discussion_r103090517

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -361,57 +343,52 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    */
   def evaluateInSet(
-      attrRef: AttributeReference,
+      attr: Attribute,
       hSet: Set[Any],
-      update: Boolean)
-    : Option[Double] = {
-    if (!mutableColStats.contains(attrRef.exprId)) {
-      logDebug("[CBO] No statistics for " + attrRef)
+      update: Boolean): Option[Double] = {
+    if (!colStatsMap.contains(attr)) {
+      logDebug("[CBO] No statistics for " + attr)
       return None
     }
-    val aColStat = mutableColStats(attrRef.exprId)
-    val ndv = aColStat.distinctCount
-    val aType = attrRef.dataType
-    var newNdv: Long = 0
+    val colStat = colStatsMap(attr)
+    val ndv = colStat.distinctCount
+    val dataType = attr.dataType
+    var newNdv = ndv

     // use [min, max] to filter the original hSet
-    aType match {
-      case _: NumericType | DateType | TimestampType =>
-        val statsRange =
-          Range(aColStat.min, aColStat.max, aType).asInstanceOf[NumericRange]
-
-        // To facilitate finding the min and max values in hSet, we map hSet values to BigDecimal.
-        // Using hSetBigdec, we can find the min and max values quickly in the ordered hSetBigdec.
-        val hSetBigdec = hSet.map(e => BigDecimal(e.toString))
-        val validQuerySet = hSetBigdec.filter(e => e >= statsRange.min && e <= statsRange.max)
-        // We use hSetBigdecToAnyMap to help us find the original hSet value.
-        val hSetBigdecToAnyMap: Map[BigDecimal, Any] =
-          hSet.map(e => BigDecimal(e.toString) -> e).toMap
+    dataType match {
+      case _: NumericType | BooleanType | DateType | TimestampType =>
+        val statsRange = Range(colStat.min, colStat.max, dataType).asInstanceOf[NumericRange]
+        val validQuerySet = hSet.filter { v =>
+          v != null && statsRange.contains(Literal(v, dataType))
--- End diff --

nit: for better readability: (v != null) && statsRange.contains(Literal(v, dataType))
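The [min, max] pruning under review can be sketched in plain Scala. This is a simplified stand-in, not Spark's actual code: `SketchRange` and `inSetSelectivity` are illustrative names, and a plain `BigDecimal` range replaces Spark's `NumericRange`/`Literal` machinery.

```scala
// Simplified sketch of evaluateInSet's range pruning; illustrative, not Spark's code.
case class SketchRange(min: BigDecimal, max: BigDecimal) {
  def contains(v: BigDecimal): Boolean = v >= min && v <= max
}

// Keep only IN-list values inside the column's [min, max] statistics, then
// estimate selectivity as (matching distinct values) / ndv, capped at 1.0.
def inSetSelectivity(hSet: Set[Any], ndv: Long, range: SketchRange): Option[Double] = {
  if (ndv <= 0) return None
  val validQuerySet = hSet.filter { v =>
    (v != null) && range.contains(BigDecimal(v.toString))
  }
  if (validQuerySet.isEmpty) Some(0.0)
  else Some(math.min(validQuerySet.size.toDouble / ndv, 1.0))
}
```

For example, with statistics [0, 10] and ndv = 10, the IN-list {1, 5, 100} prunes to {1, 5} and yields a selectivity estimate of 0.2.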
[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/17065#discussion_r103087483

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -95,15 +84,16 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    * @param condition the compound logical expression
    * @param update a boolean flag to specify if we need to update ColumnStat of a column
    *               for subsequent conditions
-   * @return a double value to show the percentage of rows meeting a given condition.
+   * @return an optional double value to show the percentage of rows meeting a given condition.
    *         It returns None if the condition is not supported.
    */
   def calculateFilterSelectivity(condition: Expression, update: Boolean = true): Option[Double] = {
-
     condition match {
       case And(cond1, cond2) =>
-        (calculateFilterSelectivity(cond1, update), calculateFilterSelectivity(cond2, update))
-        match {
+        // For ease of debugging, we compute percent1 and percent2 in 2 statements.
+        val percent1 = calculateFilterSelectivity(cond1, update)
+        val percent2 = calculateFilterSelectivity(cond2, update)
+        (percent1, percent2) match {
           case (Some(p1), Some(p2)) => Some(p1 * p2)
           case (Some(p1), None) => Some(p1)
--- End diff --

This shows that it is difficult to always over-estimate. How about we do not handle nested `NOT`?
[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...
Github user ron8hu commented on a diff in the pull request: https://github.com/apache/spark/pull/17065#discussion_r103087345

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -297,6 +278,8 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
           Some(DateTimeUtils.toJavaDate(litValue.toString.toInt))
         case TimestampType =>
           Some(DateTimeUtils.toJavaTimestamp(litValue.toString.toLong))
+        case _: DecimalType =>
+          Some(litValue.asInstanceOf[Decimal].toJavaBigDecimal)
--- End diff --

Agreed. Thanks for fixing it.
[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17065#discussion_r103073176

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -361,57 +343,52 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    */
   def evaluateInSet(
-      attrRef: AttributeReference,
+      attr: Attribute,
       hSet: Set[Any],
-      update: Boolean)
-    : Option[Double] = {
-    if (!mutableColStats.contains(attrRef.exprId)) {
-      logDebug("[CBO] No statistics for " + attrRef)
+      update: Boolean): Option[Double] = {
+    if (!colStatsMap.contains(attr)) {
+      logDebug("[CBO] No statistics for " + attr)
       return None
     }
-    val aColStat = mutableColStats(attrRef.exprId)
-    val ndv = aColStat.distinctCount
-    val aType = attrRef.dataType
-    var newNdv: Long = 0
+    val colStat = colStatsMap(attr)
+    val ndv = colStat.distinctCount
+    val dataType = attr.dataType
+    var newNdv = ndv

     // use [min, max] to filter the original hSet
-    aType match {
-      case _: NumericType | DateType | TimestampType =>
-        val statsRange =
-          Range(aColStat.min, aColStat.max, aType).asInstanceOf[NumericRange]
-
-        // To facilitate finding the min and max values in hSet, we map hSet values to BigDecimal.
-        // Using hSetBigdec, we can find the min and max values quickly in the ordered hSetBigdec.
-        val hSetBigdec = hSet.map(e => BigDecimal(e.toString))
-        val validQuerySet = hSetBigdec.filter(e => e >= statsRange.min && e <= statsRange.max)
-        // We use hSetBigdecToAnyMap to help us find the original hSet value.
-        val hSetBigdecToAnyMap: Map[BigDecimal, Any] =
-          hSet.map(e => BigDecimal(e.toString) -> e).toMap
+    dataType match {
+      case _: NumericType | BooleanType | DateType | TimestampType =>
--- End diff --

add boolean type
[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17065#discussion_r103073163

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -297,6 +278,8 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
           Some(DateTimeUtils.toJavaDate(litValue.toString.toInt))
         case TimestampType =>
           Some(DateTimeUtils.toJavaTimestamp(litValue.toString.toLong))
+        case _: DecimalType =>
+          Some(litValue.asInstanceOf[Decimal].toJavaBigDecimal)
--- End diff --

@ron8hu the external value type of `DecimalType` is java decimal, and the internal value type is `Decimal`; we need to convert it.
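The internal/external distinction can be illustrated with plain Scala, using `scala.math.BigDecimal` as a stand-in for Spark's internal `Decimal` class (which is not on the classpath here); its `.bigDecimal` accessor plays the role of `Decimal.toJavaBigDecimal` from the diff.

```scala
// Sketch: converting an "internal" decimal representation to the external
// java.math.BigDecimal form expected by column statistics. scala.math.BigDecimal
// stands in for Spark's internal Decimal type.
val internal: BigDecimal = BigDecimal("123.45")           // stand-in for Spark's Decimal
val external: java.math.BigDecimal = internal.bigDecimal  // the external value type
```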
[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17065#discussion_r103073122

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -258,27 +246,20 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    */
   def evaluateBinary(
       op: BinaryComparison,
-      attrRef: AttributeReference,
+      attr: Attribute,
       literal: Literal,
-      update: Boolean)
-    : Option[Double] = {
-    if (!mutableColStats.contains(attrRef.exprId)) {
-      logDebug("[CBO] No statistics for " + attrRef)
-      return None
-    }
-
-    op match {
-      case EqualTo(l, r) => evaluateEqualTo(attrRef, literal, update)
+      update: Boolean): Option[Double] = {
+    attr.dataType match {
+      case _: NumericType | DateType | TimestampType =>
+        evaluateBinaryForNumeric(op, attr, literal, update)
+      case StringType | BinaryType =>
+        // TODO: It is difficult to support other binary comparisons for String/Binary
+        // type without min/max and advanced statistics like histogram.
+        logDebug("[CBO] No range comparison statistics for String/Binary type " + attr)
+        None
       case _ =>
-        attrRef.dataType match {
-          case _: NumericType | DateType | TimestampType =>
-            evaluateBinaryForNumeric(op, attrRef, literal, update)
-          case StringType | BinaryType =>
--- End diff --

Previously we totally missed `BooleanType` and would throw a `MatchError` if the attribute is boolean. But the logic in `evaluateBinaryForNumeric` doesn't work for boolean, so I treat it as unsupported for now. @wzhfy do you have time to work on it?
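Why boolean needs its own path can be sketched with a minimal example (illustrative, not Spark's API): under a uniform-distribution assumption, equality selectivity follows directly from the distinct count, while the range arithmetic in `evaluateBinaryForNumeric` has no meaningful analogue for `<`/`>` on a two-valued domain.

```scala
// Illustrative sketch: equality selectivity under a uniform-distribution
// assumption. For a boolean column, ndv is at most 2, so this is the natural
// estimate; range-based interpolation over [min, max] does not apply.
def equalitySelectivity(ndv: Long): Option[Double] =
  if (ndv <= 0) None        // no statistics, or empty column: unsupported
  else Some(1.0 / ndv)      // each distinct value assumed equally likely
```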
[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17065#discussion_r103073081

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -140,56 +129,56 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    * @param condition a single logical expression
    * @param update a boolean flag to specify if we need to update ColumnStat of a column
    *               for subsequent conditions
-   * @return Option[Double] value to show the percentage of rows meeting a given condition.
+   * @return an optional double value to show the percentage of rows meeting a given condition.
    *         It returns None if the condition is not supported.
    */
   def calculateSingleCondition(condition: Expression, update: Boolean): Option[Double] = {
     condition match {
       // For evaluateBinary method, we assume the literal on the right side of an operator.
       // So we will change the order if not.
-      // EqualTo does not care about the order
-      case op @ EqualTo(ar: AttributeReference, l: Literal) =>
-        evaluateBinary(op, ar, l, update)
-      case op @ EqualTo(l: Literal, ar: AttributeReference) =>
-        evaluateBinary(op, ar, l, update)
+      // EqualTo/EqualNullSafe does not care about the order
--- End diff --

also support `EqualNullSafe`
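The difference between the two equality operators, and why estimation can treat them alike when one operand is a non-null literal, can be sketched with `Option` standing in for nullable SQL values. The function names are illustrative, not Catalyst's:

```scala
// EqualTo (=) follows SQL three-valued logic: a null on either side yields
// unknown (None), and the row is filtered out.
def equalTo(a: Option[Int], b: Option[Int]): Option[Boolean] =
  for (x <- a; y <- b) yield x == y

// EqualNullSafe (<=>) is two-valued: null <=> null is true, null <=> v is false.
def equalNullSafe(a: Option[Int], b: Option[Int]): Boolean = a == b
```

For a non-null literal operand the two operators select exactly the same rows, which is why `calculateSingleCondition` can route both through the same `evaluateBinary` path.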
[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17065#discussion_r103073076

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -95,15 +84,16 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    * @param condition the compound logical expression
    * @param update a boolean flag to specify if we need to update ColumnStat of a column
    *               for subsequent conditions
-   * @return a double value to show the percentage of rows meeting a given condition.
+   * @return an optional double value to show the percentage of rows meeting a given condition.
    *         It returns None if the condition is not supported.
    */
   def calculateFilterSelectivity(condition: Expression, update: Boolean = true): Option[Double] = {
-
     condition match {
       case And(cond1, cond2) =>
-        (calculateFilterSelectivity(cond1, update), calculateFilterSelectivity(cond2, update))
-        match {
+        // For ease of debugging, we compute percent1 and percent2 in 2 statements.
+        val percent1 = calculateFilterSelectivity(cond1, update)
+        val percent2 = calculateFilterSelectivity(cond2, update)
+        (percent1, percent2) match {
           case (Some(p1), Some(p2)) => Some(p1 * p2)
           case (Some(p1), None) => Some(p1)
--- End diff --

Here we are actually over-estimating: if a condition in an `And` is unsupported, we assume it has 100% selectivity, which may lead to under-estimation if this `And` is wrapped in a `Not`. We should:
1. if one condition is unsupported, treat the whole `And` as unsupported;
2. not handle nested `Not`.

cc @wzhfy @ron8hu
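The over/under-estimation interaction discussed in this thread can be sketched in plain Scala (illustrative names, not Spark's actual `FilterEstimation` code): treating an unsupported conjunct as 100% selective keeps `And` an over-estimate, but negation flips any over-estimate of the inner condition into an under-estimate.

```scala
// Illustrative sketch of the selectivity combination rules; not Spark's code.
object SelectivitySketch {
  // Combine conjunct selectivities. An unsupported side (None) is assumed
  // 100% selective, so the result can only over-estimate the true fraction.
  def andSelectivity(p1: Option[Double], p2: Option[Double]): Option[Double] =
    (p1, p2) match {
      case (Some(a), Some(b)) => Some(a * b)
      case (Some(a), None)    => Some(a)
      case (None, Some(b))    => Some(b)
      case (None, None)       => None
    }

  // Not(c) is estimated as 1 - selectivity(c): an over-estimate of the inner
  // condition therefore becomes an under-estimate of the negation.
  def notSelectivity(inner: Option[Double]): Option[Double] = inner.map(1.0 - _)
}
```

For example, if the true selectivity of the unsupported conjunct were 0.1, the true selectivity of `And(0.5, unsupported)` is 0.05, and the sketch over-estimates it as 0.5; wrapped in `Not`, the estimate 0.5 now under-estimates the true value 0.95, which is the problem raised in the comment above.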
[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...
GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/17065

    [SPARK-17075][SQL][followup] fix some minor issues and clean up the code

## What changes were proposed in this pull request?

This fixes some code style issues, naming issues, some missing cases in pattern match, etc.

## How was this patch tested?

existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark follow-up

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17065.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17065

commit 04cc6811445790a636a534c42e2caf053a238eb8
Author: Wenchen Fan
Date: 2017-02-25T01:06:37Z

    fix some minor issues and clean up the code