[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-03-02 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/17065#discussion_r104098256
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -95,15 +84,16 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    * @param condition the compound logical expression
    * @param update a boolean flag to specify if we need to update ColumnStat of a column
    *               for subsequent conditions
-   * @return a double value to show the percentage of rows meeting a given condition.
+   * @return an optional double value to show the percentage of rows meeting a given condition.
    *         It returns None if the condition is not supported.
    */
   def calculateFilterSelectivity(condition: Expression, update: Boolean = true): Option[Double] = {
-
     condition match {
       case And(cond1, cond2) =>
-        (calculateFilterSelectivity(cond1, update), calculateFilterSelectivity(cond2, update))
-        match {
+        // For ease of debugging, we compute percent1 and percent2 in 2 statements.
+        val percent1 = calculateFilterSelectivity(cond1, update)
+        val percent2 = calculateFilterSelectivity(cond2, update)
+        (percent1, percent2) match {
           case (Some(p1), Some(p2)) => Some(p1 * p2)
           case (Some(p1), None) => Some(p1)
--- End diff --

@cloud-fan @ron8hu I'm a little confused about this: for a `Not` expression, over-estimating the child always turns into an under-estimation of the `Not`, whether it is nested or not. So should we remove support only for `nested Not`, or for `Not` entirely?
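
A rough illustration of the concern, with made-up selectivities (not taken from the PR):

```scala
// Suppose condition c has true selectivity 0.2, but an unsupported
// sub-condition makes us over-estimate it as 0.5. Negation flips the
// direction of the error, nested or not:
val trueSel      = 0.2
val estimatedSel = 0.5                    // over-estimate of c
val trueNotSel      = 1 - trueSel         // 0.8
val estimatedNotSel = 1 - estimatedSel    // 0.5 -- now an under-estimate
```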





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17065





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17065#discussion_r103097107
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -95,15 +84,16 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    * @param condition the compound logical expression
    * @param update a boolean flag to specify if we need to update ColumnStat of a column
    *               for subsequent conditions
-   * @return a double value to show the percentage of rows meeting a given condition.
+   * @return an optional double value to show the percentage of rows meeting a given condition.
    *         It returns None if the condition is not supported.
    */
   def calculateFilterSelectivity(condition: Expression, update: Boolean = true): Option[Double] = {
-
     condition match {
       case And(cond1, cond2) =>
-        (calculateFilterSelectivity(cond1, update), calculateFilterSelectivity(cond2, update))
-        match {
+        // For ease of debugging, we compute percent1 and percent2 in 2 statements.
+        val percent1 = calculateFilterSelectivity(cond1, update)
+        val percent2 = calculateFilterSelectivity(cond2, update)
+        (percent1, percent2) match {
           case (Some(p1), Some(p2)) => Some(p1 * p2)
           case (Some(p1), None) => Some(p1)
--- End diff --

SGTM





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-25 Thread ron8hu
Github user ron8hu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17065#discussion_r103090517
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -361,57 +343,52 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    */

   def evaluateInSet(
-      attrRef: AttributeReference,
+      attr: Attribute,
       hSet: Set[Any],
-      update: Boolean)
-    : Option[Double] = {
-    if (!mutableColStats.contains(attrRef.exprId)) {
-      logDebug("[CBO] No statistics for " + attrRef)
+      update: Boolean): Option[Double] = {
+    if (!colStatsMap.contains(attr)) {
+      logDebug("[CBO] No statistics for " + attr)
       return None
     }

-    val aColStat = mutableColStats(attrRef.exprId)
-    val ndv = aColStat.distinctCount
-    val aType = attrRef.dataType
-    var newNdv: Long = 0
+    val colStat = colStatsMap(attr)
+    val ndv = colStat.distinctCount
+    val dataType = attr.dataType
+    var newNdv = ndv

     // use [min, max] to filter the original hSet
-    aType match {
-      case _: NumericType | DateType | TimestampType =>
-        val statsRange =
-          Range(aColStat.min, aColStat.max, aType).asInstanceOf[NumericRange]
-
-        // To facilitate finding the min and max values in hSet, we map hSet values to BigDecimal.
-        // Using hSetBigdec, we can find the min and max values quickly in the ordered hSetBigdec.
-        val hSetBigdec = hSet.map(e => BigDecimal(e.toString))
-        val validQuerySet = hSetBigdec.filter(e => e >= statsRange.min && e <= statsRange.max)
-        // We use hSetBigdecToAnyMap to help us find the original hSet value.
-        val hSetBigdecToAnyMap: Map[BigDecimal, Any] =
-          hSet.map(e => BigDecimal(e.toString) -> e).toMap
+    dataType match {
+      case _: NumericType | BooleanType | DateType | TimestampType =>
+        val statsRange = Range(colStat.min, colStat.max, dataType).asInstanceOf[NumericRange]
+        val validQuerySet = hSet.filter { v =>
+          v != null && statsRange.contains(Literal(v, dataType))
--- End diff --

nit: for better readability: `(v != null) && statsRange.contains(Literal(v, dataType))`





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-25 Thread ron8hu
Github user ron8hu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17065#discussion_r103087483
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -95,15 +84,16 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    * @param condition the compound logical expression
    * @param update a boolean flag to specify if we need to update ColumnStat of a column
    *               for subsequent conditions
-   * @return a double value to show the percentage of rows meeting a given condition.
+   * @return an optional double value to show the percentage of rows meeting a given condition.
    *         It returns None if the condition is not supported.
    */
   def calculateFilterSelectivity(condition: Expression, update: Boolean = true): Option[Double] = {
-
     condition match {
       case And(cond1, cond2) =>
-        (calculateFilterSelectivity(cond1, update), calculateFilterSelectivity(cond2, update))
-        match {
+        // For ease of debugging, we compute percent1 and percent2 in 2 statements.
+        val percent1 = calculateFilterSelectivity(cond1, update)
+        val percent2 = calculateFilterSelectivity(cond2, update)
+        (percent1, percent2) match {
           case (Some(p1), Some(p2)) => Some(p1 * p2)
           case (Some(p1), None) => Some(p1)
--- End diff --

This shows that it is difficult to always over-estimate. How about we do not handle nested `NOT`?





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-25 Thread ron8hu
Github user ron8hu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17065#discussion_r103087345
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -297,6 +278,8 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
         Some(DateTimeUtils.toJavaDate(litValue.toString.toInt))
       case TimestampType =>
         Some(DateTimeUtils.toJavaTimestamp(litValue.toString.toLong))
+      case _: DecimalType =>
+        Some(litValue.asInstanceOf[Decimal].toJavaBigDecimal)
--- End diff --

Agreed.  Thanks for fixing it.





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17065#discussion_r103073176
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -361,57 +343,52 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    */

   def evaluateInSet(
-      attrRef: AttributeReference,
+      attr: Attribute,
       hSet: Set[Any],
-      update: Boolean)
-    : Option[Double] = {
-    if (!mutableColStats.contains(attrRef.exprId)) {
-      logDebug("[CBO] No statistics for " + attrRef)
+      update: Boolean): Option[Double] = {
+    if (!colStatsMap.contains(attr)) {
+      logDebug("[CBO] No statistics for " + attr)
       return None
     }

-    val aColStat = mutableColStats(attrRef.exprId)
-    val ndv = aColStat.distinctCount
-    val aType = attrRef.dataType
-    var newNdv: Long = 0
+    val colStat = colStatsMap(attr)
+    val ndv = colStat.distinctCount
+    val dataType = attr.dataType
+    var newNdv = ndv

     // use [min, max] to filter the original hSet
-    aType match {
-      case _: NumericType | DateType | TimestampType =>
-        val statsRange =
-          Range(aColStat.min, aColStat.max, aType).asInstanceOf[NumericRange]
-
-        // To facilitate finding the min and max values in hSet, we map hSet values to BigDecimal.
-        // Using hSetBigdec, we can find the min and max values quickly in the ordered hSetBigdec.
-        val hSetBigdec = hSet.map(e => BigDecimal(e.toString))
-        val validQuerySet = hSetBigdec.filter(e => e >= statsRange.min && e <= statsRange.max)
-        // We use hSetBigdecToAnyMap to help us find the original hSet value.
-        val hSetBigdecToAnyMap: Map[BigDecimal, Any] =
-          hSet.map(e => BigDecimal(e.toString) -> e).toMap
+    dataType match {
+      case _: NumericType | BooleanType | DateType | TimestampType =>
--- End diff --

add boolean type
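
A rough sketch (hypothetical column statistics, not the PR's code) of why booleans fit the same [min, max] pruning as numeric types:

```scala
// With assumed stats min = false, max = true, both boolean literals
// survive the range check; if the stats were min = max = true, the
// value `false` would be pruned from the IN-list.
val (min, max) = (false, true)
val hSet: Set[Any] = Set(true, false)
val ord = Ordering[Boolean]
val validQuerySet = hSet.filter {
  case b: Boolean => ord.gteq(b, min) && ord.lteq(b, max)
  case _ => false // nulls and other values never match IN
}
// validQuerySet == Set(true, false)
```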





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17065#discussion_r103073163
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -297,6 +278,8 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
         Some(DateTimeUtils.toJavaDate(litValue.toString.toInt))
       case TimestampType =>
         Some(DateTimeUtils.toJavaTimestamp(litValue.toString.toLong))
+      case _: DecimalType =>
+        Some(litValue.asInstanceOf[Decimal].toJavaBigDecimal)
--- End diff --

@ron8hu the external value type of `DecimalType` is `java.math.BigDecimal`, while the internal value type is `Decimal`, so we need to convert it.
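
A minimal sketch of the conversion, assuming `spark-catalyst` is on the classpath:

```scala
import org.apache.spark.sql.types.Decimal

// DecimalType values are held internally as Catalyst's Decimal; the
// external, user-facing representation is java.math.BigDecimal, so the
// literal must be converted before it is exposed in column statistics.
val internal: Decimal = Decimal("123.45")
val external: java.math.BigDecimal = internal.toJavaBigDecimal
```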





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17065#discussion_r103073122
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -258,27 +246,20 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    */
   def evaluateBinary(
       op: BinaryComparison,
-      attrRef: AttributeReference,
+      attr: Attribute,
       literal: Literal,
-      update: Boolean)
-    : Option[Double] = {
-    if (!mutableColStats.contains(attrRef.exprId)) {
-      logDebug("[CBO] No statistics for " + attrRef)
-      return None
-    }
-
-    op match {
-      case EqualTo(l, r) => evaluateEqualTo(attrRef, literal, update)
+      update: Boolean): Option[Double] = {
+    attr.dataType match {
+      case _: NumericType | DateType | TimestampType =>
+        evaluateBinaryForNumeric(op, attr, literal, update)
+      case StringType | BinaryType =>
+        // TODO: It is difficult to support other binary comparisons for String/Binary
+        // type without min/max and advanced statistics like histogram.
+        logDebug("[CBO] No range comparison statistics for String/Binary type " + attr)
+        None
       case _ =>
-        attrRef.dataType match {
-          case _: NumericType | DateType | TimestampType =>
-            evaluateBinaryForNumeric(op, attrRef, literal, update)
-          case StringType | BinaryType =>
--- End diff --

previously we totally missed `BooleanType` and would throw a `MatchError` if the attribute is boolean. But the logic in `evaluateBinaryForNumeric` doesn't work for booleans, so I treat the type as unsupported for now. @wzhfy do you have time to work on it?
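
A hypothetical helper (not in the PR) illustrating the dispatch: `BooleanType` now falls into an explicit case instead of triggering a `MatchError`, but it still cannot be range-estimated:

```scala
import org.apache.spark.sql.types._

// Whether binary comparisons (<, <=, >, >=) can be range-estimated for
// an attribute's type; boolean and string/binary have no usable range
// statistics here, so they are treated as unsupported rather than erroring.
def supportsRangeComparison(dataType: DataType): Boolean = dataType match {
  case _: NumericType | DateType | TimestampType => true
  case BooleanType | StringType | BinaryType => false
  case _ => false
}
```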





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17065#discussion_r103073081
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -140,56 +129,56 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    * @param condition a single logical expression
    * @param update a boolean flag to specify if we need to update ColumnStat of a column
    *               for subsequent conditions
-   * @return Option[Double] value to show the percentage of rows meeting a given condition.
+   * @return an optional double value to show the percentage of rows meeting a given condition.
    *         It returns None if the condition is not supported.
    */
   def calculateSingleCondition(condition: Expression, update: Boolean): Option[Double] = {
     condition match {
       // For evaluateBinary method, we assume the literal on the right side of an operator.
       // So we will change the order if not.

-      // EqualTo does not care about the order
-      case op @ EqualTo(ar: AttributeReference, l: Literal) =>
-        evaluateBinary(op, ar, l, update)
-      case op @ EqualTo(l: Literal, ar: AttributeReference) =>
-        evaluateBinary(op, ar, l, update)
+      // EqualTo/EqualNullSafe does not care about the order
--- End diff --

also support `EqualNullSafe`
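
A minimal sketch of the order-normalization idea, using a hypothetical `normalize` helper rather than the PR's actual method:

```scala
import org.apache.spark.sql.catalyst.expressions._

// Both operand orders map to (attribute, literal), for EqualTo and the
// newly supported EqualNullSafe alike; unsupported shapes yield None.
def normalize(cond: Expression): Option[(AttributeReference, Literal)] = cond match {
  case EqualTo(ar: AttributeReference, l: Literal) => Some((ar, l))
  case EqualTo(l: Literal, ar: AttributeReference) => Some((ar, l))
  case EqualNullSafe(ar: AttributeReference, l: Literal) => Some((ar, l))
  case EqualNullSafe(l: Literal, ar: AttributeReference) => Some((ar, l))
  case _ => None
}
```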





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17065#discussion_r103073076
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -95,15 +84,16 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
    * @param condition the compound logical expression
    * @param update a boolean flag to specify if we need to update ColumnStat of a column
    *               for subsequent conditions
-   * @return a double value to show the percentage of rows meeting a given condition.
+   * @return an optional double value to show the percentage of rows meeting a given condition.
    *         It returns None if the condition is not supported.
    */
   def calculateFilterSelectivity(condition: Expression, update: Boolean = true): Option[Double] = {
-
     condition match {
       case And(cond1, cond2) =>
-        (calculateFilterSelectivity(cond1, update), calculateFilterSelectivity(cond2, update))
-        match {
+        // For ease of debugging, we compute percent1 and percent2 in 2 statements.
+        val percent1 = calculateFilterSelectivity(cond1, update)
+        val percent2 = calculateFilterSelectivity(cond2, update)
+        (percent1, percent2) match {
          case (Some(p1), Some(p2)) => Some(p1 * p2)
          case (Some(p1), None) => Some(p1)
--- End diff --

here we are actually over-estimating: if a condition inside `And` is unsupported, we assume 100% selectivity for it, which may lead to under-estimation if this `And` is wrapped by `Not`.

We should

1. treat the whole `And` as unsupported if one of its conditions is unsupported
2. not handle nested `Not`

cc @wzhfy @ron8hu 
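
A minimal sketch of the proposed rule (assumed shape, not the merged code):

```scala
// If either side of an And is unsupported (None), the whole And becomes
// unsupported, so a surrounding Not can no longer turn the silent 100%
// selectivity assumption into an under-estimate.
def andSelectivity(p1: Option[Double], p2: Option[Double]): Option[Double] =
  (p1, p2) match {
    case (Some(a), Some(b)) => Some(a * b)
    case _ => None
  }
```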





[GitHub] spark pull request #17065: [SPARK-17075][SQL][followup] fix some minor issue...

2017-02-24 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/17065

[SPARK-17075][SQL][followup] fix some minor issues and clean up the code

## What changes were proposed in this pull request?

This fixes some code style issues, naming issues, some missing cases in pattern matching, etc.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark follow-up

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17065.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17065


commit 04cc6811445790a636a534c42e2caf053a238eb8
Author: Wenchen Fan 
Date:   2017-02-25T01:06:37Z

fix some minor issues and clean up the code



