[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-12-03 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/23171
  
@rxin `switch` in Java is still significantly faster than a hash set, even without boxing/unboxing problems, when the number of elements is small. We were thinking about having two implementations in `InSet`, picking the `switch` one when the number of elements is small and the hash-set one otherwise. But that is the same complexity as having two implementations in `In`, as in this PR.

@cloud-fan do you suggest creating an `OptimizeIn` that chooses between the `switch` and hash-set implementations based on the number of elements, and removing `InSet`? That is basically what we were thinking above.
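For illustration, a minimal sketch of that dispatch idea (the `OptimizeInSketch` object, the cutoff value, and the lookup helpers are all hypothetical, not actual Spark code):

```scala
// Hypothetical sketch: choose the membership-test strategy by list size.
object OptimizeInSketch {
  private val smallThreshold = 10 // made-up cutoff, for illustration only

  def buildLookup(values: Array[Int]): Int => Boolean = {
    if (values.length <= smallThreshold) {
      // stand-in for the generated `switch`: a handful of direct comparisons
      v => values.contains(v)
    } else {
      // for longer lists a (ideally non-boxing) hash set wins
      val set = values.toSet
      v => set.contains(v)
    }
  }
}
```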




---




[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-29 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/23171
  
@cloud-fan as @aokolnychyi said, `switch` will still be faster than an optimized `Set` without autoboxing when the number of elements is small. As a result, this PR is still very useful.

@mgaido91 `InSet` can be better when implemented properly without autoboxing for large numbers of elements, which is controlled by `spark.sql.optimizer.inSetConversionThreshold`. Also, generating `In` with huge lists can cause a compile exception due to the method size limit, as you pointed out. As a result, we should convert it into `InSet` for large sets.
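For reference, the threshold mentioned above is a runtime SQL config, so its effect is easy to observe (a small sketch assuming an active `SparkSession` named `spark`; the value 10 is just for illustration):

```scala
// Short IN lists stay as `In`; lists longer than the threshold are planned
// as `InSet`. The threshold can be tuned per session.
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", 10)

val df = spark.range(100).toDF("id")
df.filter("id IN (1, 2, 3)").explain()                       // short list
df.filter(s"id IN (${(1 to 50).mkString(",")})").explain()   // long list -> InSet
```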


---




[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

2018-11-28 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/23100
  
I went through the PR again, and it looks right to me. Merged into master. 
Thanks!


---




[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

2018-11-28 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/23100
  
It's hard to track the huge diff from the renaming, so I didn't go through it line-by-line. But if those changes are just renames, the rest LGTM.


---




[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints

2018-11-28 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/23171
  
The approach looks great and can significantly improve performance. For `Long`, I agree that we should also implement a binary-search approach for `O(log n)` lookup.

I am wondering which one will be faster: binary search over arrays, or rewriting the `if-else` chain in binary-search form.
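For the array variant, a minimal sketch of the lookup side (not the codegen) could be as simple as sorting the literals once and probing with `java.util.Arrays.binarySearch`:

```scala
import java.util.Arrays

// Hypothetical O(log n) membership test over Long literals.
final class SortedLongLookup(values: Array[Long]) {
  private val sorted: Array[Long] = values.clone()
  Arrays.sort(sorted)

  def contains(v: Long): Boolean = Arrays.binarySearch(sorted, v) >= 0
}

// new SortedLongLookup(Array(7L, 1L, 3L)).contains(3L)  // true
```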


---




[GitHub] spark pull request #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts,...

2018-11-28 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/23171#discussion_r237227892
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -335,6 +343,41 @@ case class In(value: Expression, list: 
Seq[Expression]) extends Predicate {
""".stripMargin)
   }
 
+  private def genCodeWithSwitch(ctx: CodegenContext, ev: ExprCode): 
ExprCode = {
+val (nullLiterals, nonNullLiterals) = list.partition {
+  case Literal(null, _) => true
+  case _ => false
+}
--- End diff --

If there is a null in the list, there will be only one. As a result, we may not need a full `nullLiterals` sequence.

```scala
val containNullInList = ...
val nonNullLiterals = ... 
```
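My guess at how that sketch could be completed (not the author's code; it just reuses `list` and `Literal` from the surrounding diff):

```scala
// A single flag for the (at most one) null literal, plus the non-null literals.
val containNullInList: Boolean = list.exists {
  case Literal(null, _) => true
  case _ => false
}
val nonNullLiterals: Seq[Expression] = list.filterNot {
  case Literal(null, _) => true
  case _ => false
}
```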


---




[GitHub] spark pull request #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts,...

2018-11-28 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/23171#discussion_r237226275
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
 ---
@@ -335,6 +343,41 @@ case class In(value: Expression, list: 
Seq[Expression]) extends Predicate {
""".stripMargin)
   }
 
+  private def genCodeWithSwitch(ctx: CodegenContext, ev: ExprCode): 
ExprCode = {
+val (nullLiterals, nonNullLiterals) = list.partition {
+  case Literal(null, _) => true
+  case _ => false
+}
+val listGen = nonNullLiterals.map(_.genCode(ctx))
+val valueGen = value.genCode(ctx)
+
+val caseBranches = listGen.map(literal =>
+  s"""
+ |case ${literal.value}:
+ |  ${ev.value} = true;
+ |  break;
+   """.stripMargin)
+
+ev.copy(code =
+  code"""
+ |${valueGen.code}
+ |${CodeGenerator.JAVA_BOOLEAN} ${ev.isNull} = ${valueGen.isNull};
+ |${CodeGenerator.JAVA_BOOLEAN} ${ev.value} = false;
+ |if (!${valueGen.isNull}) {
+ |  switch (${valueGen.value}) {
+ |${caseBranches.mkString("")}
+ |default:
+ |  ${ev.isNull} = ${nullLiterals.nonEmpty};
+ |  }
+ |}
+   """.stripMargin)
+  }
+
+  private def isSwitchCompatible: Boolean = list.forall {
+case Literal(_, dt) => dt == ByteType || dt == ShortType || dt == 
IntegerType
--- End diff --

```scala
case Literal(_, dt) if dt == ByteType || dt == ShortType || dt == IntegerType => true
```
is easier to read?
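Spelled out in full, the guard-based version would be roughly this (the trailing `case _ => false` is my addition; the rest comes from the diff above):

```scala
private def isSwitchCompatible: Boolean = list.forall {
  case Literal(_, dt) if dt == ByteType || dt == ShortType || dt == IntegerType => true
  case _ => false
}
```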


---




[GitHub] spark issue #23148: [SPARK-26177] Automated formatting for Scala code

2018-11-27 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/23148
  
Thanks for testing it out. I personally like auto-formatting, as my company's projects use scalafmt and we find it very useful for keeping a consistent coding style.


---




[GitHub] spark issue #23148: [SPARK-26177] Automated formatting for Scala code

2018-11-27 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/23148
  
My concern is this: say we have code like the following, which I copied from `ParquetSchemaPruningSuite.scala`; scalafmt will complain that the second line is longer than 98 characters and reformat it. But this should be a legitimate coding style, as many times we deliberately put such code on one line for better readability.

```scala
checkScan(query,
  "struct,address:string,pets:int," +
  "friends:array>," +
  "relatives:map>>")
```
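If we do adopt scalafmt, one way to keep deliberately hand-wrapped blocks like this untouched is its standard off/on directives (a sketch; the string content here is made up):

```scala
// format: off
checkScan(query,
  "a deliberately long, hand-aligned schema string that we want to keep" +
  " wrapped this way rather than have the formatter re-flow it")
// format: on
```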


---




[GitHub] spark pull request #23148: [SPARK-26177] Automated formatting for Scala code

2018-11-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/23148#discussion_r236830497
  
--- Diff: dev/.scalafmt.conf ---
@@ -0,0 +1,24 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+align = none
+align.openParenDefnSite = false
+align.openParenCallSite = false
+align.tokens = []
+docstrings = JavaDoc
+maxColumn = 98
--- End diff --

If we set it to `98`, will this complain about otherwise legitimate code that is 100 characters wide?


---




[GitHub] spark pull request #23148: [SPARK-26177] Automated formatting for Scala code

2018-11-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/23148#discussion_r236813956
  
--- Diff: dev/.scalafmt.conf ---
@@ -0,0 +1,24 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+align = none
+align.openParenDefnSite = false
+align.openParenCallSite = false
+align.tokens = []
+docstrings = JavaDoc
+maxColumn = 98
--- End diff --

Are we using 100 for maxColumn?


---




[GitHub] spark issue #23139: [SPARK-25860][SPARK-26107] [FOLLOW-UP] Rule ReplaceNullW...

2018-11-26 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/23139
  
Thanks. Merged into master.


---




[GitHub] spark issue #23139: [SPARK-25860][SPARK-26107] [FOLLOW-UP] Rule ReplaceNullW...

2018-11-26 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/23139
  
LGTM.


---




[GitHub] spark pull request #23139: [SPARK-25860][SPARK-26107] [FOLLOW-UP] Rule Repla...

2018-11-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/23139#discussion_r236463594
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceNullWithFalseInPredicate.scala
 ---
@@ -79,29 +80,31 @@ object ReplaceNullWithFalseInPredicate extends 
Rule[LogicalPlan] {
* an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or
* `Literal(null, BooleanType)`.
*/
-  private def replaceNullWithFalse(e: Expression): Expression = {
-if (e.dataType != BooleanType) {
+  private def replaceNullWithFalse(e: Expression): Expression = e match {
+case Literal(null, BooleanType) =>
+  FalseLiteral
+case And(left, right) =>
+  And(replaceNullWithFalse(left), replaceNullWithFalse(right))
+case Or(left, right) =>
+  Or(replaceNullWithFalse(left), replaceNullWithFalse(right))
+case cw: CaseWhen if cw.dataType == BooleanType =>
+  val newBranches = cw.branches.map { case (cond, value) =>
+replaceNullWithFalse(cond) -> replaceNullWithFalse(value)
+  }
+  val newElseValue = cw.elseValue.map(replaceNullWithFalse)
+  CaseWhen(newBranches, newElseValue)
+case i @ If(pred, trueVal, falseVal) if i.dataType == BooleanType =>
+  If(replaceNullWithFalse(pred), replaceNullWithFalse(trueVal), 
replaceNullWithFalse(falseVal))
+case e if e.dataType == BooleanType =>
   e
-} else {
-  e match {
-case Literal(null, BooleanType) =>
-  FalseLiteral
-case And(left, right) =>
-  And(replaceNullWithFalse(left), replaceNullWithFalse(right))
-case Or(left, right) =>
-  Or(replaceNullWithFalse(left), replaceNullWithFalse(right))
-case cw: CaseWhen =>
-  val newBranches = cw.branches.map { case (cond, value) =>
-replaceNullWithFalse(cond) -> replaceNullWithFalse(value)
-  }
-  val newElseValue = cw.elseValue.map(replaceNullWithFalse)
-  CaseWhen(newBranches, newElseValue)
-case If(pred, trueVal, falseVal) =>
-  If(replaceNullWithFalse(pred),
-replaceNullWithFalse(trueVal),
-replaceNullWithFalse(falseVal))
-case _ => e
+case e =>
+  val message = "Expected a Boolean type expression in 
replaceNullWithFalse, " +
+s"but got the type `${e.dataType.catalogString}` in `${e.sql}`."
+  if (Utils.isTesting) {
+throw new IllegalArgumentException(message)
--- End diff --

Sounds fair.


---




[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

2018-11-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/23100#discussion_r236411677
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
 
 package org.apache.spark.ml.feature
 
+import org.apache.hadoop.fs.Path
--- End diff --

Or we can file two PRs: one for removing the old `OneHotEncoder`, and the other for the renaming.


---




[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

2018-11-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/23100#discussion_r236410750
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
 
 package org.apache.spark.ml.feature
 
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
 import org.apache.spark.annotation.Since
-import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.{Estimator, Model}
 import org.apache.spark.ml.attribute._
 import org.apache.spark.ml.linalg.Vectors
 import org.apache.spark.ml.param._
-import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, 
HasOutputCols}
 import org.apache.spark.ml.util._
 import org.apache.spark.sql.{DataFrame, Dataset}
-import org.apache.spark.sql.functions.{col, udf}
-import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+/** Private trait for params and common methods for OneHotEncoder and 
OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data during transform().
+   * Options are 'keep' (invalid data presented as an extra categorical 
feature) or
+   * 'error' (throw an error).
+   * Note that this Param is only used during transform; during fitting, 
invalid data
+   * will result in an error.
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
--- End diff --

As we discussed previously, it's a new class. Should we mark it as `3.0`?


---




[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

2018-11-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/23100#discussion_r236410306
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
 
 package org.apache.spark.ml.feature
 
+import org.apache.hadoop.fs.Path
--- End diff --

I guess once the commits in the history are squashed into one, it will still look like this, without any better history.


---




[GitHub] spark pull request #23139: [SPARK-25860][SPARK-26107] [FOLLOW-UP] Rule Repla...

2018-11-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/23139#discussion_r236394865
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceNullWithFalseInPredicate.scala
 ---
@@ -79,29 +80,31 @@ object ReplaceNullWithFalseInPredicate extends 
Rule[LogicalPlan] {
* an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or
* `Literal(null, BooleanType)`.
*/
-  private def replaceNullWithFalse(e: Expression): Expression = {
-if (e.dataType != BooleanType) {
+  private def replaceNullWithFalse(e: Expression): Expression = e match {
+case Literal(null, BooleanType) =>
+  FalseLiteral
+case And(left, right) =>
+  And(replaceNullWithFalse(left), replaceNullWithFalse(right))
+case Or(left, right) =>
+  Or(replaceNullWithFalse(left), replaceNullWithFalse(right))
+case cw: CaseWhen if cw.dataType == BooleanType =>
+  val newBranches = cw.branches.map { case (cond, value) =>
+replaceNullWithFalse(cond) -> replaceNullWithFalse(value)
+  }
+  val newElseValue = cw.elseValue.map(replaceNullWithFalse)
+  CaseWhen(newBranches, newElseValue)
+case i @ If(pred, trueVal, falseVal) if i.dataType == BooleanType =>
+  If(replaceNullWithFalse(pred), replaceNullWithFalse(trueVal), 
replaceNullWithFalse(falseVal))
+case e if e.dataType == BooleanType =>
   e
-} else {
-  e match {
-case Literal(null, BooleanType) =>
-  FalseLiteral
-case And(left, right) =>
-  And(replaceNullWithFalse(left), replaceNullWithFalse(right))
-case Or(left, right) =>
-  Or(replaceNullWithFalse(left), replaceNullWithFalse(right))
-case cw: CaseWhen =>
-  val newBranches = cw.branches.map { case (cond, value) =>
-replaceNullWithFalse(cond) -> replaceNullWithFalse(value)
-  }
-  val newElseValue = cw.elseValue.map(replaceNullWithFalse)
-  CaseWhen(newBranches, newElseValue)
-case If(pred, trueVal, falseVal) =>
-  If(replaceNullWithFalse(pred),
-replaceNullWithFalse(trueVal),
-replaceNullWithFalse(falseVal))
-case _ => e
+case e =>
+  val message = "Expected a Boolean type expression in 
replaceNullWithFalse, " +
+s"but got the type `${e.dataType.catalogString}` in `${e.sql}`."
+  if (Utils.isTesting) {
+throw new IllegalArgumentException(message)
--- End diff --

Can we add a test for this? And why not also throw the exception at runtime, since this branch should never be hit?


---




[GitHub] spark issue #23139: [SPARK-25860][SPARK-26107] [FOLLOW-UP] Rule ReplaceNullW...

2018-11-26 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/23139
  
Although we try to ensure on the caller side that `replaceNullWithFalse` is only called when the expression is of boolean type, I agree that, for safety, we should check it and throw an exception to guard future development.

LGTM.


---




[GitHub] spark issue #23118: [SPARK-26144][BUILD] `build/mvn` should detect `scala.ve...

2018-11-26 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/23118
  
Late to the party! Thanks @dongjoon-hyun for taking care of this.


---




[GitHub] spark issue #22967: [SPARK-25956] Make Scala 2.12 as default Scala version i...

2018-11-14 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22967
  
Retest this please.


---




[GitHub] spark issue #22967: [SPARK-25956] Make Scala 2.12 as default Scala version i...

2018-11-13 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22967
  
@dongjoon-hyun thanks for triggering the build. The Python test script was only looking for Scala 2.11 jars, resulting in Python test failures. I just fixed it in the latest push. Let's see how it goes.


---




[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...

2018-11-13 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22967#discussion_r22593
  
--- Diff: pom.xml ---
@@ -2718,7 +2710,6 @@
 
   
 *:*_2.11
-*:*_2.10
--- End diff --

Thanks for the suggestion, and I agree this will make the default Scala 2.12 profile cleaner.


---




[GitHub] spark issue #22977: [SPARK-26030][BUILD] Bump previousSparkVersion in MimaBu...

2018-11-12 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22977
  
LGTM. Thanks!


---




[GitHub] spark issue #22764: [SPARK-25765][ML] Add training cost to BisectingKMeans s...

2018-11-11 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22764
  
@mgaido91 I'm on Thanksgiving vacation and will be back in the community to help with code review on Nov 21st. Sorry for the delay.


---




[GitHub] spark issue #22967: [SPARK-25956] Make Scala 2.12 as default Scala version i...

2018-11-11 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22967
  
Waiting for https://github.com/apache/spark/pull/22977 to be merged; then I'll rebase on it and fix the remaining binary incompatibilities.


---




[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...

2018-11-11 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22967#discussion_r232510439
  
--- Diff: pom.xml ---
@@ -2717,7 +2717,6 @@
   
 
   
-*:*_2.11
 *:*_2.10
--- End diff --

Makes sense. I made the parent rule exclude 2.10 and moved the exclusion of 2.11 to the 2.12 profile. Thanks.


---




[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...

2018-11-11 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22967#discussion_r232505402
  
--- Diff: docs/sparkr.md ---
@@ -133,7 +133,7 @@ specifying `--packages` with `spark-submit` or `sparkR` 
commands, or if initiali
 
 
 {% highlight r %}
-sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
+sparkR.session()
--- End diff --

I just did a text search and replace, and I didn't read the context of having `sparkPackages = "com.databricks:spark-avro_2.11:3.0.0"` here. My bad.

Although Avro is now part of the Spark codebase, it's in an external package that is not on the classpath by default. How about I change it to `sparkPackages = "org.apache.spark:spark-avro_2.12:3.0.0"` here?
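For context, once the external module is on the classpath the usage would be something like this (a sketch; the paths and the exact artifact version are made up):

```scala
// e.g. started with: --packages org.apache.spark:spark-avro_2.12:<version>
val events = spark.read.format("avro").load("/tmp/events.avro")
events.write.format("avro").save("/tmp/events_copy.avro")
```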


---




[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...

2018-11-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22967#discussion_r232024914
  
--- Diff: docs/sparkr.md ---
@@ -133,7 +133,7 @@ specifying `--packages` with `spark-submit` or `sparkR` 
commands, or if initiali
 
 
 {% highlight r %}
-sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
+sparkR.session()
--- End diff --

I thought `com.databricks:spark-avro_2.12` was deprecated and no longer exists.


---




[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...

2018-11-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22967#discussion_r232024557
  
--- Diff: pom.xml ---
@@ -1998,7 +1998,7 @@
   -->
   org.jboss.netty
   org.codehaus.groovy
-  *:*_2.10
+  *:*_2.11
 
--- End diff --

@srowen Can you take a look and see whether this looks right now? Thanks!


---




[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...

2018-11-07 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22967#discussion_r231781938
  
--- Diff: docs/sparkr.md ---
@@ -133,7 +133,7 @@ specifying `--packages` with `spark-submit` or `sparkR` 
commands, or if initiali
 
 
 {% highlight r %}
-sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
+sparkR.session()
--- End diff --

Got it.

BTW, are you familiar with MiMa? I still cannot figure out why it's failing.


---




[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...

2018-11-07 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22967#discussion_r231781635
  
--- Diff: docs/sparkr.md ---
@@ -133,7 +133,7 @@ specifying `--packages` with `spark-submit` or `sparkR` 
commands, or if initiali
 
 
 {% highlight r %}
-sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
+sparkR.session()
--- End diff --

I am not familiar with R. Can you elaborate? Thanks.


---




[GitHub] spark issue #22966: [PARK-25965][SQL][TEST] Add avro read benchmark

2018-11-07 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22966
  
JMH is a framework for writing benchmarks; it can generate standardized reports to be consumed by Jenkins.

Here is an example:
https://github.com/pvillega/jmh-scala-test/blob/master/src/main/scala/com/perevillega/JMHTest.scala


---




[GitHub] spark issue #22967: [SPARK-25956] Make Scala 2.12 as default Scala version i...

2018-11-07 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22967
  
retest this please


---




[GitHub] spark issue #22970: [SPARK-25676][FOLLOWUP][BUILD] Fix Scala 2.12 build erro...

2018-11-07 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22970
  
Merged into master as the compilation finished. Thanks!


---




[GitHub] spark issue #22970: [SPARK-25676][FOLLOWUP][BUILD] Fix Scala 2.12 build erro...

2018-11-07 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22970
  
LGTM. Thanks.


---




[GitHub] spark issue #22966: [PARK-25965][SQL][TEST] Add avro read benchmark

2018-11-07 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22966
  
cc @jleach4 and @aokolnychyi 

We have had great success using [jmh](http://openjdk.java.net/projects/code-tools/jmh/) for this type of benchmarking; the benchmarks can be written as unit tests. The framework handles JVM warm-up, computes latency, throughput, etc., and then generates reports that can be consumed by Jenkins. We also use Jenkins to visualize the trend of performance changes, which is very useful for finding regressions.
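As an illustration, a JMH benchmark written with the sbt-jmh plugin can be as small as this (a sketch; the class and the two measured operations are made up, and the iteration counts are arbitrary):

```scala
import java.util.concurrent.TimeUnit

import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(1)
class InLookupBenchmark {
  private val values: Array[Int] = (0 until 20).toArray
  private val set: Set[Int] = values.toSet

  @Benchmark def hashSetLookup(): Boolean = set.contains(19)

  @Benchmark def linearScanLookup(): Boolean = values.contains(19)
}
```

JMH then handles forking, warm-up, and measurement iterations, and can emit machine-readable results that a CI job can archive and plot.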





---




[GitHub] spark issue #22967: [SPARK-25956] Make Scala 2.12 as default Scala version i...

2018-11-07 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22967
  
@dongjoon-hyun Yeah, it seems https://github.com/apache/spark/commit/63ca4bbe792718029f6d6196e8a6bb11d1f20fca breaks the Scala 2.12 build.

I'll re-trigger the build once the Scala 2.12 build is fixed.

Thanks.


---




[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...

2018-11-07 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/22967

[SPARK-25956] Make Scala 2.12 as default Scala version in Spark 3.0 

## What changes were proposed in this pull request?

This PR makes Scala 2.12 Spark's default Scala version, with Scala 2.11 as the alternative version. This implies that Scala 2.12 will be used by our CI builds, including pull request builds.

We'll update Jenkins to include a new compile-only job for Scala 2.11 to ensure the code can still be compiled with Scala 2.11.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark scala2.12

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22967.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22967


commit 635e6e23c5066018fd656738c51d02df8130585e
Author: DB Tsai 
Date:   2018-11-06T22:13:11Z

make scala 2.12 as default

commit 5011dc07c6462e7f5a9974a0b9b28f937d678297
Author: DB Tsai 
Date:   2018-11-06T23:11:34Z

sbt change

commit b4b9cb95df35b754432fb74361c32f563d1661b0
Author: DB Tsai 
Date:   2018-11-07T00:02:22Z

address feedback

commit 292adb111750cfe98593f12f64ebe11067482b44
Author: DB Tsai 
Date:   2018-11-07T00:35:58Z

address feedback




---




[GitHub] spark pull request #22953: [SPARK-25946] [BUILD] Upgrade ASM to 7.x to suppo...

2018-11-05 Thread dbtsai
Github user dbtsai closed the pull request at:

https://github.com/apache/spark/pull/22953


---




[GitHub] spark issue #22953: [SPARK-25946] [BUILD] Upgrade ASM to 7.x to support JDK1...

2018-11-05 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22953
  
Thanks. Merged into master.


---




[GitHub] spark pull request #22947: [SPARK-24913][SQL] Make AssertNotNull and AssertT...

2018-11-05 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22947#discussion_r230975661
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
 ---
@@ -66,6 +66,8 @@ case class AssertTrue(child: Expression) extends 
UnaryExpression with ImplicitCa
 
   override def nullable: Boolean = true
 
+  override lazy val deterministic: Boolean = false
--- End diff --

Because of this, I'm leaning towards creating a new flag instead of making 
them non-deterministic.


---




[GitHub] spark issue #22953: [SPARK-25946] [BUILD] Upgrade ASM to 7.x to support JDK1...

2018-11-05 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22953
  
ASM6 supports Java 9 while ASM7 supports Java 9, Java 10, and Java 11.


---




[GitHub] spark issue #22953: [SPARK-25946] [BUILD] Upgrade ASM to 7.x to support JDK1...

2018-11-05 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22953
  
cc @gatorsmile @srowen @HyukjinKwon  


---




[GitHub] spark pull request #22953: [SPARK-25946] [BUILD] Upgrade ASM to 7.x to suppo...

2018-11-05 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/22953

[SPARK-25946] [BUILD] Upgrade ASM to 7.x to support JDK11

## What changes were proposed in this pull request?

Upgrade ASM to 7.x to support JDK11

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark asm7

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22953.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22953


commit 7b19dc8616670c8db4b853e7fcfd192f8f55e09a
Author: DB Tsai 
Date:   2018-11-05T23:03:10Z

upgrade asm to 7.0




---




[GitHub] spark issue #22786: [SPARK-25764][ML][EXAMPLES] Update BisectingKMeans examp...

2018-11-05 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22786
  
LGTM. Merged into master. Thanks!


---




[GitHub] spark issue #22869: [SPARK-25758][ML] Deprecate computeCost in BisectingKMea...

2018-11-05 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22869
  
LGTM too. Merged into master. Thanks.


---




[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...

2018-11-01 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22919
  
I'm also on @cloud-fan's side: we should keep it consistent with the upstream Scala shell. However, we should document it in `./bin/spark-shell --help`, so when a user complains or files a ticket, we can refer them to the doc. Thanks.


---




[GitHub] spark issue #22857: [SPARK-25860][SQL] Replace Literal(null, _) with FalseLi...

2018-10-31 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22857
  
Thanks all for reviewing! The latest change looks good to me too. Merged 
into master.


---




[GitHub] spark issue #22880: [SPARK-25407][SQL] Ensure we pass a compatible pruned sc...

2018-10-30 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22880
  
I can confirm that this fixes 
https://issues.apache.org/jira/browse/SPARK-25879

cc @cloud-fan @gatorsmile 

Thanks.


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-10-30 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/21320
  
cc @viirya 

If we select a nested field and a top-level field, schema pruning will fail. Here is a reproducible test:
```scala
  testSchemaPruning("select a single complex field and a top level field") {
val query = sql("select * from contacts")
  .select("name.middle", "address")
query.explain(true)
query.printSchema()
query.show()
checkScan(query, "struct,address:string>")
  }
```

and the exception is

```
23:16:05.864 ERROR org.apache.spark.executor.Executor: Exception in task 
1.0 in stage 3.0 (TID 6)
org.apache.spark.sql.execution.QueryExecutionException: Encounter error 
while reading parquet files. One possible cause: Parquet column cannot be 
converted in the corresponding files. Details: 
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:193)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:674)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:850)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:850)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:325)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:289)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:419)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:425)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read 
value at 0 in block -1 in file 
file:/private/var/folders/pr/4q3b9vkx36lbygjr5jhfmjcwgn/T/spark-a4fff68d-d51a-4c79-aa18-54cfd7f81a75/contacts/p=2/part-0-8a4d9396-7be3-4fed-a55a-5580684ebda6-c000.snappy.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
... 19 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
at 
org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:97)
at 
org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:92)
at 
org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:278)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
at 
org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
at 
org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
at 
org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
... 24 more
23:16:05.896 WARN org.apache.spark.

[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-28 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r228741341
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -736,3 +736,65 @@ object CombineConcats extends Rule[LogicalPlan] {
   flattenConcats(concat)
   }
 }
+
+/**
+ * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further 
optimizations.
+ *
+ * For example, `Filter(Literal(null, _))` is equal to 
`Filter(FalseLiteral)`.
+ *
+ * Another example containing branches is `Filter(If(cond, FalseLiteral, 
Literal(null, _)))`;
+ * this can be optimized to `Filter(If(cond, FalseLiteral, 
FalseLiteral))`, and eventually
+ * `Filter(FalseLiteral)`.
+ *
+ * As a result, many unnecessary computations can be removed in the query 
optimization phase.
+ *
+ * Similarly, the same logic can be applied to conditions in [[Join]], 
predicates in [[If]],
+ * conditions in [[CaseWhen]].
+ */
+object ReplaceNullWithFalse extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case f @ Filter(cond, _) => f.copy(condition = 
replaceNullWithFalse(cond))
+case j @ Join(_, _, _, Some(cond)) => j.copy(condition = 
Some(replaceNullWithFalse(cond)))
+case p: LogicalPlan => p transformExpressions {
+  case i @ If(pred, _, _) => i.copy(predicate = 
replaceNullWithFalse(pred))
+  case CaseWhen(branches, elseValue) =>
+val newBranches = branches.map { case (cond, value) =>
+  replaceNullWithFalse(cond) -> value
+}
+CaseWhen(newBranches, elseValue)
+}
+  }
+
+  /**
+   * Recursively replaces `Literal(null, _)` with `FalseLiteral`.
+   *
+   * Note that `transformExpressionsDown` can not be used here as we must 
stop as soon as we hit
+   * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or 
`Literal(null, _)`.
+   */
+  private def replaceNullWithFalse(e: Expression): Expression = e match {
--- End diff --

We only do the replacements 1) within a `Join` or `Filter` condition, such as `Filter(If(cond, FalseLiteral, Literal(null, _)))`, or 2) in the predicate of an `If`, such as `If(Literal(null, _), trueValue, falseValue)`.
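As a concrete illustration of case 1 (my own example, assuming a `SparkSession` named `spark`):

```scala
import org.apache.spark.sql.functions._

// `when` without an `otherwise` yields NULL for non-matching rows, so this
// filter condition evaluates to either false or null. With the rule, the null
// branch becomes false and the whole Filter can be folded away.
val df = spark.range(100).toDF("i")
val filtered = df.filter(when(col("i") > 10, lit(false)))
filtered.explain(true)
```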


---




[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-28 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r228739082
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -736,3 +736,65 @@ object CombineConcats extends Rule[LogicalPlan] {
   flattenConcats(concat)
   }
 }
+
+/**
+ * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further 
optimizations.
+ *
+ * For example, `Filter(Literal(null, _))` is equal to 
`Filter(FalseLiteral)`.
+ *
+ * Another example containing branches is `Filter(If(cond, FalseLiteral, 
Literal(null, _)))`;
+ * this can be optimized to `Filter(If(cond, FalseLiteral, 
FalseLiteral))`, and eventually
+ * `Filter(FalseLiteral)`.
+ *
+ * As a result, many unnecessary computations can be removed in the query 
optimization phase.
+ *
+ * Similarly, the same logic can be applied to conditions in [[Join]], 
predicates in [[If]],
+ * conditions in [[CaseWhen]].
+ */
+object ReplaceNullWithFalse extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case f @ Filter(cond, _) => f.copy(condition = 
replaceNullWithFalse(cond))
+case j @ Join(_, _, _, Some(cond)) => j.copy(condition = 
Some(replaceNullWithFalse(cond)))
+case p: LogicalPlan => p transformExpressions {
+  case i @ If(pred, _, _) => i.copy(predicate = 
replaceNullWithFalse(pred))
+  case CaseWhen(branches, elseValue) =>
+val newBranches = branches.map { case (cond, value) =>
+  replaceNullWithFalse(cond) -> value
+}
+CaseWhen(newBranches, elseValue)
+}
+  }
+
+  /**
+   * Recursively replaces `Literal(null, _)` with `FalseLiteral`.
+   *
+   * Note that `transformExpressionsDown` can not be used here as we must 
stop as soon as we hit
+   * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or 
`Literal(null, _)`.
+   */
+  private def replaceNullWithFalse(e: Expression): Expression = e match {
+case cw: CaseWhen if getValues(cw).forall(isNullOrBoolean) =>
+  val newBranches = cw.branches.map { case (cond, value) =>
+replaceNullWithFalse(cond) -> replaceNullWithFalse(value)
+  }
+  val newElseValue = cw.elseValue.map(replaceNullWithFalse)
+  CaseWhen(newBranches, newElseValue)
+case If(pred, trueVal, falseVal) if Seq(trueVal, 
falseVal).forall(isNullOrBoolean) =>
--- End diff --

Nit: in other places, we use `trueValue` and `falseValue`.


---




[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-28 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r228739018
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceNullWithFalseSuite.scala
 ---
@@ -0,0 +1,324 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions.{And, CaseWhen, 
Expression, GreaterThan, If, Literal, Or}
+import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, 
TrueLiteral}
+import org.apache.spark.sql.catalyst.plans.{Inner, PlanTest}
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, 
LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+import org.apache.spark.sql.types.{BooleanType, IntegerType}
+
+class ReplaceNullWithFalseSuite extends PlanTest {
+
+  object Optimize extends RuleExecutor[LogicalPlan] {
+val batches =
+  Batch("Replace null literals", FixedPoint(10),
+NullPropagation,
+ConstantFolding,
+BooleanSimplification,
+SimplifyConditionals,
+ReplaceNullWithFalse) :: Nil
+  }
+
+  private val testRelation = LocalRelation('i.int, 'b.boolean)
+  private val anotherTestRelation = LocalRelation('d.int)
+
+  test("successful replacement of null literals in filter and join 
conditions (1)") {
+testFilter(originalCond = Literal(null), expectedCond = FalseLiteral)
+testJoin(originalCond = Literal(null), expectedCond = FalseLiteral)
+  }
+
+  test("successful replacement of null literals in filter and join 
conditions (2)") {
+val originalCond = If(
+  UnresolvedAttribute("i") > Literal(10),
+  FalseLiteral,
+  Literal(null, BooleanType))
+testFilter(originalCond, expectedCond = FalseLiteral)
+testJoin(originalCond, expectedCond = FalseLiteral)
+  }
+
+  test("successful replacement of null literals in filter and join 
conditions (3)") {
+val originalCond = If(
+  UnresolvedAttribute("i") > Literal(10),
+  TrueLiteral && Literal(null, BooleanType),
+  UnresolvedAttribute("b") && Literal(null, BooleanType))
+testFilter(originalCond, expectedCond = FalseLiteral)
+testJoin(originalCond, expectedCond = FalseLiteral)
+  }
+
+  test("successful replacement of null literals in filter and join 
conditions (4)") {
+val branches = Seq(
+  (UnresolvedAttribute("i") < Literal(10)) -> TrueLiteral,
+  (UnresolvedAttribute("i") > Literal(40)) -> FalseLiteral)
+val originalCond = CaseWhen(branches, Literal(null, BooleanType))
+val expectedCond = CaseWhen(branches, FalseLiteral)
+testFilter(originalCond, expectedCond)
+testJoin(originalCond, expectedCond)
+  }
+
+  test("successful replacement of null literals in filter and join 
conditions (5)") {
+val branches = Seq(
+  (UnresolvedAttribute("i") < Literal(10)) -> Literal(null, 
BooleanType),
+  (UnresolvedAttribute("i") > Literal(40)) -> FalseLiteral)
+val originalCond = CaseWhen(branches, Literal(null))
+testFilter(originalCond, expectedCond = FalseLiteral)
+testJoin(originalCond, expectedCond = FalseLiteral)
+  }
+
+  test("successful replacement of null literals in filter and join 
conditions (6)") {
+val originalBranches = Seq(
+  (UnresolvedAttribute("i") < Literal(10)) ->
+If(UnresolvedAttribute("i") < Literal(20), Literal(null, 
BooleanType), FalseLiteral)

[GitHub] spark issue #22857: [SPARK-25860][SQL] Replace Literal(null, _) with FalseLi...

2018-10-28 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22857
  
LGTM. 

@cloud-fan and @gatorsmile, this is the PR I mentioned to you earlier this year at the SF Spark Summit; it can simplify some of our queries.

Also adding @dongjoon-hyun and @viirya.

Thanks.


---




[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-28 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r228738623
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -736,3 +736,65 @@ object CombineConcats extends Rule[LogicalPlan] {
   flattenConcats(concat)
   }
 }
+
+/**
+ * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further 
optimizations.
+ *
+ * For example, `Filter(Literal(null, _))` is equal to 
`Filter(FalseLiteral)`.
+ *
+ * Another example containing branches is `Filter(If(cond, FalseLiteral, 
Literal(null, _)))`;
+ * this can be optimized to `Filter(If(cond, FalseLiteral, 
FalseLiteral))`, and eventually
+ * `Filter(FalseLiteral)`.
+ *
+ * As a result, many unnecessary computations can be removed in the query 
optimization phase.
+ *
+ * Similarly, the same logic can be applied to conditions in [[Join]], 
predicates in [[If]],
+ * conditions in [[CaseWhen]].
+ */
+object ReplaceNullWithFalse extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case f @ Filter(cond, _) => f.copy(condition = 
replaceNullWithFalse(cond))
+case j @ Join(_, _, _, Some(cond)) => j.copy(condition = 
Some(replaceNullWithFalse(cond)))
+case p: LogicalPlan => p transformExpressions {
+  case i @ If(pred, _, _) => i.copy(predicate = 
replaceNullWithFalse(pred))
+  case CaseWhen(branches, elseValue) =>
--- End diff --

Nit:

```scala
case cw @ CaseWhen(branches, _) =>
  ..
  ..
  cw.copy(branches = newBranches)
```


---




[GitHub] spark issue #22839: [SPARK-25656][SQL][DOC][EXAMPLE][BRANCH-2.4] Add a doc a...

2018-10-25 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22839
  
Thanks @dongjoon-hyun. This LGTM!


---




[GitHub] spark issue #22801: [SPARK-25656][SQL][DOC][EXAMPLE] Add a doc and examples ...

2018-10-23 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22801
  
This LGTM. 


---




[GitHub] spark issue #22788: [SPARK-25769][SQL]escape nested columns by backtick each...

2018-10-23 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22788
  
@cloud-fan I like the idea of using JSON, but that will also change the definition of the string format. Do we use JSON just for the nested case, so the existing data sources don't have to be changed?


---




[GitHub] spark issue #22788: [SPARK-25769][SQL]escape nested columns by backtick each...

2018-10-23 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22788
  
@cloud-fan @dongjoon-hyun instead of changing the `Filter` API, do you think using a proper escape character, as in https://github.com/apache/spark/pull/22573, is a good approach?


---




[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...

2018-10-15 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22597#discussion_r225309479
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
 ---
@@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with 
SharedSQLContext {
   )).get.toString
 }
   }
+
+  test("SPARK-25579 ORC PPD should support column names with dot") {
+import testImplicits._
+
+withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
+  withTempDir { dir =>
+val path = new File(dir, "orc").getCanonicalPath
+Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
+val df = spark.read.orc(path).where("`col.dot.1` = 1 and 
`col.dot.2` = 2")
+checkAnswer(stripSparkFilter(df), Row(1, 2))
--- End diff --

How do we generalize this to nested cases? The parent struct's name can contain a dot as well.
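To make the ambiguity concrete (a made-up example, not from this PR, assuming a `SparkSession` named `spark`):

```scala
import spark.implicits._

// Both the top-level column name and the nested field name contain dots, so a
// pushed-down reference has to be disambiguated at every level: `col.dot`.`x.y`
// (one column, one field) is different from four nested name components.
case class Inner(`x.y`: Int)
case class Outer(`col.dot`: Inner, plain: Int)

val ds = Seq(Outer(Inner(1), 2)).toDS()
ds.printSchema() // a struct column literally named "col.dot" with a field "x.y"
```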


---




[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

2018-10-12 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22597
  
In `ParquetFilter`, the way we test whether a predicate pushdown works is by 
removing that predicate from the Spark SQL physical plan and relying only on 
the reader to do the filtering. Thus, if there is a bug in the reader's 
pushdown filter, Spark will get an incorrect result. This can be used in tests 
to ensure there is no regression later.
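
As a rough sketch of how that works (this mirrors the `stripSparkFilter` 
helper used in the ORC test quoted above; the exact code in Spark's 
`SQLTestUtils` may differ slightly, and it relies on the package-private 
`internalCreateDataFrame`, so it lives inside Spark's own sql test package):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.FilterExec

// Assumes a test trait where `spark` (a SparkSession) is in scope.
protected def stripSparkFilter(df: DataFrame): DataFrame = {
  val schema = df.schema
  // Remove the Spark-side FilterExec so that filtering is left entirely to
  // the data source reader's pushed-down predicate.
  val withoutFilters = df.queryExecution.sparkPlan.transform {
    case FilterExec(_, child) => child
  }
  spark.internalCreateDataFrame(withoutFilters.execute(), schema)
}
```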


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...

2018-10-12 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22597
  
Is it possible to add tests, like the Parquet ones, that remove the filter in 
Spark SQL to ensure that the predicate is pushed down to the reader? Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22664: [SPARK-25662][SQL][TEST] Refactor DataSourceReadBenchmar...

2018-10-12 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22664
  
@peter-toth I assigned it to you. Thanks for the contribution. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22664: [SPARK-25662][SQL][TEST] Refactor DataSourceReadBenchmar...

2018-10-11 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22664
  
Thanks @dongjoon-hyun for pinging me. LGTM too. We're working on some Parquet 
reader improvements, and this will be useful.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22684: [SPARK-25699][SQL] Partially push down conjunctive predi...

2018-10-10 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22684
  
Merged into master. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22684: [SPARK-25699][SQL] Partially push down conjunctive predi...

2018-10-10 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22684
  
LGTM. Just some styling feedback. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22684: [SPARK-25699][SQL] Partially push down conjunctiv...

2018-10-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22684#discussion_r224179579
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala ---
@@ -90,32 +107,51 @@ private[orc] object OrcFilters extends Logging {
 
 expression match {
   case And(left, right) =>
-// At here, it is not safe to just convert one side if we do not 
understand the
-// other side. Here is an example used to explain the reason.
+// At here, it is not safe to just convert one side and remove the 
other side
+// if we do not understand what the parent filters are.
+//
+// Here is an example used to explain the reason.
 // Let's say we have NOT(a = 2 AND b in ('1')) and we do not 
understand how to
 // convert b in ('1'). If we only convert a = 2, we will end up 
with a filter
 // NOT(a = 2), which will generate wrong results.
-// Pushing one side of AND down is only safe to do at the top 
level.
-// You can see ParquetRelation's initializeLocalJobFunc method as 
an example.
-for {
-  _ <- buildSearchArgument(dataTypeMap, left, newBuilder)
-  _ <- buildSearchArgument(dataTypeMap, right, newBuilder)
-  lhs <- buildSearchArgument(dataTypeMap, left, builder.startAnd())
-  rhs <- buildSearchArgument(dataTypeMap, right, lhs)
-} yield rhs.end()
+//
+// Pushing one side of AND down is only safe to do at the top 
level or in the child
+// AND before hitting NOT or OR conditions, and in this case, the 
unsupported predicate
+// can be safely removed.
+val leftBuilderOption = createBuilder(dataTypeMap, left,
+  newBuilder, canPartialPushDownConjuncts)
+val rightBuilderOption =
+  createBuilder(dataTypeMap, right, newBuilder, 
canPartialPushDownConjuncts)
+(leftBuilderOption, rightBuilderOption) match {
+  case (Some(_), Some(_)) =>
+for {
+  lhs <- createBuilder(dataTypeMap, left,
+builder.startAnd(), canPartialPushDownConjuncts)
+  rhs <- createBuilder(dataTypeMap, right, lhs, 
canPartialPushDownConjuncts)
+} yield rhs.end()
+
+  case (Some(_), None) if canPartialPushDownConjuncts =>
+createBuilder(dataTypeMap, left, builder, 
canPartialPushDownConjuncts)
+
+  case (None, Some(_)) if canPartialPushDownConjuncts =>
+createBuilder(dataTypeMap, right, builder, 
canPartialPushDownConjuncts)
+
+  case _ => None
+}
 
   case Or(left, right) =>
 for {
-  _ <- buildSearchArgument(dataTypeMap, left, newBuilder)
-  _ <- buildSearchArgument(dataTypeMap, right, newBuilder)
-  lhs <- buildSearchArgument(dataTypeMap, left, builder.startOr())
-  rhs <- buildSearchArgument(dataTypeMap, right, lhs)
+  _ <- createBuilder(dataTypeMap, left, newBuilder, 
canPartialPushDownConjuncts = false)
+  _ <- createBuilder(dataTypeMap, right, newBuilder, 
canPartialPushDownConjuncts = false)
+  lhs <- createBuilder(dataTypeMap, left,
+builder.startOr(), canPartialPushDownConjuncts = false)
+  rhs <- createBuilder(dataTypeMap, right, lhs, 
canPartialPushDownConjuncts = false)
 } yield rhs.end()
 
   case Not(child) =>
 for {
-  _ <- buildSearchArgument(dataTypeMap, child, newBuilder)
-  negate <- buildSearchArgument(dataTypeMap, child, 
builder.startNot())
+  _ <- createBuilder(dataTypeMap, child, newBuilder, 
canPartialPushDownConjuncts = false)
+  negate <- createBuilder(dataTypeMap, child, builder.startNot(), 
false)
--- End diff --

ditto


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22684: [SPARK-25699][SQL] Partially push down conjunctiv...

2018-10-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22684#discussion_r224179447
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala ---
@@ -90,32 +107,51 @@ private[orc] object OrcFilters extends Logging {
 
 expression match {
   case And(left, right) =>
-// At here, it is not safe to just convert one side if we do not 
understand the
-// other side. Here is an example used to explain the reason.
+// At here, it is not safe to just convert one side and remove the 
other side
+// if we do not understand what the parent filters are.
+//
+// Here is an example used to explain the reason.
 // Let's say we have NOT(a = 2 AND b in ('1')) and we do not 
understand how to
 // convert b in ('1'). If we only convert a = 2, we will end up 
with a filter
 // NOT(a = 2), which will generate wrong results.
-// Pushing one side of AND down is only safe to do at the top 
level.
-// You can see ParquetRelation's initializeLocalJobFunc method as 
an example.
-for {
-  _ <- buildSearchArgument(dataTypeMap, left, newBuilder)
-  _ <- buildSearchArgument(dataTypeMap, right, newBuilder)
-  lhs <- buildSearchArgument(dataTypeMap, left, builder.startAnd())
-  rhs <- buildSearchArgument(dataTypeMap, right, lhs)
-} yield rhs.end()
+//
+// Pushing one side of AND down is only safe to do at the top 
level or in the child
+// AND before hitting NOT or OR conditions, and in this case, the 
unsupported predicate
+// can be safely removed.
+val leftBuilderOption = createBuilder(dataTypeMap, left,
+  newBuilder, canPartialPushDownConjuncts)
+val rightBuilderOption =
+  createBuilder(dataTypeMap, right, newBuilder, 
canPartialPushDownConjuncts)
+(leftBuilderOption, rightBuilderOption) match {
--- End diff --

ditto


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22684: [SPARK-25699][SQL] Partially push down conjunctiv...

2018-10-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22684#discussion_r224178237
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
 ---
@@ -138,39 +138,75 @@ private[sql] object OrcFilters {
   dataTypeMap: Map[String, DataType],
   expression: Filter,
   builder: Builder): Option[Builder] = {
+createBuilder(dataTypeMap, expression, builder, 
canPartialPushDownConjuncts = true)
+  }
+
+  /**
+   * @param dataTypeMap a map from the attribute name to its data type.
+   * @param expression the input filter predicates.
+   * @param builder the input SearchArgument.Builder.
+   * @param canPartialPushDownConjuncts whether a subset of conjuncts of 
predicates can be pushed
+   *down safely. Pushing ONLY one side 
of AND down is safe to
+   *do at the top level or none of its 
ancestors is NOT and OR.
+   * @return the builder so far.
+   */
+  private def createBuilder(
+  dataTypeMap: Map[String, DataType],
+  expression: Filter,
+  builder: Builder,
+  canPartialPushDownConjuncts: Boolean): Option[Builder] = {
 def getType(attribute: String): PredicateLeaf.Type =
   getPredicateLeafType(dataTypeMap(attribute))
 
 import org.apache.spark.sql.sources._
 
 expression match {
   case And(left, right) =>
-// At here, it is not safe to just convert one side if we do not 
understand the
-// other side. Here is an example used to explain the reason.
+// At here, it is not safe to just convert one side and remove the 
other side
+// if we do not understand what the parent filters are.
+//
+// Here is an example used to explain the reason.
 // Let's say we have NOT(a = 2 AND b in ('1')) and we do not 
understand how to
 // convert b in ('1'). If we only convert a = 2, we will end up 
with a filter
 // NOT(a = 2), which will generate wrong results.
-// Pushing one side of AND down is only safe to do at the top 
level.
-// You can see ParquetRelation's initializeLocalJobFunc method as 
an example.
-for {
-  _ <- buildSearchArgument(dataTypeMap, left, newBuilder)
-  _ <- buildSearchArgument(dataTypeMap, right, newBuilder)
-  lhs <- buildSearchArgument(dataTypeMap, left, builder.startAnd())
-  rhs <- buildSearchArgument(dataTypeMap, right, lhs)
-} yield rhs.end()
+//
+// Pushing one side of AND down is only safe to do at the top 
level or in the child
+// AND before hitting NOT or OR conditions, and in this case, the 
unsupported predicate
+// can be safely removed.
+val leftBuilderOption = createBuilder(dataTypeMap, left,
+  newBuilder, canPartialPushDownConjuncts)
+val rightBuilderOption =
+  createBuilder(dataTypeMap, right, newBuilder, 
canPartialPushDownConjuncts)
+(leftBuilderOption, rightBuilderOption) match {
+  case (Some(_), Some(_)) =>
+for {
+  lhs <- createBuilder(dataTypeMap, left,
+builder.startAnd(), canPartialPushDownConjuncts)
+  rhs <- createBuilder(dataTypeMap, right, lhs, 
canPartialPushDownConjuncts)
+} yield rhs.end()
+
+  case (Some(_), None) if canPartialPushDownConjuncts =>
+createBuilder(dataTypeMap, left, builder, 
canPartialPushDownConjuncts)
+
+  case (None, Some(_)) if canPartialPushDownConjuncts =>
+createBuilder(dataTypeMap, right, builder, 
canPartialPushDownConjuncts)
+
+  case _ => None
+}
 
   case Or(left, right) =>
 for {
-  _ <- buildSearchArgument(dataTypeMap, left, newBuilder)
-  _ <- buildSearchArgument(dataTypeMap, right, newBuilder)
-  lhs <- buildSearchArgument(dataTypeMap, left, builder.startOr())
-  rhs <- buildSearchArgument(dataTypeMap, right, lhs)
+  _ <- createBuilder(dataTypeMap, left, newBuilder, 
canPartialPushDownConjuncts = false)
+  _ <- createBuilder(dataTypeMap, right, newBuilder, 
canPartialPushDownConjuncts = false)
+  lhs <- createBuilder(dataTypeMap, left,
+builder.startOr(), canPartialPushDownConjuncts = false)
+  rhs <- createBuilder(dataTypeMap, right, lhs, 
canPartialPushDownConjuncts = false)
 } yield rhs.end()
 
   case Not(child) =>
 for {
-  _ <- buildSearchArgument(dataTypeMa

[GitHub] spark pull request #22684: [SPARK-25699][SQL] Partially push down conjunctiv...

2018-10-10 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22684#discussion_r224174206
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
 ---
@@ -138,39 +138,75 @@ private[sql] object OrcFilters {
   dataTypeMap: Map[String, DataType],
   expression: Filter,
   builder: Builder): Option[Builder] = {
+createBuilder(dataTypeMap, expression, builder, 
canPartialPushDownConjuncts = true)
+  }
+
+  /**
+   * @param dataTypeMap a map from the attribute name to its data type.
+   * @param expression the input filter predicates.
+   * @param builder the input SearchArgument.Builder.
+   * @param canPartialPushDownConjuncts whether a subset of conjuncts of 
predicates can be pushed
+   *down safely. Pushing ONLY one side 
of AND down is safe to
+   *do at the top level or none of its 
ancestors is NOT and OR.
+   * @return the builder so far.
+   */
+  private def createBuilder(
+  dataTypeMap: Map[String, DataType],
+  expression: Filter,
+  builder: Builder,
+  canPartialPushDownConjuncts: Boolean): Option[Builder] = {
 def getType(attribute: String): PredicateLeaf.Type =
   getPredicateLeafType(dataTypeMap(attribute))
 
 import org.apache.spark.sql.sources._
 
 expression match {
   case And(left, right) =>
-// At here, it is not safe to just convert one side if we do not 
understand the
-// other side. Here is an example used to explain the reason.
+// At here, it is not safe to just convert one side and remove the 
other side
+// if we do not understand what the parent filters are.
+//
+// Here is an example used to explain the reason.
 // Let's say we have NOT(a = 2 AND b in ('1')) and we do not 
understand how to
 // convert b in ('1'). If we only convert a = 2, we will end up 
with a filter
 // NOT(a = 2), which will generate wrong results.
-// Pushing one side of AND down is only safe to do at the top 
level.
-// You can see ParquetRelation's initializeLocalJobFunc method as 
an example.
-for {
-  _ <- buildSearchArgument(dataTypeMap, left, newBuilder)
-  _ <- buildSearchArgument(dataTypeMap, right, newBuilder)
-  lhs <- buildSearchArgument(dataTypeMap, left, builder.startAnd())
-  rhs <- buildSearchArgument(dataTypeMap, right, lhs)
-} yield rhs.end()
+//
+// Pushing one side of AND down is only safe to do at the top 
level or in the child
+// AND before hitting NOT or OR conditions, and in this case, the 
unsupported predicate
+// can be safely removed.
+val leftBuilderOption = createBuilder(dataTypeMap, left,
+  newBuilder, canPartialPushDownConjuncts)
+val rightBuilderOption =
+  createBuilder(dataTypeMap, right, newBuilder, 
canPartialPushDownConjuncts)
--- End diff --

Can you make the format the same as `leftBuilderOption`? Also, add another 
empty line before `(leftBuilderOption, rightBuilderOption)`. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22679: [SPARK-25559] [FOLLOW-UP] Add comments for partial pushd...

2018-10-09 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22679
  
Thanks. Merged into master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22679: [SPARK-25559] [FOLLOW-UP] Add comments for partial pushd...

2018-10-09 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22679
  
LGTM. Wait for the PR build. Thanks. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22574: [SPARK-25559][SQL] Remove the unsupported predicates in ...

2018-09-28 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22574
  
I changed the title; hopefully it's much clearer now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-28 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22573
  
I was thinking of changing the `Filter` APIs so we can represent nested 
fields more easily, but I also realized that it's a stable public interface.

Without changing the `Filter` interface, we have the following two options:

1. Use backticks to wrap the column and struct names that contain dots (see 
the sketch below). For example, 
  ```scala
  `column.1`.`attribute.b`
  ```
  It's also easier for people to understand when they read the pushdown 
plans in text format.

2. Alternatively, we can use ASCII-delimited text to avoid delimiter 
collision; for example, `\31` is commonly used between fields of a record, or 
members of a row. This simplifies parsing significantly, but the downside is 
that it's not readable, so when we print the plan we would need to add the 
backticks for visualization.

What do you think? 
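
A minimal, hypothetical sketch of option 1 (the helper names below are made 
up for illustration, not an existing Spark API): quote a name part with 
backticks only when it contains a dot, so the parts of a nested path stay 
unambiguous.

```scala
// Hypothetical helpers for illustration only.
def quotePart(part: String): String =
  if (part.contains(".")) s"`$part`" else part

def quoteNestedColumn(parts: Seq[String]): String =
  parts.map(quotePart).mkString(".")

quoteNestedColumn(Seq("column.1", "attribute.b"))  // `column.1`.`attribute.b`
quoteNestedColumn(Seq("name", "first"))            // name.first
```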


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22574: [SPARK-25559][SQL] Just remove the unsupported pr...

2018-09-28 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22574#discussion_r221374514
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
 ---
@@ -488,26 +494,27 @@ private[parquet] class ParquetFilters(
   .map(_(nameToParquetField(name).fieldName, value))
 
   case sources.And(lhs, rhs) =>
-// At here, it is not safe to just convert one side if we do not 
understand the
-// other side. Here is an example used to explain the reason.
-// Let's say we have NOT(a = 2 AND b in ('1')) and we do not 
understand how to
-// convert b in ('1'). If we only convert a = 2, we will end up 
with a filter
-// NOT(a = 2), which will generate wrong results.
-// Pushing one side of AND down is only safe to do at the top 
level.
-// You can see ParquetRelation's initializeLocalJobFunc method as 
an example.
--- End diff --

addressed and added more tests.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22574: [SPARK-25559][SQL] Just remove the unsupported predicate...

2018-09-28 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22574
  
test this again.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22574: [SPARK-25559][SQL] Just remove the unsupported pr...

2018-09-28 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22574#discussion_r221153414
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
 ---
@@ -488,26 +494,25 @@ private[parquet] class ParquetFilters(
   .map(_(nameToParquetField(name).fieldName, value))
 
   case sources.And(lhs, rhs) =>
-// At here, it is not safe to just convert one side if we do not 
understand the
-// other side. Here is an example used to explain the reason.
-// Let's say we have NOT(a = 2 AND b in ('1')) and we do not 
understand how to
-// convert b in ('1'). If we only convert a = 2, we will end up 
with a filter
-// NOT(a = 2), which will generate wrong results.
-// Pushing one side of AND down is only safe to do at the top 
level.
-// You can see ParquetRelation's initializeLocalJobFunc method as 
an example.
-for {
-  lhsFilter <- createFilter(schema, lhs)
-  rhsFilter <- createFilter(schema, rhs)
-} yield FilterApi.and(lhsFilter, rhsFilter)
+// If the unsupported predicate is in the top level `And` 
condition or in the child
+// `And` condition before hitting `Not` or `Or` condition, it can 
be safely removed.
+(createFilterHelper(nameToParquetField, lhs, canRemoveOneSideInAnd 
= true),
--- End diff --

Addressed. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22574: [SPARK-25559][SQL] Just remove the unsupported pr...

2018-09-28 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22574#discussion_r221152340
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
 ---
@@ -488,26 +494,25 @@ private[parquet] class ParquetFilters(
   .map(_(nameToParquetField(name).fieldName, value))
 
   case sources.And(lhs, rhs) =>
-// At here, it is not safe to just convert one side if we do not 
understand the
-// other side. Here is an example used to explain the reason.
-// Let's say we have NOT(a = 2 AND b in ('1')) and we do not 
understand how to
-// convert b in ('1'). If we only convert a = 2, we will end up 
with a filter
-// NOT(a = 2), which will generate wrong results.
-// Pushing one side of AND down is only safe to do at the top 
level.
-// You can see ParquetRelation's initializeLocalJobFunc method as 
an example.
-for {
-  lhsFilter <- createFilter(schema, lhs)
-  rhsFilter <- createFilter(schema, rhs)
-} yield FilterApi.and(lhsFilter, rhsFilter)
+// If the unsupported predicate is in the top level `And` 
condition or in the child
+// `And` condition before hitting `Not` or `Or` condition, it can 
be safely removed.
+(createFilterHelper(nameToParquetField, lhs, canRemoveOneSideInAnd 
= true),
+  createFilterHelper(nameToParquetField, rhs, 
canRemoveOneSideInAnd = true)) match {
--- End diff --

Thanks for catching this. I just fixed it. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22573: [SPARK-25558][SQL] Pushdown predicates for nested...

2018-09-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22573#discussion_r221128544
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
 ---
@@ -437,53 +436,65 @@ object DataSourceStrategy {
* @return a `Some[Filter]` if the input [[Expression]] is convertible, 
otherwise a `None`.
*/
   protected[sql] def translateFilter(predicate: Expression): 
Option[Filter] = {
+// Recursively try to find an attribute name from the top level that 
can be pushed down.
+def attrName(e: Expression): Option[String] = e match {
+  // In Spark and many data sources such as parquet, dots are used as 
a column path delimiter;
+  // thus, we don't translate such expressions.
+  case a: Attribute if !a.name.contains(".") =>
+Some(a.name)
--- End diff --

Do we have any data source currently supporting a `dot` in the column name 
with pushdown? The worst case would be no pushdown for those data sources.

I know ORC doesn't work for now. We can have another follow-up PR to address 
this. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22574: [SPARK-25556][SQL] Just remove the unsupported predicate...

2018-09-27 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22574
  
cc @gatorsmile @cloud-fan @HyukjinKwon @dongjoon-hyun @viirya 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22574: [SPARK-25556][SQL] Just remove the unsupported pr...

2018-09-27 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/22574

[SPARK-25556][SQL] Just remove the unsupported predicates in Parquet

## What changes were proposed in this pull request?

Currently, in `ParquetFilters`, if one of the child predicates is not 
supported by Parquet, the entire predicate will be thrown away. In fact, if 
the unsupported predicate is in the top-level `And` condition, or in a child 
`And` before hitting a `Not` or `Or` condition, it can be safely removed.
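
A simplified sketch of that rule (not the actual `ParquetFilters` code; 
`convertLeaf` below is a hypothetical stand-in for converting a single 
supported predicate):

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}
import org.apache.spark.sql.sources

// Hypothetical stand-in: convert one supported leaf predicate, or None.
def convertLeaf(f: sources.Filter): Option[FilterPredicate] = None

def convert(f: sources.Filter, canRemoveOneSideInAnd: Boolean): Option[FilterPredicate] =
  f match {
    case sources.And(l, r) =>
      (convert(l, canRemoveOneSideInAnd), convert(r, canRemoveOneSideInAnd)) match {
        case (Some(lhs), Some(rhs)) => Some(FilterApi.and(lhs, rhs))
        // Dropping one side is only safe while no enclosing Not/Or has been seen.
        case (Some(lhs), None) if canRemoveOneSideInAnd => Some(lhs)
        case (None, Some(rhs)) if canRemoveOneSideInAnd => Some(rhs)
        case _ => None
      }
    case sources.Or(l, r) =>
      // Under Or (and Not), both sides must be fully convertible.
      for {
        lhs <- convert(l, canRemoveOneSideInAnd = false)
        rhs <- convert(r, canRemoveOneSideInAnd = false)
      } yield FilterApi.or(lhs, rhs)
    case sources.Not(child) =>
      convert(child, canRemoveOneSideInAnd = false).map(FilterApi.not)
    case other => convertLeaf(other)
  }
```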

## How was this patch tested?

Tests are added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark 
removeUnsupportedPredicatesInParquet

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22574.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22574


commit d49d63bc40a7752990583f9afbd10c68025510b3
Author: DB Tsai 
Date:   2018-09-27T22:12:44Z

Remove unsupported predicates in parquet




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...

2018-09-27 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22535
  
I'm breaking this PR into three smaller PRs. I'll fix the tests in those 
smaller PRs. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...

2018-09-27 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22573
  
@gatorsmile @cloud-fan @dongjoon-hyun @viirya 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22573: [SPARK-25558][SQL] Pushdown predicates for nested...

2018-09-27 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/22573

[SPARK-25558][SQL] Pushdown predicates for nested fields in DataSource 
Strategy

## What changes were proposed in this pull request?

This PR allows Spark to create predicates for nested fields, and it's a 
building block for Parquet and ORC to support nested predicate pushdown.
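
For illustration only (a hypothetical example, not output of this PR): with 
this building block, a catalyst predicate on a nested field such as 
`name.first = 'Jane'` can be translated into a data source `Filter` whose 
attribute is the dotted path, which is also why top-level column names that 
themselves contain dots are not translated.

```scala
import org.apache.spark.sql.sources

// The nested field is referenced by its dotted path in the Filter API.
val pushed: sources.Filter = sources.EqualTo("name.first", "Jane")
```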

## How was this patch tested?

Tests added

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark dataSourcePredicate

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22573.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22573


commit 2f21842d4676993d0d28abb6297796c672186f53
Author: DB Tsai 
Date:   2018-09-27T18:24:38Z

DataSourceStrategy




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdow...

2018-09-24 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/22535

[SPARK-17636][SQL][WIP] Parquet predicate pushdown in nested fields

## What changes were proposed in this pull request?

Support Parquet predicate pushdown in nested fields

## How was this patch tested?

Existing tests and new tests are added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark parquetNesting

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22535.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22535


commit c95706f60e4d576caca78a32000d4a7bbb12c141
Author: DB Tsai 
Date:   2018-09-06T00:22:09Z

Nested parquet pushdown




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22418: [SPARK-25427][SQL][TEST] Add BloomFilter creation...

2018-09-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22418#discussion_r218272427
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -50,6 +55,66 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   .createOrReplaceTempView("orc_temp_table")
   }
 
+  protected def testBloomFilterCreation(bloomFilterKind: Kind) {
+val tableName = "bloomFilter"
+
+withTempDir { dir =>
+  withTable(tableName) {
+val sqlStatement = orcImp match {
+  case "native" =>
+s"""
+   |CREATE TABLE $tableName (a INT, b STRING)
+   |USING ORC
+   |OPTIONS (
+   |  path '${dir.toURI}',
+   |  orc.bloom.filter.columns '*',
+   |  orc.bloom.filter.fpp 0.1
+   |)
+""".stripMargin
+  case "hive" =>
+s"""
+   |CREATE TABLE $tableName (a INT, b STRING)
+   |STORED AS ORC
+   |LOCATION '${dir.toURI}'
+   |TBLPROPERTIES (
+   |  orc.bloom.filter.columns='*',
+   |  orc.bloom.filter.fpp=0.1
+   |)
+""".stripMargin
+  case impl =>
+throw new UnsupportedOperationException(s"Unknown ORC 
implementation: $impl")
+}
+
+sql(sqlStatement)
+sql(s"INSERT INTO $tableName VALUES (1, 'str')")
+
+val partFiles = dir.listFiles()
+  .filter(f => f.isFile && !f.getName.startsWith(".") && 
!f.getName.startsWith("_"))
+assert(partFiles.length === 1)
+
+val orcFilePath = new Path(partFiles.head.getAbsolutePath)
+val readerOptions = OrcFile.readerOptions(new Configuration())
+val reader = OrcFile.createReader(orcFilePath, readerOptions)
+var recordReader: RecordReaderImpl = null
+try {
+  recordReader = reader.rows.asInstanceOf[RecordReaderImpl]
+
+  // BloomFilter array is created for all types; `struct`, int 
(`a`), string (`b`)
+  val sargColumns = Array(true, true, true)
+  val orcIndex = recordReader.readRowIndex(0, null, sargColumns)
+
+  // Check the types and counts of bloom filters
+  assert(orcIndex.getBloomFilterKinds.forall(_ === 
bloomFilterKind))
--- End diff --

Something like

```
== Physical Plan ==
*(1) Project [_1#3]
+- *(1) Filter (isnotnull(_1#3) && (_1#3._1 = true))
   +- *(1) FileScan parquet [_1#3] Batched: false, Format: Orc, 
  PushedFilters: [IsNotNull(_1), EqualTo(_1._1,true)]
  BloomFilters: [some information]
```

Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22418: [SPARK-25427][SQL][TEST] Add BloomFilter creation...

2018-09-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22418#discussion_r218158845
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala
 ---
@@ -50,6 +55,66 @@ abstract class OrcSuite extends OrcTest with 
BeforeAndAfterAll {
   .createOrReplaceTempView("orc_temp_table")
   }
 
+  protected def testBloomFilterCreation(bloomFilterKind: Kind) {
+val tableName = "bloomFilter"
+
+withTempDir { dir =>
+  withTable(tableName) {
+val sqlStatement = orcImp match {
+  case "native" =>
+s"""
+   |CREATE TABLE $tableName (a INT, b STRING)
+   |USING ORC
+   |OPTIONS (
+   |  path '${dir.toURI}',
+   |  orc.bloom.filter.columns '*',
+   |  orc.bloom.filter.fpp 0.1
+   |)
+""".stripMargin
+  case "hive" =>
+s"""
+   |CREATE TABLE $tableName (a INT, b STRING)
+   |STORED AS ORC
+   |LOCATION '${dir.toURI}'
+   |TBLPROPERTIES (
+   |  orc.bloom.filter.columns='*',
+   |  orc.bloom.filter.fpp=0.1
+   |)
+""".stripMargin
+  case impl =>
+throw new UnsupportedOperationException(s"Unknown ORC 
implementation: $impl")
+}
+
+sql(sqlStatement)
+sql(s"INSERT INTO $tableName VALUES (1, 'str')")
+
+val partFiles = dir.listFiles()
+  .filter(f => f.isFile && !f.getName.startsWith(".") && 
!f.getName.startsWith("_"))
+assert(partFiles.length === 1)
+
+val orcFilePath = new Path(partFiles.head.getAbsolutePath)
+val readerOptions = OrcFile.readerOptions(new Configuration())
+val reader = OrcFile.createReader(orcFilePath, readerOptions)
+var recordReader: RecordReaderImpl = null
+try {
+  recordReader = reader.rows.asInstanceOf[RecordReaderImpl]
+
+  // BloomFilter array is created for all types; `struct`, int 
(`a`), string (`b`)
+  val sargColumns = Array(true, true, true)
+  val orcIndex = recordReader.readRowIndex(0, null, sargColumns)
+
+  // Check the types and counts of bloom filters
+  assert(orcIndex.getBloomFilterKinds.forall(_ === 
bloomFilterKind))
--- End diff --

It seems the test here creates an ORC file with a bloom filter using Spark 
options and reads it back through the native ORC reader. Is there a plan to 
add an optimizer rule in Spark that shows this functionality in the physical 
plan, like predicate pushdown in Parquet? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22431: [SPARK-24418][FOLLOWUP][DOC] Update docs to show Scala 2...

2018-09-15 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22431
  
Thanks! Merged into both branch 2.4 and master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22394: [SPARK-25406][SQL] For ParquetSchemaPruningSuite.scala, ...

2018-09-13 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22394
  
LGTM. Merged into master and branch 2.4. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22409: [SPARK-25352][SQL][Followup] Add helper method and addre...

2018-09-13 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22409
  
LGTM. Wait for the test. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22344: [SPARK-25352][SQL] Perform ordered global limit w...

2018-09-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22344#discussion_r217128163
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -68,22 +68,42 @@ abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
   object SpecialLimits extends Strategy {
 override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
   case ReturnAnswer(rootPlan) => rootPlan match {
-case Limit(IntegerLiteral(limit), Sort(order, true, child))
-if limit < conf.topKSortFallbackThreshold =>
-  TakeOrderedAndProjectExec(limit, order, child.output, 
planLater(child)) :: Nil
-case Limit(IntegerLiteral(limit), Project(projectList, Sort(order, 
true, child)))
-if limit < conf.topKSortFallbackThreshold =>
-  TakeOrderedAndProjectExec(limit, order, projectList, 
planLater(child)) :: Nil
+case Limit(IntegerLiteral(limit), s@Sort(order, true, child)) =>
+  if (limit < conf.topKSortFallbackThreshold) {
--- End diff --

Also, please add a space between `s` and `@`. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...

2018-09-12 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22357
  
Thanks all again. Merged into 2.4 branch and master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...

2018-09-11 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22357
  
LGTM. 

Thank you all for participating in the discussion. @cloud-fan and @gatorsmile, 
do you have any further comments? If not, I would like to merge it tomorrow 
into both master and the RC branch, as it's an important performance fix for 
schema pruning. 

Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-11 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216776055
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
 val projectionRootFields = projects.flatMap(getRootFields)
 val filterRootFields = filters.flatMap(getRootFields)
 
-(projectionRootFields ++ filterRootFields).distinct
+// Kind of expressions don't need to access any fields of a root 
fields, e.g., `IsNotNull`.
+// For them, if there are any nested fields accessed in the query, we 
don't need to add root
+// field access of above expressions.
+// For example, for a query `SELECT name.first FROM contacts WHERE 
name IS NOT NULL`,
+// we don't need to read nested fields of `name` struct other than 
`first` field.
--- End diff --

Instead of reading an arbitrary field of simple type (which may not exist 
if it's a deeply nested struct), I think we should implement the pushdown with 
complex types in Parquet using similar logic, and let the Parquet reader 
handle it.

@viirya Can you create a follow-up JIRA for this? 

Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...

2018-09-11 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22357
  
FYI @mallman, I'm working on having `ParquetFilter` support 
`IsNotNull(employer.id)` being pushed down into the Parquet reader. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-11 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216559045
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
 val projectionRootFields = projects.flatMap(getRootFields)
 val filterRootFields = filters.flatMap(getRootFields)
 
-(projectionRootFields ++ filterRootFields).distinct
+// Kind of expressions don't need to access any fields of a root 
fields, e.g., `IsNotNull`.
+// For them, if there are any nested fields accessed in the query, we 
don't need to add root
+// field access of above expressions.
+// For example, for a query `SELECT name.first FROM contacts WHERE 
name IS NOT NULL`,
+// we don't need to read nested fields of `name` struct other than 
`first` field.
--- End diff --

For the first query, the constraint is `employer is not null`.

When `employer.id` is not `null`, `employer` will always be non-`null`; as 
a result, this PR will work.

However, when `employer.id` is `null`, `employer` can be `null` or 
`something`, so we need to check whether `employer` is `something` in order 
to return a null `employer.id`.

I checked in `ParquetFilter`: `IsNotNull(employer)` will be ignored since 
it's not a valid Parquet filter, as Parquet doesn't support pushdown on a 
struct; thus, with this PR, this query will return a wrong answer.

I think in this scenario, as @mallman suggested, we might need to read the 
full data. 
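
A minimal sketch of why those predicates differ (hypothetical data; assumes 
a `spark` session is in scope), showing that `employer IS NOT NULL` and a 
not-null check on a nested field are not interchangeable:

```scala
import spark.implicits._

val df = Seq(
  (1, Some((Option.empty[Int], "Acme"))),   // employer present, its first field is null
  (2, Option.empty[(Option[Int], String)])  // employer itself is null
).toDF("c", "employer")

df.where("employer is not null").count()     // 1: keeps the first row
df.where("employer._1 is not null").count()  // 0: drops both rows
```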


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216204022
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
 ---
@@ -155,6 +161,47 @@ class ParquetSchemaPruningSuite
   Row(null) :: Row(null) :: Nil)
   }
 
+  testSchemaPruning("select a single complex field and in where clause") {
+val query1 = sql("select name.first from contacts where name.first = 
'Jane'")
+checkScan(query1, "struct>")
+checkAnswer(query1, Row("Jane") :: Nil)
+
+val query2 = sql("select name.first, name.last from contacts where 
name.first = 'Jane'")
+checkScan(query2, "struct>")
+checkAnswer(query2, Row("Jane", "Doe") :: Nil)
+
+val query3 = sql("select name.first from contacts " +
+  "where employer.company.name = 'abc' and p = 1")
--- End diff --

Let's say a user adds `where employer.company is not null`: can we still 
read the schema with `employer:struct>>`, as we only 
mark `contentAccessed = false` when `IsNotNull` is on an attribute?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...

2018-09-09 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22357#discussion_r216202879
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -110,7 +110,12 @@ private[sql] object ParquetSchemaPruning extends 
Rule[LogicalPlan] {
 val projectionRootFields = projects.flatMap(getRootFields)
 val filterRootFields = filters.flatMap(getRootFields)
 
-(projectionRootFields ++ filterRootFields).distinct
+val (rootFields, optRootFields) = (projectionRootFields ++ 
filterRootFields)
+  .distinct.partition(_.contentAccessed)
--- End diff --

Some comments here please. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...

2018-09-09 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22357
  
cc @beettlle 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


