[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23171

@rxin `switch` in Java is still significantly faster than a hash set, even without boxing/unboxing problems, when the number of elements is small. We were thinking about having two implementations in `InSet`: pick the `switch` one when the number of elements is small, and otherwise pick the hash set one. But that has the same complexity as having two implementations in `In`, as in this PR. @cloud-fan do you suggest creating an `OptimizeIn` that chooses between the `switch` and hash set implementations based on the number of elements, and removing `InSet`? Basically, what we were thinking above.
[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23171

@cloud-fan as @aokolnychyi said, `switch` will still be faster than an optimized `Set` without autoboxing when the number of elements is small. As a result, this PR is still very useful. @mgaido91 `InSet` can be better when implemented properly, without autoboxing, for large numbers of elements, controlled by `spark.sql.optimizer.inSetConversionThreshold`. Also, generating `In` with huge lists can cause a compile exception due to the method size limit, as you pointed out. As a result, we should convert it into `InSet` for large sets.
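A minimal sketch of the tradeoff discussed above (illustrative names only, not the PR's actual codegen): matching on constant `Int` cases typically compiles to a JVM `tableswitch`/`lookupswitch` with no hashing and no boxing, while a `Set[Any]` lookup boxes every probe.

```scala
object InLookupSketch {
  // Hash-set lookup: each probe boxes the primitive Int into a java.lang.Integer.
  val valueSet: Set[Any] = Set(1, 5, 9)
  def inViaSet(v: Int): Boolean = valueSet.contains(v)

  // Switch-style lookup: constant alternatives typically compile to a
  // tableswitch/lookupswitch, which is what the generated Java `switch` relies on.
  def inViaMatch(v: Int): Boolean = v match {
    case 1 | 5 | 9 => true
    case _ => false
  }

  def main(args: Array[String]): Unit = {
    assert(inViaSet(5) && inViaMatch(5))
    assert(!inViaSet(2) && !inViaMatch(2))
  }
}
```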
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23100

I went through the PR again, and it looks right to me. Merged into master. Thanks!
[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23100

It's hard to track the huge diffs from the renaming, so I didn't go through it line by line. But if it's just renaming, the rest LGTM.
[GitHub] spark issue #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts, ints
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23171

The approach looks great and can significantly improve performance. For `Long`, I agree that we should also implement a binary search approach for `O(log n)` lookup. I wonder which will be faster: binary search over arrays, or rewriting the `if-else` chain in binary-search form.
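For reference, a small sketch of the array-based alternative mentioned here (my illustration; `java.util.Arrays.binarySearch` on a sorted primitive array avoids boxing entirely):

```scala
import java.util.Arrays

object LongInLookup {
  // Sorted once, when the expression is constructed.
  private val sortedValues: Array[Long] = Array(42L, 3L, 19L, 7L).sorted

  // O(log n) membership test on primitive longs, with no boxing.
  def contains(v: Long): Boolean = Arrays.binarySearch(sortedValues, v) >= 0
}
```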
[GitHub] spark pull request #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts,...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/23171#discussion_r237227892

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala (@@ -335,6 +343,41 @@)
```scala
private def genCodeWithSwitch(ctx: CodegenContext, ev: ExprCode): ExprCode = {
  val (nullLiterals, nonNullLiterals) = list.partition {
    case Literal(null, _) => true
    case _ => false
  }
```

If there is a null in the list, there will be only one. As a result, we may not need `nullLiterals`:
```scala
val containNullInList = ...
val nonNullLiterals = ...
```
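A possible completion of the elided suggestion above (a hypothetical sketch, not the PR's final code; it assumes the surrounding `In` expression's `list: Seq[Expression]`): since duplicate null literals are irrelevant, a boolean flag suffices.

```scala
// Sketch: a flag instead of a collection of null literals, since at most one matters.
val containNullInList: Boolean = list.exists {
  case Literal(null, _) => true
  case _ => false
}
val nonNullLiterals: Seq[Expression] = list.filterNot {
  case Literal(null, _) => true
  case _ => false
}
```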
[GitHub] spark pull request #23171: [SPARK-26205][SQL] Optimize In for bytes, shorts,...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/23171#discussion_r237226275

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala (@@ -335,6 +343,41 @@)
```scala
private def genCodeWithSwitch(ctx: CodegenContext, ev: ExprCode): ExprCode = {
  val (nullLiterals, nonNullLiterals) = list.partition {
    case Literal(null, _) => true
    case _ => false
  }
  val listGen = nonNullLiterals.map(_.genCode(ctx))
  val valueGen = value.genCode(ctx)

  val caseBranches = listGen.map(literal =>
    s"""
       |case ${literal.value}:
       |  ${ev.value} = true;
       |  break;
     """.stripMargin)

  ev.copy(code =
    code"""
       |${valueGen.code}
       |${CodeGenerator.JAVA_BOOLEAN} ${ev.isNull} = ${valueGen.isNull};
       |${CodeGenerator.JAVA_BOOLEAN} ${ev.value} = false;
       |if (!${valueGen.isNull}) {
       |  switch (${valueGen.value}) {
       |${caseBranches.mkString("")}
       |default:
       |  ${ev.isNull} = ${nullLiterals.nonEmpty};
       |  }
       |}
     """.stripMargin)
}

private def isSwitchCompatible: Boolean = list.forall {
  case Literal(_, dt) => dt == ByteType || dt == ShortType || dt == IntegerType
```

```scala
case Literal(_, dt) if dt == ByteType || dt == ShortType || dt == IntegerType => true
```
is easier to read?
[GitHub] spark issue #23148: [SPARK-26177] Automated formatting for Scala code
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23148

Thanks for testing it out. I personally like auto-formatting, as my company's projects use scalafmt and we find it very useful for keeping a consistent coding style.
[GitHub] spark issue #23148: [SPARK-26177] Automated formatting for Scala code
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23148

My concern: say we have code like the following, which I copied from `ParquetSchemaPruningSuite.scala`. scalafmt will complain that the second line is longer than 98 characters and reformat it. But this should be legitimate coding style, since we often keep each such string on one line for better readability.

```scala
checkScan(query,
  "struct<id:int,name:struct<first:string,middle:string,last:string>,address:string,pets:int," +
  "friends:array<struct<first:string,middle:string,last:string>>," +
  "relatives:map<string,struct<first:string,middle:string,last:string>>>")
```
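One common way to handle such cases (a suggestion about scalafmt usage, not something proposed in this PR): scalafmt honors `// format: off` / `// format: on` markers, which exempt a region from reformatting.

```scala
// format: off
checkScan(query,
  "struct<id:int,name:struct<first:string,middle:string,last:string>,address:string,pets:int," +
  "friends:array<struct<first:string,middle:string,last:string>>," +
  "relatives:map<string,struct<first:string,middle:string,last:string>>>")
// format: on
```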
[GitHub] spark pull request #23148: [SPARK-26177] Automated formatting for Scala code
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/23148#discussion_r236830497

Diff: dev/.scalafmt.conf
```
align = none
align.openParenDefnSite = false
align.openParenCallSite = false
align.tokens = []
docstrings = JavaDoc
maxColumn = 98
```

If we set it to `98`, will this complain about legitimate code with 100 characters?
[GitHub] spark pull request #23148: [SPARK-26177] Automated formatting for Scala code
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/23148#discussion_r236813956

Diff: dev/.scalafmt.conf
```
align = none
align.openParenDefnSite = false
align.openParenCallSite = false
align.tokens = []
docstrings = JavaDoc
maxColumn = 98
```

Are we using 100 for `maxColumn`?
[GitHub] spark issue #23139: [SPARK-25860][SPARK-26107] [FOLLOW-UP] Rule ReplaceNullW...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23139

Thanks. Merged into master.
[GitHub] spark issue #23139: [SPARK-25860][SPARK-26107] [FOLLOW-UP] Rule ReplaceNullW...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23139

LGTM.
[GitHub] spark pull request #23139: [SPARK-25860][SPARK-26107] [FOLLOW-UP] Rule Repla...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/23139#discussion_r236463594

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceNullWithFalseInPredicate.scala
```diff
@@ -79,29 +80,31 @@ object ReplaceNullWithFalseInPredicate extends Rule[LogicalPlan] {
    * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or
    * `Literal(null, BooleanType)`.
    */
-  private def replaceNullWithFalse(e: Expression): Expression = {
-    if (e.dataType != BooleanType) {
+  private def replaceNullWithFalse(e: Expression): Expression = e match {
+    case Literal(null, BooleanType) =>
+      FalseLiteral
+    case And(left, right) =>
+      And(replaceNullWithFalse(left), replaceNullWithFalse(right))
+    case Or(left, right) =>
+      Or(replaceNullWithFalse(left), replaceNullWithFalse(right))
+    case cw: CaseWhen if cw.dataType == BooleanType =>
+      val newBranches = cw.branches.map { case (cond, value) =>
+        replaceNullWithFalse(cond) -> replaceNullWithFalse(value)
+      }
+      val newElseValue = cw.elseValue.map(replaceNullWithFalse)
+      CaseWhen(newBranches, newElseValue)
+    case i @ If(pred, trueVal, falseVal) if i.dataType == BooleanType =>
+      If(replaceNullWithFalse(pred), replaceNullWithFalse(trueVal), replaceNullWithFalse(falseVal))
+    case e if e.dataType == BooleanType =>
       e
-    } else {
-      e match {
-        case Literal(null, BooleanType) =>
-          FalseLiteral
-        case And(left, right) =>
-          And(replaceNullWithFalse(left), replaceNullWithFalse(right))
-        case Or(left, right) =>
-          Or(replaceNullWithFalse(left), replaceNullWithFalse(right))
-        case cw: CaseWhen =>
-          val newBranches = cw.branches.map { case (cond, value) =>
-            replaceNullWithFalse(cond) -> replaceNullWithFalse(value)
-          }
-          val newElseValue = cw.elseValue.map(replaceNullWithFalse)
-          CaseWhen(newBranches, newElseValue)
-        case If(pred, trueVal, falseVal) =>
-          If(replaceNullWithFalse(pred),
-            replaceNullWithFalse(trueVal),
-            replaceNullWithFalse(falseVal))
-        case _ => e
+    case e =>
+      val message = "Expected a Boolean type expression in replaceNullWithFalse, " +
+        s"but got the type `${e.dataType.catalogString}` in `${e.sql}`."
+      if (Utils.isTesting) {
+        throw new IllegalArgumentException(message)
```

Sounds fair.
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/23100#discussion_r236411677

Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala
```scala
import org.apache.hadoop.fs.Path
```

Or we can file two PRs: one for removing the old `OneHotEncoder`, and the other for the renaming.
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/23100#discussion_r236410750

Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala
```diff
@@ -17,126 +17,512 @@ package org.apache.spark.ml.feature
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
 import org.apache.spark.annotation.Since
-import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.{Estimator, Model}
 import org.apache.spark.ml.attribute._
 import org.apache.spark.ml.linalg.Vectors
 import org.apache.spark.ml.param._
-import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, HasOutputCols}
 import org.apache.spark.ml.util._
 import org.apache.spark.sql.{DataFrame, Dataset}
-import org.apache.spark.sql.functions.{col, udf}
-import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+/** Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+    with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data during transform().
+   * Options are 'keep' (invalid data presented as an extra categorical feature) or
+   * 'error' (throw an error).
+   * Note that this Param is only used during transform; during fitting, invalid data
+   * will result in an error.
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
```

As we discussed previously, it's a new class. Should we mark it as `3.0`?
[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/23100#discussion_r236410306

Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala
```scala
import org.apache.hadoop.fs.Path
```

I guess once the commits are squashed into one, it will still look like this, without better history.
[GitHub] spark pull request #23139: [SPARK-25860][SPARK-26107] [FOLLOW-UP] Rule Repla...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/23139#discussion_r236394865

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceNullWithFalseInPredicate.scala
```scala
case e =>
  val message = "Expected a Boolean type expression in replaceNullWithFalse, " +
    s"but got the type `${e.dataType.catalogString}` in `${e.sql}`."
  if (Utils.isTesting) {
    throw new IllegalArgumentException(message)
```

Should we add a test for this? And why not also throw the exception at runtime, since this should never be hit?
[GitHub] spark issue #23139: [SPARK-25860][SPARK-26107] [FOLLOW-UP] Rule ReplaceNullW...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23139

Although we try to ensure on the caller side that `replaceNullWithFalse` is only called when the expression is of boolean type, I agree that, for safety, we should check it and throw an exception to guard future development. LGTM.
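The fail-fast-in-tests pattern under discussion, sketched standalone (an illustration under stated assumptions: Spark's real `Utils.isTesting` checks the `SPARK_TESTING` environment variable or the `spark.testing` system property, and the real rule presumably logs a warning rather than printing):

```scala
object NullSafetyGuard {
  // Approximation of org.apache.spark.util.Utils.isTesting.
  private def isTesting: Boolean =
    sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")

  // Fail fast in tests; stay lenient in production, where the rule is a no-op.
  def requireBoolean(dataTypeName: String, sqlText: String): Unit = {
    val message = "Expected a Boolean type expression in replaceNullWithFalse, " +
      s"but got the type `$dataTypeName` in `$sqlText`."
    if (isTesting) throw new IllegalArgumentException(message)
    else Console.err.println(s"WARN: $message")
  }
}
```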
[GitHub] spark issue #23118: [SPARK-26144][BUILD] `build/mvn` should detect `scala.ve...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/23118

Late to the party! Thanks @dongjoon-hyun for taking care of this.
[GitHub] spark issue #22967: [SPARK-25956] Make Scala 2.12 as default Scala version i...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22967

Retest this please.
[GitHub] spark issue #22967: [SPARK-25956] Make Scala 2.12 as default Scala version i...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22967

@dongjoon-hyun thanks for triggering the build. The Python test script was only looking for Scala 2.11 jars, resulting in Python test failures. I just fixed that in the latest push. Let's see how it goes.
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22967#discussion_r22593

Diff: pom.xml
```diff
@@ -2718,7 +2710,6 @@
         *:*_2.11
-        *:*_2.10
```

Thanks for the suggestion, and I agree this will make the default Scala 2.12 profile cleaner.
[GitHub] spark issue #22977: [SPARK-26030][BUILD] Bump previousSparkVersion in MimaBu...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22977

LGTM. Thanks!
[GitHub] spark issue #22764: [SPARK-25765][ML] Add training cost to BisectingKMeans s...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22764

@mgaido91 I'm on Thanksgiving vacation and will be back in the community to help with code reviews on Nov 21st. Sorry for the delay.
[GitHub] spark issue #22967: [SPARK-25956] Make Scala 2.12 as default Scala version i...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22967

Waiting for https://github.com/apache/spark/pull/22977 to be merged; then I'll rebase on it and fix the remaining binary incompatibilities.
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22967#discussion_r232510439

Diff: pom.xml
```diff
@@ -2717,7 +2717,6 @@
-        *:*_2.11
         *:*_2.10
```

Makes sense. I made the parent rule exclude 2.10 and moved the 2.11 exclusion to the 2.12 profile. Thanks.
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22967#discussion_r232505402

Diff: docs/sparkr.md
```diff
-sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
+sparkR.session()
```

I just did a text search and replace, and I didn't read the context of `sparkPackages = "com.databricks:spark-avro_2.11:3.0.0"` here. My bad. Although Avro is now part of the Spark codebase, it's an external package that is not on the classpath by default. How about I change it to `sparkPackages = "org.apache.spark:spark-avro_2.12:3.0.0"` here?
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22967#discussion_r232024914

Diff: docs/sparkr.md
```diff
-sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
+sparkR.session()
```

I thought `com.databricks:spark-avro_2.12` was deprecated and no longer exists.
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22967#discussion_r232024557

Diff: pom.xml
```diff
@@ -1998,7 +1998,7 @@
         org.jboss.netty
         org.codehaus.groovy
-        *:*_2.10
+        *:*_2.11
```

@srowen Can you take a look and see if this looks right now? Thanks!
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22967#discussion_r231781938

Diff: docs/sparkr.md
```diff
-sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
+sparkR.session()
```

Got it. BTW, are you familiar with MiMa? I still can't figure out why it's failing.
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22967#discussion_r231781635

Diff: docs/sparkr.md
```diff
-sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
+sparkR.session()
```

I am not familiar with R. Can you elaborate? Thanks.
[GitHub] spark issue #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22966

JMH is a framework for writing benchmarks that can generate standardized reports to be consumed by Jenkins. Here is an example: https://github.com/pvillega/jmh-scala-test/blob/master/src/main/scala/com/perevillega/JMHTest.scala
[GitHub] spark issue #22967: [SPARK-25956] Make Scala 2.12 as default Scala version i...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22967

retest this please
[GitHub] spark issue #22970: [SPARK-25676][FOLLOWUP][BUILD] Fix Scala 2.12 build erro...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22970

Merged into master now that the compilation has finished. Thanks!
[GitHub] spark issue #22970: [SPARK-25676][FOLLOWUP][BUILD] Fix Scala 2.12 build erro...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22970

LGTM. Thanks.
[GitHub] spark issue #22966: [SPARK-25965][SQL][TEST] Add avro read benchmark
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22966

cc @jleach4 and @aokolnychyi. We have had great success using [jmh](http://openjdk.java.net/projects/code-tools/jmh/) for this type of benchmarking; the benchmarks can be written as unit tests. The framework handles JVM warm-up, computes latency, throughput, etc., and then generates reports that can be consumed in Jenkins. We also use Jenkins to visualize performance trends, which is very useful for finding regressions.
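A minimal JMH benchmark sketch along the lines described (hypothetical names; assumes the sbt-jmh plugin is configured and is not tied to the Avro reader in this PR):

```scala
import java.util.concurrent.TimeUnit

import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.Throughput))
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 3)      // JMH handles JVM warm-up for us
@Measurement(iterations = 5) // measured iterations feed the report
@Fork(1)
class ReadBenchmark {
  private val data: Array[Long] = Array.tabulate(1 << 16)(_.toLong)

  @Benchmark
  def sumScan(): Long = {
    // A stand-in workload; a real benchmark would scan Avro files here.
    var i = 0
    var acc = 0L
    while (i < data.length) { acc += data(i); i += 1 }
    acc
  }
}
```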
[GitHub] spark issue #22967: [SPARK-25956] Make Scala 2.12 as default Scala version i...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22967

@dongjoon-hyun Yeah, it seems https://github.com/apache/spark/commit/63ca4bbe792718029f6d6196e8a6bb11d1f20fca broke the Scala 2.12 build. I'll re-trigger the build once the Scala 2.12 build is fixed. Thanks.
[GitHub] spark pull request #22967: [SPARK-25956] Make Scala 2.12 as default Scala ve...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/22967

[SPARK-25956] Make Scala 2.12 as default Scala version in Spark 3.0

## What changes were proposed in this pull request?

This PR makes Scala 2.12 Spark's default Scala version, with Scala 2.11 as the alternative version. This implies that Scala 2.12 will be used by our CI builds, including pull request builds. We'll update Jenkins to include a new compile-only job for Scala 2.11 to ensure the code can still be compiled with Scala 2.11.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark scala2.12

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22967.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22967

- commit 635e6e23c5066018fd656738c51d02df8130585e (DB Tsai, 2018-11-06T22:13:11Z): make scala 2.12 as default
- commit 5011dc07c6462e7f5a9974a0b9b28f937d678297 (DB Tsai, 2018-11-06T23:11:34Z): sbt change
- commit b4b9cb95df35b754432fb74361c32f563d1661b0 (DB Tsai, 2018-11-07T00:02:22Z): address feedback
- commit 292adb111750cfe98593f12f64ebe11067482b44 (DB Tsai, 2018-11-07T00:35:58Z): address feedback
[GitHub] spark pull request #22953: [SPARK-25946] [BUILD] Upgrade ASM to 7.x to suppo...
Github user dbtsai closed the pull request at: https://github.com/apache/spark/pull/22953
[GitHub] spark issue #22953: [SPARK-25946] [BUILD] Upgrade ASM to 7.x to support JDK1...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22953

Thanks. Merged into master.
[GitHub] spark pull request #22947: [SPARK-24913][SQL] Make AssertNotNull and AssertT...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22947#discussion_r230975661

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala
```diff
@@ -66,6 +66,8 @@ case class AssertTrue(child: Expression) extends UnaryExpression with ImplicitCastInputTypes {
   override def nullable: Boolean = true

+  override lazy val deterministic: Boolean = false
```

Because of this, I'm leaning towards creating a new flag instead of making them non-deterministic.
[GitHub] spark issue #22953: [SPARK-25946] [BUILD] Upgrade ASM to 7.x to support JDK1...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22953

ASM6 supports Java 9, while ASM7 supports Java 9, Java 10, and Java 11.
[GitHub] spark issue #22953: [SPARK-25946] [BUILD] Upgrade ASM to 7.x to support JDK1...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22953

cc @gatorsmile @srowen @HyukjinKwon
[GitHub] spark pull request #22953: [SPARK-25946] [BUILD] Upgrade ASM to 7.x to suppo...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/22953

[SPARK-25946] [BUILD] Upgrade ASM to 7.x to support JDK11

## What changes were proposed in this pull request?

Upgrade ASM to 7.x to support JDK11.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark asm7

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22953.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22953

- commit 7b19dc8616670c8db4b853e7fcfd192f8f55e09a (DB Tsai, 2018-11-05T23:03:10Z): upgrade asm to 7.0
[GitHub] spark issue #22786: [SPARK-25764][ML][EXAMPLES] Update BisectingKMeans examp...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22786

LGTM. Merged into master. Thanks!
[GitHub] spark issue #22869: [SPARK-25758][ML] Deprecate computeCost in BisectingKMea...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22869

LGTM too. Merged into master. Thanks.
[GitHub] spark issue #22919: [SPARK-25906][SHELL] Restores '-i' option's behaviour in...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22919

I'm also on @cloud-fan's side: we should keep it consistent with the upstream Scala shell. However, we should document it in `./bin/spark-shell --help`, so that when a user complains or files a ticket, we can refer them to the doc. Thanks.
[GitHub] spark issue #22857: [SPARK-25860][SQL] Replace Literal(null, _) with FalseLi...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22857

Thanks all for reviewing! The latest change looks good to me too. Merged into master.
[GitHub] spark issue #22880: [SPARK-25407][SQL] Ensure we pass a compatible pruned sc...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22880

I can confirm that this fixes https://issues.apache.org/jira/browse/SPARK-25879. cc @cloud-fan @gatorsmile. Thanks.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/21320

cc @viirya If we select a nested field and a top-level field, the schema pruning will fail. Here is a reproducible test:

```scala
testSchemaPruning("select a single complex field and a top level field") {
  val query = sql("select * from contacts")
    .select("name.middle", "address")
  query.explain(true)
  query.printSchema()
  query.show()
  checkScan(query, "struct<name:struct<middle:string>,address:string>")
}
```

and the exception is:

```
23:16:05.864 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 in stage 3.0 (TID 6)
org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files. Details:
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:193)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:674)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:850)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:850)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:325)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:289)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:121)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:419)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:425)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/private/var/folders/pr/4q3b9vkx36lbygjr5jhfmjcwgn/T/spark-a4fff68d-d51a-4c79-aa18-54cfd7f81a75/contacts/p=2/part-0-8a4d9396-7be3-4fed-a55a-5580684ebda6-c000.snappy.parquet
  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
  at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
  at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
  ... 19 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
  at java.util.ArrayList.rangeCheck(ArrayList.java:657)
  at java.util.ArrayList.get(ArrayList.java:433)
  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
  at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:97)
  at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:92)
  at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:278)
  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
  at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
  at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
  ... 24 more
```
[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22857#discussion_r228741341

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
```scala
/**
 * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further optimizations.
 *
 * For example, `Filter(Literal(null, _))` is equal to `Filter(FalseLiteral)`.
 *
 * Another example containing branches is `Filter(If(cond, FalseLiteral, Literal(null, _)))`;
 * this can be optimized to `Filter(If(cond, FalseLiteral, FalseLiteral))`, and eventually
 * `Filter(FalseLiteral)`.
 *
 * As a result, many unnecessary computations can be removed in the query optimization phase.
 *
 * Similarly, the same logic can be applied to conditions in [[Join]], predicates in [[If]],
 * conditions in [[CaseWhen]].
 */
object ReplaceNullWithFalse extends Rule[LogicalPlan] {

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case f @ Filter(cond, _) => f.copy(condition = replaceNullWithFalse(cond))
    case j @ Join(_, _, _, Some(cond)) => j.copy(condition = Some(replaceNullWithFalse(cond)))
    case p: LogicalPlan => p transformExpressions {
      case i @ If(pred, _, _) => i.copy(predicate = replaceNullWithFalse(pred))
      case CaseWhen(branches, elseValue) =>
        val newBranches = branches.map { case (cond, value) =>
          replaceNullWithFalse(cond) -> value
        }
        CaseWhen(newBranches, elseValue)
    }
  }

  /**
   * Recursively replaces `Literal(null, _)` with `FalseLiteral`.
   *
   * Note that `transformExpressionsDown` can not be used here as we must stop as soon as we hit
   * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or `Literal(null, _)`.
   */
  private def replaceNullWithFalse(e: Expression): Expression = e match {
```

We only do the replacements 1) within `Join` or `Filter`, such as `Filter(If(cond, FalseLiteral, Literal(null, _)))`, or 2) for `If(Literal(null, _), trueValue, falseValue)`.
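A tiny self-contained check of the three-valued-logic fact this rule relies on (my illustration, not code from the PR): in a filter context, a condition evaluating to null (UNKNOWN) drops the row exactly as `false` does, so the substitution cannot change which rows survive.

```scala
object NullAsFalseCheck {
  def main(args: Array[String]): Unit = {
    // Per-row condition results: true, false, and null (modeled as None).
    val conditionResults = Seq(Some(true), Some(false), None)

    // Rows kept when null is treated as "not true" (SQL filter semantics).
    val keptAsIs = conditionResults.count(_.contains(true))

    // Rows kept after replacing null with false.
    val keptNullAsFalse = conditionResults.map(_.getOrElse(false)).count(identity)

    assert(keptAsIs == keptNullAsFalse) // both keep exactly one row
  }
}
```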
[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22857#discussion_r228739082

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
```scala
private def replaceNullWithFalse(e: Expression): Expression = e match {
  case cw: CaseWhen if getValues(cw).forall(isNullOrBoolean) =>
    val newBranches = cw.branches.map { case (cond, value) =>
      replaceNullWithFalse(cond) -> replaceNullWithFalse(value)
    }
    val newElseValue = cw.elseValue.map(replaceNullWithFalse)
    CaseWhen(newBranches, newElseValue)
  case If(pred, trueVal, falseVal) if Seq(trueVal, falseVal).forall(isNullOrBoolean) =>
```

Nit: elsewhere we use `trueValue` and `falseValue`.
[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22857#discussion_r228739018

Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceNullWithFalseSuite.scala
```scala
package org.apache.spark.sql.catalyst.optimizer

import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.expressions.{And, CaseWhen, Expression, GreaterThan, If, Literal, Or}
import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, TrueLiteral}
import org.apache.spark.sql.catalyst.plans.{Inner, PlanTest}
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.RuleExecutor
import org.apache.spark.sql.types.{BooleanType, IntegerType}

class ReplaceNullWithFalseSuite extends PlanTest {

  object Optimize extends RuleExecutor[LogicalPlan] {
    val batches =
      Batch("Replace null literals", FixedPoint(10),
        NullPropagation,
        ConstantFolding,
        BooleanSimplification,
        SimplifyConditionals,
        ReplaceNullWithFalse) :: Nil
  }

  private val testRelation = LocalRelation('i.int, 'b.boolean)
  private val anotherTestRelation = LocalRelation('d.int)

  test("successful replacement of null literals in filter and join conditions (1)") {
    testFilter(originalCond = Literal(null), expectedCond = FalseLiteral)
    testJoin(originalCond = Literal(null), expectedCond = FalseLiteral)
  }

  test("successful replacement of null literals in filter and join conditions (2)") {
    val originalCond = If(
      UnresolvedAttribute("i") > Literal(10),
      FalseLiteral,
      Literal(null, BooleanType))
    testFilter(originalCond, expectedCond = FalseLiteral)
    testJoin(originalCond, expectedCond = FalseLiteral)
  }

  test("successful replacement of null literals in filter and join conditions (3)") {
    val originalCond = If(
      UnresolvedAttribute("i") > Literal(10),
      TrueLiteral && Literal(null, BooleanType),
      UnresolvedAttribute("b") && Literal(null, BooleanType))
    testFilter(originalCond, expectedCond = FalseLiteral)
    testJoin(originalCond, expectedCond = FalseLiteral)
  }

  test("successful replacement of null literals in filter and join conditions (4)") {
    val branches = Seq(
      (UnresolvedAttribute("i") < Literal(10)) -> TrueLiteral,
      (UnresolvedAttribute("i") > Literal(40)) -> FalseLiteral)
    val originalCond = CaseWhen(branches, Literal(null, BooleanType))
    val expectedCond = CaseWhen(branches, FalseLiteral)
    testFilter(originalCond, expectedCond)
    testJoin(originalCond, expectedCond)
  }

  test("successful replacement of null literals in filter and join conditions (5)") {
    val branches = Seq(
      (UnresolvedAttribute("i") < Literal(10)) -> Literal(null, BooleanType),
      (UnresolvedAttribute("i") > Literal(40)) -> FalseLiteral)
    val originalCond = CaseWhen(branches, Literal(null))
    testFilter(originalCond, expectedCond = FalseLiteral)
    testJoin(originalCond, expectedCond = FalseLiteral)
  }

  test("successful replacement of null literals in filter and join conditions (6)") {
    val originalBranches = Seq(
      (UnresolvedAttribute("i") < Literal(10)) ->
        If(UnresolvedAttribute("i") < Literal(20), Literal(null, BooleanType), FalseLiteral)
```
[GitHub] spark issue #22857: [SPARK-25860][SQL] Replace Literal(null, _) with FalseLi...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22857

LGTM. @cloud-fan and @gatorsmile, this is the PR I mentioned to you earlier this year at the Spark Summit in SF; it can simplify some of our queries. Also adding @dongjoon-hyun and @viirya. Thanks.
[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22857#discussion_r228738623

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
```scala
object ReplaceNullWithFalse extends Rule[LogicalPlan] {

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case f @ Filter(cond, _) => f.copy(condition = replaceNullWithFalse(cond))
    case j @ Join(_, _, _, Some(cond)) => j.copy(condition = Some(replaceNullWithFalse(cond)))
    case p: LogicalPlan => p transformExpressions {
      case i @ If(pred, _, _) => i.copy(predicate = replaceNullWithFalse(pred))
      case CaseWhen(branches, elseValue) =>
        val newBranches = branches.map { case (cond, value) =>
          replaceNullWithFalse(cond) -> value
        }
        CaseWhen(newBranches, elseValue)
    }
  }
```

Nit:
```scala
case cw @ CaseWhen(branches, _) =>
  ...
  cw.copy(branches = newBranches)
```
[GitHub] spark issue #22839: [SPARK-25656][SQL][DOC][EXAMPLE][BRANCH-2.4] Add a doc a...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22839

Thanks, @dongjoon-hyun. This LGTM!
[GitHub] spark issue #22801: [SPARK-25656][SQL][DOC][EXAMPLE] Add a doc and examples ...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22801

This LGTM.
[GitHub] spark issue #22788: [SPARK-25769][SQL]escape nested columns by backtick each...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22788

@cloud-fan I like the idea of using JSON, but that would also change the definition of the string format. Do we use JSON only for the nested case, so that existing data sources don't have to be changed?
[GitHub] spark issue #22788: [SPARK-25769][SQL]escape nested columns by backtick each...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22788

@cloud-fan @dongjoon-hyun instead of changing the `Filter` API, do you think using a proper escape character, as done in https://github.com/apache/spark/pull/22573, is a good approach?
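A sketch of the backtick-escaping idea under discussion (an illustrative helper, not the API from either PR): quote each name part and double any embedded backticks, so the nested field `a`.`b` stays distinguishable from a top-level column literally named "a.b".

```scala
object ColumnPath {
  // Quote one name part; backticks inside the name are doubled, as in SQL identifiers.
  private def quote(part: String): String = "`" + part.replace("`", "``") + "`"

  // Join quoted parts with dots to form an unambiguous path string.
  def quoted(parts: Seq[String]): String = parts.map(quote).mkString(".")
}

// ColumnPath.quoted(Seq("name", "first")) == "`name`.`first`"   (nested field)
// ColumnPath.quoted(Seq("col.dot"))       == "`col.dot`"        (dotted top-level column)
```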
[GitHub] spark pull request #22597: [SPARK-25579][SQL] Use quoted attribute names if ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22597#discussion_r225309479

Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
```scala
test("SPARK-25579 ORC PPD should support column names with dot") {
  import testImplicits._

  withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
    withTempDir { dir =>
      val path = new File(dir, "orc").getCanonicalPath
      Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
      val df = spark.read.orc(path).where("`col.dot.1` = 1 and `col.dot.2` = 2")
      checkAnswer(stripSparkFilter(df), Row(1, 2))
```

How do we generalize this to nested cases? The parent struct can contain dots as well.
[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22597

In `ParquetFilter`, the way we test whether a predicate pushdown works is by removing that predicate from the Spark SQL physical plan and relying only on the reader to do the filtering. Thus, if there is a bug in the reader's pushdown filter, Spark will get incorrect results. This can be used in tests to ensure there is no regression later.
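The testing trick described here, shown concretely (this mirrors the `stripSparkFilter` helper in Spark's `SQLTestUtils`; treat it as a sketch, since `internalCreateDataFrame` is package-private to `org.apache.spark.sql` and the real helper lives inside that package):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.FilterExec

// Drop all FilterExec nodes from the physical plan, then re-run it: any filtering
// still observed must have happened inside the data source reader itself.
def stripSparkFilter(spark: SparkSession, df: DataFrame): DataFrame = {
  val schema = df.schema
  val withoutFilters = df.queryExecution.sparkPlan.transform {
    case FilterExec(_, child) => child
  }
  spark.internalCreateDataFrame(withoutFilters.execute(), schema)
}
```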
[GitHub] spark issue #22597: [SPARK-25579][SQL] Use quoted attribute names if needed ...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22597

Is it possible to add tests like Parquet's, which remove the filter from the Spark SQL plan to ensure that the predicate is pushed down to the reader? Thanks.
[GitHub] spark issue #22664: [SPARK-25662][SQL][TEST] Refactor DataSourceReadBenchmar...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22664

@peter-toth I assigned it to you. Thanks for the contribution.
[GitHub] spark issue #22664: [SPARK-25662][SQL][TEST] Refactor DataSourceReadBenchmar...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22664

Thanks @dongjoon-hyun for pinging me. LGTM too. We're working on some Parquet reader improvements, and this will be useful.
[GitHub] spark issue #22684: [SPARK-25699][SQL] Partially push down conjunctive predi...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22684

Merged into master. Thanks.
[GitHub] spark issue #22684: [SPARK-25699][SQL] Partially push down conjunctive predi...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22684

LGTM. Just some styling feedback. Thanks.
[GitHub] spark pull request #22684: [SPARK-25699][SQL] Partially push down conjunctiv...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22684#discussion_r224179579

Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala
```diff
@@ -90,32 +107,51 @@ private[orc] object OrcFilters extends Logging {
     expression match {
       case And(left, right) =>
-        // At here, it is not safe to just convert one side if we do not understand the
-        // other side. Here is an example used to explain the reason.
+        // At here, it is not safe to just convert one side and remove the other side
+        // if we do not understand what the parent filters are.
+        //
+        // Here is an example used to explain the reason.
         // Let's say we have NOT(a = 2 AND b in ('1')) and we do not understand how to
         // convert b in ('1'). If we only convert a = 2, we will end up with a filter
         // NOT(a = 2), which will generate wrong results.
-        // Pushing one side of AND down is only safe to do at the top level.
-        // You can see ParquetRelation's initializeLocalJobFunc method as an example.
-        for {
-          _ <- buildSearchArgument(dataTypeMap, left, newBuilder)
-          _ <- buildSearchArgument(dataTypeMap, right, newBuilder)
-          lhs <- buildSearchArgument(dataTypeMap, left, builder.startAnd())
-          rhs <- buildSearchArgument(dataTypeMap, right, lhs)
-        } yield rhs.end()
+        //
+        // Pushing one side of AND down is only safe to do at the top level or in the child
+        // AND before hitting NOT or OR conditions, and in this case, the unsupported predicate
+        // can be safely removed.
+        val leftBuilderOption = createBuilder(dataTypeMap, left,
+          newBuilder, canPartialPushDownConjuncts)
+        val rightBuilderOption =
+          createBuilder(dataTypeMap, right, newBuilder, canPartialPushDownConjuncts)
+        (leftBuilderOption, rightBuilderOption) match {
+          case (Some(_), Some(_)) =>
+            for {
+              lhs <- createBuilder(dataTypeMap, left,
+                builder.startAnd(), canPartialPushDownConjuncts)
+              rhs <- createBuilder(dataTypeMap, right, lhs, canPartialPushDownConjuncts)
+            } yield rhs.end()
+
+          case (Some(_), None) if canPartialPushDownConjuncts =>
+            createBuilder(dataTypeMap, left, builder, canPartialPushDownConjuncts)
+
+          case (None, Some(_)) if canPartialPushDownConjuncts =>
+            createBuilder(dataTypeMap, right, builder, canPartialPushDownConjuncts)
+
+          case _ => None
+        }

       case Or(left, right) =>
         for {
-          _ <- buildSearchArgument(dataTypeMap, left, newBuilder)
-          _ <- buildSearchArgument(dataTypeMap, right, newBuilder)
-          lhs <- buildSearchArgument(dataTypeMap, left, builder.startOr())
-          rhs <- buildSearchArgument(dataTypeMap, right, lhs)
+          _ <- createBuilder(dataTypeMap, left, newBuilder, canPartialPushDownConjuncts = false)
+          _ <- createBuilder(dataTypeMap, right, newBuilder, canPartialPushDownConjuncts = false)
+          lhs <- createBuilder(dataTypeMap, left,
+            builder.startOr(), canPartialPushDownConjuncts = false)
+          rhs <- createBuilder(dataTypeMap, right, lhs, canPartialPushDownConjuncts = false)
         } yield rhs.end()

       case Not(child) =>
         for {
-          _ <- buildSearchArgument(dataTypeMap, child, newBuilder)
-          negate <- buildSearchArgument(dataTypeMap, child, builder.startNot())
+          _ <- createBuilder(dataTypeMap, child, newBuilder, canPartialPushDownConjuncts = false)
+          negate <- createBuilder(dataTypeMap, child, builder.startNot(), false)
```

ditto
[GitHub] spark pull request #22684: [SPARK-25699][SQL] Partially push down conjunctiv...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22684#discussion_r224179447 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFilters.scala --- @@ -90,32 +107,51 @@ private[orc] object OrcFilters extends Logging { expression match { case And(left, right) => -// At here, it is not safe to just convert one side if we do not understand the -// other side. Here is an example used to explain the reason. +// At here, it is not safe to just convert one side and remove the other side +// if we do not understand what the parent filters are. +// +// Here is an example used to explain the reason. // Let's say we have NOT(a = 2 AND b in ('1')) and we do not understand how to // convert b in ('1'). If we only convert a = 2, we will end up with a filter // NOT(a = 2), which will generate wrong results. -// Pushing one side of AND down is only safe to do at the top level. -// You can see ParquetRelation's initializeLocalJobFunc method as an example. -for { - _ <- buildSearchArgument(dataTypeMap, left, newBuilder) - _ <- buildSearchArgument(dataTypeMap, right, newBuilder) - lhs <- buildSearchArgument(dataTypeMap, left, builder.startAnd()) - rhs <- buildSearchArgument(dataTypeMap, right, lhs) -} yield rhs.end() +// +// Pushing one side of AND down is only safe to do at the top level or in the child +// AND before hitting NOT or OR conditions, and in this case, the unsupported predicate +// can be safely removed. +val leftBuilderOption = createBuilder(dataTypeMap, left, + newBuilder, canPartialPushDownConjuncts) +val rightBuilderOption = + createBuilder(dataTypeMap, right, newBuilder, canPartialPushDownConjuncts) +(leftBuilderOption, rightBuilderOption) match { --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22684: [SPARK-25699][SQL] Partially push down conjunctiv...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22684#discussion_r224178237 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala --- @@ -138,39 +138,75 @@ private[sql] object OrcFilters { dataTypeMap: Map[String, DataType], expression: Filter, builder: Builder): Option[Builder] = { +createBuilder(dataTypeMap, expression, builder, canPartialPushDownConjuncts = true) + } + + /** + * @param dataTypeMap a map from the attribute name to its data type. + * @param expression the input filter predicates. + * @param builder the input SearchArgument.Builder. + * @param canPartialPushDownConjuncts whether a subset of conjuncts of predicates can be pushed + *down safely. Pushing ONLY one side of AND down is safe to + *do at the top level or none of its ancestors is NOT and OR. + * @return the builder so far. + */ + private def createBuilder( + dataTypeMap: Map[String, DataType], + expression: Filter, + builder: Builder, + canPartialPushDownConjuncts: Boolean): Option[Builder] = { def getType(attribute: String): PredicateLeaf.Type = getPredicateLeafType(dataTypeMap(attribute)) import org.apache.spark.sql.sources._ expression match { case And(left, right) => -// At here, it is not safe to just convert one side if we do not understand the -// other side. Here is an example used to explain the reason. +// At here, it is not safe to just convert one side and remove the other side +// if we do not understand what the parent filters are. +// +// Here is an example used to explain the reason. // Let's say we have NOT(a = 2 AND b in ('1')) and we do not understand how to // convert b in ('1'). If we only convert a = 2, we will end up with a filter // NOT(a = 2), which will generate wrong results. -// Pushing one side of AND down is only safe to do at the top level. -// You can see ParquetRelation's initializeLocalJobFunc method as an example. -for { - _ <- buildSearchArgument(dataTypeMap, left, newBuilder) - _ <- buildSearchArgument(dataTypeMap, right, newBuilder) - lhs <- buildSearchArgument(dataTypeMap, left, builder.startAnd()) - rhs <- buildSearchArgument(dataTypeMap, right, lhs) -} yield rhs.end() +// +// Pushing one side of AND down is only safe to do at the top level or in the child +// AND before hitting NOT or OR conditions, and in this case, the unsupported predicate +// can be safely removed. 
+val leftBuilderOption = createBuilder(dataTypeMap, left, + newBuilder, canPartialPushDownConjuncts) +val rightBuilderOption = + createBuilder(dataTypeMap, right, newBuilder, canPartialPushDownConjuncts) +(leftBuilderOption, rightBuilderOption) match { + case (Some(_), Some(_)) => +for { + lhs <- createBuilder(dataTypeMap, left, +builder.startAnd(), canPartialPushDownConjuncts) + rhs <- createBuilder(dataTypeMap, right, lhs, canPartialPushDownConjuncts) +} yield rhs.end() + + case (Some(_), None) if canPartialPushDownConjuncts => +createBuilder(dataTypeMap, left, builder, canPartialPushDownConjuncts) + + case (None, Some(_)) if canPartialPushDownConjuncts => +createBuilder(dataTypeMap, right, builder, canPartialPushDownConjuncts) + + case _ => None +} case Or(left, right) => for { - _ <- buildSearchArgument(dataTypeMap, left, newBuilder) - _ <- buildSearchArgument(dataTypeMap, right, newBuilder) - lhs <- buildSearchArgument(dataTypeMap, left, builder.startOr()) - rhs <- buildSearchArgument(dataTypeMap, right, lhs) + _ <- createBuilder(dataTypeMap, left, newBuilder, canPartialPushDownConjuncts = false) + _ <- createBuilder(dataTypeMap, right, newBuilder, canPartialPushDownConjuncts = false) + lhs <- createBuilder(dataTypeMap, left, +builder.startOr(), canPartialPushDownConjuncts = false) + rhs <- createBuilder(dataTypeMap, right, lhs, canPartialPushDownConjuncts = false) } yield rhs.end() case Not(child) => for { - _ <- buildSearchArgument(dataTypeMa
[GitHub] spark pull request #22684: [SPARK-25699][SQL] Partially push down conjunctiv...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22684#discussion_r224174206 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala --- @@ -138,39 +138,75 @@ private[sql] object OrcFilters { dataTypeMap: Map[String, DataType], expression: Filter, builder: Builder): Option[Builder] = { +createBuilder(dataTypeMap, expression, builder, canPartialPushDownConjuncts = true) + } + + /** + * @param dataTypeMap a map from the attribute name to its data type. + * @param expression the input filter predicates. + * @param builder the input SearchArgument.Builder. + * @param canPartialPushDownConjuncts whether a subset of conjuncts of predicates can be pushed + *down safely. Pushing ONLY one side of AND down is safe to + *do at the top level or none of its ancestors is NOT and OR. + * @return the builder so far. + */ + private def createBuilder( + dataTypeMap: Map[String, DataType], + expression: Filter, + builder: Builder, + canPartialPushDownConjuncts: Boolean): Option[Builder] = { def getType(attribute: String): PredicateLeaf.Type = getPredicateLeafType(dataTypeMap(attribute)) import org.apache.spark.sql.sources._ expression match { case And(left, right) => -// At here, it is not safe to just convert one side if we do not understand the -// other side. Here is an example used to explain the reason. +// At here, it is not safe to just convert one side and remove the other side +// if we do not understand what the parent filters are. +// +// Here is an example used to explain the reason. // Let's say we have NOT(a = 2 AND b in ('1')) and we do not understand how to // convert b in ('1'). If we only convert a = 2, we will end up with a filter // NOT(a = 2), which will generate wrong results. -// Pushing one side of AND down is only safe to do at the top level. -// You can see ParquetRelation's initializeLocalJobFunc method as an example. -for { - _ <- buildSearchArgument(dataTypeMap, left, newBuilder) - _ <- buildSearchArgument(dataTypeMap, right, newBuilder) - lhs <- buildSearchArgument(dataTypeMap, left, builder.startAnd()) - rhs <- buildSearchArgument(dataTypeMap, right, lhs) -} yield rhs.end() +// +// Pushing one side of AND down is only safe to do at the top level or in the child +// AND before hitting NOT or OR conditions, and in this case, the unsupported predicate +// can be safely removed. +val leftBuilderOption = createBuilder(dataTypeMap, left, + newBuilder, canPartialPushDownConjuncts) +val rightBuilderOption = + createBuilder(dataTypeMap, right, newBuilder, canPartialPushDownConjuncts) --- End diff -- Can you make the format the same as `leftBuilderOption`? Also, add another empty line before `(leftBuilderOption, rightBuilderOption)`. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22679: [SPARK-25559] [FOLLOW-UP] Add comments for partial pushd...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22679 Thanks. Merged into master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22679: [SPARK-25559] [FOLLOW-UP] Add comments for partial pushd...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22679 LGTM. Wait for the PR build. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22574: [SPARK-25559][SQL] Remove the unsupported predicates in ...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22574 I changed the title; hopefully it's much clearer now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22573 I was thinking of changing the APIs in `Filter` so we can represent nested fields more easily, but I also realized that it's a stable public interface. Without changing the interface of `Filter`, we have the following two options: 1. Use backticks to wrap column names and struct field names that contain dots. For example, ```scala `column.1`.`attribute.b` ``` This is also easier for people to understand when they read pushdown plans in text format. 2. Alternatively, we can use ASCII-delimited text to avoid delimiter collision; for example, `\31` (the ASCII unit separator) is commonly used between the fields of a record or the members of a row. This simplifies parsing significantly, but the downside is that it's not human-readable, so when we print the plan we need to add the backticks back for visualization. What do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
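A quick sketch of what option 1's parsing could look like. This is purely illustrative and not code from the PR; `splitColumnPath` is a hypothetical helper that treats backtick-quoted segments as atomic, so dots inside backticks are not path delimiters.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: split a pushed-down column path into its parts,
// honoring backtick quoting so that `column.1`.`attribute.b` yields two
// segments rather than four.
def splitColumnPath(path: String): Seq[String] = {
  val parts = ArrayBuffer.empty[String]
  val current = new StringBuilder
  var inBackticks = false
  path.foreach {
    case '`' =>
      inBackticks = !inBackticks        // toggle quoted segment
    case '.' if !inBackticks =>         // an unquoted dot ends a field name
      parts += current.toString
      current.clear()
    case c =>
      current += c
  }
  parts += current.toString
  parts.toSeq
}

splitColumnPath("`column.1`.`attribute.b`")   // Seq("column.1", "attribute.b")
splitColumnPath("employer.company.name")      // Seq("employer", "company", "name")

// Option 2 makes the split trivial at the cost of readability:
"column.1\u001Fattribute.b".split('\u001F')   // Array("column.1", "attribute.b")
```

Either way, the data source still sees a single string attribute name, so no change to the stable `Filter` interface is required.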
[GitHub] spark pull request #22574: [SPARK-25559][SQL] Just remove the unsupported pr...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22574#discussion_r221374514 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -488,26 +494,27 @@ private[parquet] class ParquetFilters( .map(_(nameToParquetField(name).fieldName, value)) case sources.And(lhs, rhs) => -// At here, it is not safe to just convert one side if we do not understand the -// other side. Here is an example used to explain the reason. -// Let's say we have NOT(a = 2 AND b in ('1')) and we do not understand how to -// convert b in ('1'). If we only convert a = 2, we will end up with a filter -// NOT(a = 2), which will generate wrong results. -// Pushing one side of AND down is only safe to do at the top level. -// You can see ParquetRelation's initializeLocalJobFunc method as an example. --- End diff -- addressed and added more tests. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22574: [SPARK-25559][SQL] Just remove the unsupported predicate...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22574 test this again. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22574: [SPARK-25559][SQL] Just remove the unsupported pr...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22574#discussion_r221153414 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -488,26 +494,25 @@ private[parquet] class ParquetFilters( .map(_(nameToParquetField(name).fieldName, value)) case sources.And(lhs, rhs) => -// At here, it is not safe to just convert one side if we do not understand the -// other side. Here is an example used to explain the reason. -// Let's say we have NOT(a = 2 AND b in ('1')) and we do not understand how to -// convert b in ('1'). If we only convert a = 2, we will end up with a filter -// NOT(a = 2), which will generate wrong results. -// Pushing one side of AND down is only safe to do at the top level. -// You can see ParquetRelation's initializeLocalJobFunc method as an example. -for { - lhsFilter <- createFilter(schema, lhs) - rhsFilter <- createFilter(schema, rhs) -} yield FilterApi.and(lhsFilter, rhsFilter) +// If the unsupported predicate is in the top level `And` condition or in the child +// `And` condition before hitting `Not` or `Or` condition, it can be safely removed. +(createFilterHelper(nameToParquetField, lhs, canRemoveOneSideInAnd = true), --- End diff -- Addressed. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22574: [SPARK-25559][SQL] Just remove the unsupported pr...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22574#discussion_r221152340 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -488,26 +494,25 @@ private[parquet] class ParquetFilters( .map(_(nameToParquetField(name).fieldName, value)) case sources.And(lhs, rhs) => -// At here, it is not safe to just convert one side if we do not understand the -// other side. Here is an example used to explain the reason. -// Let's say we have NOT(a = 2 AND b in ('1')) and we do not understand how to -// convert b in ('1'). If we only convert a = 2, we will end up with a filter -// NOT(a = 2), which will generate wrong results. -// Pushing one side of AND down is only safe to do at the top level. -// You can see ParquetRelation's initializeLocalJobFunc method as an example. -for { - lhsFilter <- createFilter(schema, lhs) - rhsFilter <- createFilter(schema, rhs) -} yield FilterApi.and(lhsFilter, rhsFilter) +// If the unsupported predicate is in the top level `And` condition or in the child +// `And` condition before hitting `Not` or `Or` condition, it can be safely removed. +(createFilterHelper(nameToParquetField, lhs, canRemoveOneSideInAnd = true), + createFilterHelper(nameToParquetField, rhs, canRemoveOneSideInAnd = true)) match { --- End diff -- Thanks for catching this. I just fixed it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22573: [SPARK-25558][SQL] Pushdown predicates for nested...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22573#discussion_r221128544 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala --- @@ -437,53 +436,65 @@ object DataSourceStrategy { * @return a `Some[Filter]` if the input [[Expression]] is convertible, otherwise a `None`. */ protected[sql] def translateFilter(predicate: Expression): Option[Filter] = { +// Recursively try to find an attribute name from the top level that can be pushed down. +def attrName(e: Expression): Option[String] = e match { + // In Spark and many data sources such as parquet, dots are used as a column path delimiter; + // thus, we don't translate such expressions. + case a: Attribute if !a.name.contains(".") => +Some(a.name) --- End diff -- Do we have any data source that currently supports pushdown on column names containing a dot? The worst case is just no pushdown for those data sources. I know ORC doesn't work for now; we can address this in another follow-up PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
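For context, here is a self-contained sketch of how the recursion quoted above extends to struct accesses. The `GetStructField` case is inferred from this discussion rather than quoted from the final patch, so treat it as an assumption:

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, GetStructField}

// Build the dot-joined pushdown name for a (possibly nested) column, bailing
// out with None whenever any segment itself contains a dot, since the dot is
// the path delimiter.
def attrName(e: Expression): Option[String] = e match {
  case a: Attribute if !a.name.contains(".") =>
    Some(a.name)
  case s: GetStructField if !s.childSchema(s.ordinal).name.contains(".") =>
    attrName(s.child).map(_ + "." + s.childSchema(s.ordinal).name)
  case _ =>
    None
}
```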
[GitHub] spark issue #22574: [SPARK-25556][SQL] Just remove the unsupported predicate...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22574 cc @gatorsmile @cloud-fan @HyukjinKwon @dongjoon-hyun @viirya --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22574: [SPARK-25556][SQL] Just remove the unsupported pr...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/22574 [SPARK-25556][SQL] Just remove the unsupported predicates in Parquet ## What changes were proposed in this pull request? Currently, in `ParquetFilters`, if any one of the child predicates is not supported by Parquet, the entire predicate is thrown away. In fact, if the unsupported predicate is in the top-level `And` condition, or in a child `And` before hitting a `Not` or `Or` condition, it can be safely removed. ## How was this patch tested? Tests are added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dbtsai/spark removeUnsupportedPredicatesInParquet Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22574.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22574 commit d49d63bc40a7752990583f9afbd10c68025510b3 Author: DB Tsai Date: 2018-09-27T22:12:44Z Remove unsupported predicates in parquet --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
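To make the safety rule concrete, here is a minimal, self-contained sketch of the conversion logic on a toy predicate ADT — not Spark's actual `ParquetFilters` code. An unsupported conjunct may be dropped only while no `Not` or `Or` sits above the current node:

```scala
sealed trait Pred
case class And(l: Pred, r: Pred) extends Pred
case class Or(l: Pred, r: Pred) extends Pred
case class Not(c: Pred) extends Pred
case class Leaf(supported: Boolean) extends Pred

// canDropOneSide is true at the top level and through chains of Ands; it
// flips to false as soon as we recurse under Or or Not.
def convert(p: Pred, canDropOneSide: Boolean = true): Option[Pred] = p match {
  case And(l, r) =>
    (convert(l, canDropOneSide), convert(r, canDropOneSide)) match {
      case (Some(lc), Some(rc)) => Some(And(lc, rc))
      case (Some(lc), None) if canDropOneSide => Some(lc)  // drop unsupported conjunct
      case (None, Some(rc)) if canDropOneSide => Some(rc)
      case _ => None
    }
  case Or(l, r) =>
    for {
      lc <- convert(l, canDropOneSide = false)
      rc <- convert(r, canDropOneSide = false)
    } yield Or(lc, rc)
  case Not(c) =>
    convert(c, canDropOneSide = false).map(Not(_))
  case leaf @ Leaf(supported) =>
    if (supported) Some(leaf) else None
}

convert(Not(And(Leaf(true), Leaf(false))))  // None: NOT(a AND unsupported) must not become NOT(a)
convert(And(Leaf(true), Leaf(false)))       // Some(Leaf(true)): safe at the top level
```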
[GitHub] spark issue #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdown in ne...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22535 I'm breaking this PR into three smaller PRs. I'll fix the tests in those smaller PRs. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22573 @gatorsmile @cloud-fan @dongjoon-hyun @viirya --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22573: [SPARK-25558][SQL] Pushdown predicates for nested...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/22573 [SPARK-25558][SQL] Pushdown predicates for nested fields in DataSource Strategy ## What changes were proposed in this pull request? This PR allows Spark to create predicates for nested fields, and it's a building block for Parquet and ORC to support nested predicate pushdown. ## How was this patch tested? Tests added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dbtsai/spark dataSourcePredicate Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22573.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22573 commit 2f21842d4676993d0d28abb6297796c672186f53 Author: DB Tsai Date: 2018-09-27T18:24:38Z DataSourceStrategy --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
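As a concrete illustration of the goal — assuming the dot-joined naming convention discussed elsewhere in this thread, which is not guaranteed to be the final encoding — a nested predicate surfaces to a data source as ordinary `Filter`s whose attribute name is a path:

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter, IsNotNull}

// `employer.id IS NOT NULL AND employer.id = 1` as pushed data source filters,
// with the nested column encoded as a dot-joined path (illustrative only).
val pushed: Seq[Filter] = Seq(IsNotNull("employer.id"), EqualTo("employer.id", 1))
```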
[GitHub] spark pull request #22535: [SPARK-17636][SQL][WIP] Parquet predicate pushdow...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/22535 [SPARK-17636][SQL][WIP] Parquet predicate pushdown in nested fields ## What changes were proposed in this pull request? Support Parquet predicate pushdown in nested fields ## How was this patch tested? Existing tests and new tests are added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dbtsai/spark parquetNesting Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22535.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22535 commit c95706f60e4d576caca78a32000d4a7bbb12c141 Author: DB Tsai Date: 2018-09-06T00:22:09Z Nested parquet pushdown --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22418: [SPARK-25427][SQL][TEST] Add BloomFilter creation...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22418#discussion_r218272427 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala --- @@ -50,6 +55,66 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll { .createOrReplaceTempView("orc_temp_table") } + protected def testBloomFilterCreation(bloomFilterKind: Kind) { +val tableName = "bloomFilter" + +withTempDir { dir => + withTable(tableName) { +val sqlStatement = orcImp match { + case "native" => +s""" + |CREATE TABLE $tableName (a INT, b STRING) + |USING ORC + |OPTIONS ( + | path '${dir.toURI}', + | orc.bloom.filter.columns '*', + | orc.bloom.filter.fpp 0.1 + |) +""".stripMargin + case "hive" => +s""" + |CREATE TABLE $tableName (a INT, b STRING) + |STORED AS ORC + |LOCATION '${dir.toURI}' + |TBLPROPERTIES ( + | orc.bloom.filter.columns='*', + | orc.bloom.filter.fpp=0.1 + |) +""".stripMargin + case impl => +throw new UnsupportedOperationException(s"Unknown ORC implementation: $impl") +} + +sql(sqlStatement) +sql(s"INSERT INTO $tableName VALUES (1, 'str')") + +val partFiles = dir.listFiles() + .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_")) +assert(partFiles.length === 1) + +val orcFilePath = new Path(partFiles.head.getAbsolutePath) +val readerOptions = OrcFile.readerOptions(new Configuration()) +val reader = OrcFile.createReader(orcFilePath, readerOptions) +var recordReader: RecordReaderImpl = null +try { + recordReader = reader.rows.asInstanceOf[RecordReaderImpl] + + // BloomFilter array is created for all types; `struct`, int (`a`), string (`b`) + val sargColumns = Array(true, true, true) + val orcIndex = recordReader.readRowIndex(0, null, sargColumns) + + // Check the types and counts of bloom filters + assert(orcIndex.getBloomFilterKinds.forall(_ === bloomFilterKind)) --- End diff -- Something like ``` == Physical Plan == *(1) Project [_1#3] +- *(1) Filter (isnotnull(_1#3) && (_1#3._1 = true)) +- *(1) FileScan parquet [_1#3] Batched: false, Format: Orc, PushedFilters: [IsNotNull(_1), EqualTo(_1._1,true)] BloomFilters: [some information] ``` Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22418: [SPARK-25427][SQL][TEST] Add BloomFilter creation...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22418#discussion_r218158845 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala --- @@ -50,6 +55,66 @@ abstract class OrcSuite extends OrcTest with BeforeAndAfterAll { .createOrReplaceTempView("orc_temp_table") } + protected def testBloomFilterCreation(bloomFilterKind: Kind) { +val tableName = "bloomFilter" + +withTempDir { dir => + withTable(tableName) { +val sqlStatement = orcImp match { + case "native" => +s""" + |CREATE TABLE $tableName (a INT, b STRING) + |USING ORC + |OPTIONS ( + | path '${dir.toURI}', + | orc.bloom.filter.columns '*', + | orc.bloom.filter.fpp 0.1 + |) +""".stripMargin + case "hive" => +s""" + |CREATE TABLE $tableName (a INT, b STRING) + |STORED AS ORC + |LOCATION '${dir.toURI}' + |TBLPROPERTIES ( + | orc.bloom.filter.columns='*', + | orc.bloom.filter.fpp=0.1 + |) +""".stripMargin + case impl => +throw new UnsupportedOperationException(s"Unknown ORC implementation: $impl") +} + +sql(sqlStatement) +sql(s"INSERT INTO $tableName VALUES (1, 'str')") + +val partFiles = dir.listFiles() + .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_")) +assert(partFiles.length === 1) + +val orcFilePath = new Path(partFiles.head.getAbsolutePath) +val readerOptions = OrcFile.readerOptions(new Configuration()) +val reader = OrcFile.createReader(orcFilePath, readerOptions) +var recordReader: RecordReaderImpl = null +try { + recordReader = reader.rows.asInstanceOf[RecordReaderImpl] + + // BloomFilter array is created for all types; `struct`, int (`a`), string (`b`) + val sargColumns = Array(true, true, true) + val orcIndex = recordReader.readRowIndex(0, null, sargColumns) + + // Check the types and counts of bloom filters + assert(orcIndex.getBloomFilterKinds.forall(_ === bloomFilterKind)) --- End diff -- It seems the test here creates ORC files with bloom filters using Spark options and reads them back through the native ORC reader. Is there a plan to add an optimizer rule in Spark to surface this in the physical plan, like predicate pushdown in Parquet? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
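For reference, the same bloom filter files can presumably be produced through the DataFrame writer rather than DDL; this sketch assumes a `df` in scope and reuses the two ORC option keys from the test above:

```scala
// Write ORC with bloom filters on all columns at a 10% false-positive rate.
df.write
  .option("orc.bloom.filter.columns", "*")
  .option("orc.bloom.filter.fpp", "0.1")
  .orc("/tmp/orc_with_bloom_filters")
```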
[GitHub] spark issue #22431: [SPARK-24418][FOLLOWUP][DOC] Update docs to show Scala 2...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22431 Thanks! Merged into both branch 2.4 and master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22394: [SPARK-25406][SQL] For ParquetSchemaPruningSuite.scala, ...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22394 LGTM. Merged into master and branch 2.4. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22409: [SPARK-25352][SQL][Followup] Add helper method and addre...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22409 LGTM. Wait for the test. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22344: [SPARK-25352][SQL] Perform ordered global limit w...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22344#discussion_r217128163 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala --- @@ -68,22 +68,42 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] { object SpecialLimits extends Strategy { override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match { case ReturnAnswer(rootPlan) => rootPlan match { -case Limit(IntegerLiteral(limit), Sort(order, true, child)) -if limit < conf.topKSortFallbackThreshold => - TakeOrderedAndProjectExec(limit, order, child.output, planLater(child)) :: Nil -case Limit(IntegerLiteral(limit), Project(projectList, Sort(order, true, child))) -if limit < conf.topKSortFallbackThreshold => - TakeOrderedAndProjectExec(limit, order, projectList, planLater(child)) :: Nil +case Limit(IntegerLiteral(limit), s@Sort(order, true, child)) => + if (limit < conf.topKSortFallbackThreshold) { --- End diff -- Also, please add a space between `s` and `@`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22357 Thanks all again. Merged into 2.4 branch and master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22357 LGTM. Thank you all for participating in the discussion. @cloud-fan and @gatorsmile, do you have any further comments? If not, I would like to merge it tomorrow into both master and the RC branch, as it's an important performance fix for schema pruning. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216776055 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { val projectionRootFields = projects.flatMap(getRootFields) val filterRootFields = filters.flatMap(getRootFields) -(projectionRootFields ++ filterRootFields).distinct +// Kind of expressions don't need to access any fields of a root fields, e.g., `IsNotNull`. +// For them, if there are any nested fields accessed in the query, we don't need to add root +// field access of above expressions. +// For example, for a query `SELECT name.first FROM contacts WHERE name IS NOT NULL`, +// we don't need to read nested fields of `name` struct other than `first` field. --- End diff -- Instead of reading an arbitrary field of simple type (which may not exist if it's a deeply nested struct), I think we should implement pushdown for complex types in Parquet with similar logic and let the Parquet reader handle it. @viirya Can you create a follow-up JIRA for this? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22357 FYI, @mallman, I'm working on having `ParquetFilter` support `IsNotNull(employer.id)` so it can be pushed into the Parquet reader. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216559045 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { val projectionRootFields = projects.flatMap(getRootFields) val filterRootFields = filters.flatMap(getRootFields) -(projectionRootFields ++ filterRootFields).distinct +// Kind of expressions don't need to access any fields of a root fields, e.g., `IsNotNull`. +// For them, if there are any nested fields accessed in the query, we don't need to add root +// field access of above expressions. +// For example, for a query `SELECT name.first FROM contacts WHERE name IS NOT NULL`, +// we don't need to read nested fields of `name` struct other than `first` field. --- End diff -- For the first query, the constraint is `employer is not null`. When `employer.id` is not `null`, `employer` is always non-`null`; as a result, this PR works. However, when `employer.id` is `null`, `employer` can be either `null` or non-`null`, so we need to check whether `employer` is non-`null` in order to return a `null` for `employer.id`. I checked in `ParquetFilter`: `IsNotNull(employer)` is ignored since it's not a valid Parquet filter, because Parquet doesn't support pushdown on structs; thus, with this PR, this query would return wrong answers. I think in this scenario, as @mallman suggested, we might need to read the full data. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
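To make the hazard concrete, here is a small hypothetical reproduction — a local SparkSession is assumed, and the field names echo this discussion rather than the suite's exact fixtures:

```scala
import org.apache.spark.sql.SparkSession

case class Employer(id: Option[Int])
case class Contact(name: String, employer: Option[Employer])

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

Seq(
  Contact("a", None),                 // employer IS NULL
  Contact("b", Some(Employer(None)))  // employer present, employer.id IS NULL
).toDF().write.mode("overwrite").parquet("/tmp/contacts")

// Must return exactly one row, Row(null), for contact "b". If pruning made
// the scan read only employer.id, both rows would come back as id = null,
// and the IS NOT NULL filter could no longer tell them apart.
spark.read.parquet("/tmp/contacts")
  .where("employer IS NOT NULL")
  .selectExpr("employer.id")
  .show()
```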
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216204022 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala --- @@ -155,6 +161,47 @@ class ParquetSchemaPruningSuite Row(null) :: Row(null) :: Nil) } + testSchemaPruning("select a single complex field and in where clause") { +val query1 = sql("select name.first from contacts where name.first = 'Jane'") +checkScan(query1, "struct<name:struct<first:string>>") +checkAnswer(query1, Row("Jane") :: Nil) + +val query2 = sql("select name.first, name.last from contacts where name.first = 'Jane'") +checkScan(query2, "struct<name:struct<first:string,last:string>>") +checkAnswer(query2, Row("Jane", "Doe") :: Nil) + +val query3 = sql("select name.first from contacts " + + "where employer.company.name = 'abc' and p = 1") --- End diff -- Let's say a user adds `where employer.company is not null`; can we still read the schema as `employer:struct<company:struct<name:string>>`, given that we only mark `contentAccessed = false` when `IsNotNull` is on an attribute? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216202879 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -110,7 +110,12 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { val projectionRootFields = projects.flatMap(getRootFields) val filterRootFields = filters.flatMap(getRootFields) -(projectionRootFields ++ filterRootFields).distinct +val (rootFields, optRootFields) = (projectionRootFields ++ filterRootFields) + .distinct.partition(_.contentAccessed) --- End diff -- Please add some comments here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/22357 cc @beettlle --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org