[GitHub] spark issue #21321: [SPARK-24268][SQL] Use datatype.simpleString in error me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21321 Merged build finished. Test PASSed.
[GitHub] spark issue #21299: [SPARK-24250][SQL] support accessing SQLConf inside task...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21299 **[Test build #90590 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90590/testReport)** for PR 21299 at commit [`01e288a`](https://github.com/apache/spark/commit/01e288a332c25dcb9cd5af8c818ae7018dd6a9bb).
[GitHub] spark pull request #21028: [SPARK-23922][SQL] Add arrays_overlap function
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21028#discussion_r187946945
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala ---
@@ -136,6 +136,59 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
     checkEvaluation(ArrayContains(a3, Literal.create(null, StringType)), null)
   }

+  test("ArraysOverlap") {
+    val a0 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType))
+    val a1 = Literal.create(Seq(4, 5, 3), ArrayType(IntegerType))
+    val a2 = Literal.create(Seq(null, 5, 6), ArrayType(IntegerType))
+    val a3 = Literal.create(Seq(7, 8), ArrayType(IntegerType))
+    val a4 = Literal.create(Seq.empty[Int], ArrayType(IntegerType))
+
+    val a5 = Literal.create(Seq[String](null, ""), ArrayType(StringType))
+    val a6 = Literal.create(Seq[String]("", "abc"), ArrayType(StringType))
+    val a7 = Literal.create(Seq[String]("def", "ghi"), ArrayType(StringType))
+
+    checkEvaluation(ArraysOverlap(a0, a1), true)
+    checkEvaluation(ArraysOverlap(a0, a2), null)
+    checkEvaluation(ArraysOverlap(a1, a2), true)
+    checkEvaluation(ArraysOverlap(a1, a3), false)
+    checkEvaluation(ArraysOverlap(a0, a4), false)
+    checkEvaluation(ArraysOverlap(a2, a4), null)
+    checkEvaluation(ArraysOverlap(a4, a2), null)
+
+    checkEvaluation(ArraysOverlap(a5, a6), true)
+    checkEvaluation(ArraysOverlap(a5, a7), null)
+    checkEvaluation(ArraysOverlap(a6, a7), false)
+
+    // null handling
+    checkEvaluation(ArraysOverlap(Literal.create(null, ArrayType(IntegerType)), a0), null)
+    checkEvaluation(ArraysOverlap(a0, Literal.create(null, ArrayType(IntegerType))), null)
+    checkEvaluation(ArraysOverlap(
+      Literal.create(Seq(null), ArrayType(IntegerType)),
+      Literal.create(Seq(null), ArrayType(IntegerType))), null)
--- End diff --
This case is covered by https://github.com/apache/spark/pull/21028/files#diff-d31eca9f1c4c33104dc2cb8950486910R163 for instance. Anyway, I am adding another one which is exactly this one.
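As a usage-level illustration of the semantics these tests pin down (a sketch of a hypothetical spark-shell session; it assumes the PR's `arrays_overlap` function is registered as named in the PR title):

```scala
spark.sql("SELECT arrays_overlap(array(1, 2, 3), array(3, 4, 5))").show()  // true: 3 is shared
spark.sql("SELECT arrays_overlap(array(1, 2), array(4, 5))").show()       // false: definite miss
// No common element, but the null could have matched, so the result is null:
spark.sql("SELECT arrays_overlap(array(1, null), array(4, 5))").show()    // null
```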
[GitHub] spark issue #20933: [SPARK-23817][SQL]Migrate ORC file format read path to d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20933 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3198/ Test PASSed.
[GitHub] spark issue #20933: [SPARK-23817][SQL]Migrate ORC file format read path to d...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20933 Merged build finished. Test PASSed.
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20894 **[Test build #90589 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90589/testReport)** for PR 20894 at commit [`21f8b10`](https://github.com/apache/spark/commit/21f8b10dda4b0ef71ba69cc6147d1cf8614812f1).
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21246 **[Test build #90588 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90588/testReport)** for PR 21246 at commit [`06b8b6c`](https://github.com/apache/spark/commit/06b8b6c9f4e7d82608d0507860b0dbe7acb6af54).
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21291 Merged build finished. Test PASSed.
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21291 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3197/ Test PASSed.
[GitHub] spark issue #21299: [SPARK-24250][SQL] support accessing SQLConf inside task...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21299 **[Test build #90587 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90587/testReport)** for PR 21299 at commit [`a100dea`](https://github.com/apache/spark/commit/a100dea9573e9b43b993516c817e306d80f72d29).
[GitHub] spark pull request #21290: [SPARK-24241][Submit]Do not fail fast when dynami...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/21290#discussion_r187943795
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -76,6 +75,7 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
   var proxyUser: String = null
   var principal: String = null
   var keytab: String = null
+  private var dynamicAllocationEnabled: String = null
--- End diff --
Wait, why not simply a boolean here? It's going to be much simpler to write.
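A minimal sketch of the Boolean alternative srowen suggests (a hypothetical snippet, not the actual patch; it assumes the flag is filled in while the merged Spark properties are loaded, and that `sparkProperties` is the arguments class's property map):

```scala
// Hypothetical sketch: keep the flag as a Boolean from the start, so later
// validation logic can branch on it directly instead of comparing a nullable
// String against "true".
private var dynamicAllocationEnabled: Boolean = false

// ...while loading defaults/--conf values into sparkProperties:
dynamicAllocationEnabled = sparkProperties
  .get("spark.dynamicAllocation.enabled")
  .exists(_.trim.equalsIgnoreCase("true"))
```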
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21320 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3196/ Test PASSed.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21320 Merged build finished. Test PASSed.
[GitHub] spark issue #21299: [SPARK-24250][SQL] support accessing SQLConf inside task...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21299 retest this please
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21317 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3099/
[GitHub] spark pull request #21246: [SPARK-23901][SQL] Add masking functions
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21246#discussion_r187941913
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/maskExpressions.scala ---
@@ -0,0 +1,569 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.commons.codec.digest.DigestUtils
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.expressions.MaskExpressionsUtils._
+import org.apache.spark.sql.catalyst.expressions.MaskLike._
+import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, CodeGenerator, ExprCode}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+
+trait MaskLike {
+  def upper: String
+  def lower: String
+  def digit: String
+
+  protected lazy val upperReplacement: Int = getReplacementChar(upper, defaultMaskedUppercase)
+  protected lazy val lowerReplacement: Int = getReplacementChar(lower, defaultMaskedLowercase)
+  protected lazy val digitReplacement: Int = getReplacementChar(digit, defaultMaskedDigit)
+
+  protected val maskUtilsClassName: String = classOf[MaskExpressionsUtils].getName
+
+  def inputStringLengthCode(inputString: String, length: String): String = {
+    s"${CodeGenerator.JAVA_INT} $length = $inputString.codePointCount(0, $inputString.length());"
+  }
+
+  def appendMaskedToStringBuilderCode(
+      ctx: CodegenContext,
+      sb: String,
+      inputString: String,
+      offset: String,
+      numChars: String): String = {
+    val i = ctx.freshName("i")
+    val codePoint = ctx.freshName("codePoint")
+    s"""
+       |for (${CodeGenerator.JAVA_INT} $i = 0; $i < $numChars; $i++) {
+       |  ${CodeGenerator.JAVA_INT} $codePoint = $inputString.codePointAt($offset);
+       |  $sb.appendCodePoint($maskUtilsClassName.transformChar($codePoint,
+       |    $upperReplacement, $lowerReplacement,
+       |    $digitReplacement, $defaultMaskedOther));
+       |  $offset += Character.charCount($codePoint);
+       |}
+     """.stripMargin
+  }
+
+  def appendUnchangedToStringBuilderCode(
+      ctx: CodegenContext,
+      sb: String,
+      inputString: String,
+      offset: String,
+      numChars: String): String = {
+    val i = ctx.freshName("i")
+    val codePoint = ctx.freshName("codePoint")
+    s"""
+       |for (${CodeGenerator.JAVA_INT} $i = 0; $i < $numChars; $i++) {
+       |  ${CodeGenerator.JAVA_INT} $codePoint = $inputString.codePointAt($offset);
+       |  $sb.appendCodePoint($codePoint);
+       |  $offset += Character.charCount($codePoint);
+       |}
+     """.stripMargin
+  }
+
+  def appendMaskedToStringBuffer(
+      sb: StringBuffer,
+      inputString: String,
+      startOffset: Int,
+      numChars: Int): Int = {
+    var offset = startOffset
+    (1 to numChars) foreach { _ =>
+      val codePoint = inputString.codePointAt(offset)
+      sb.appendCodePoint(transformChar(
+        codePoint,
+        upperReplacement,
+        lowerReplacement,
+        digitReplacement,
+        defaultMaskedOther))
+      offset += Character.charCount(codePoint)
+    }
+    offset
+  }
+
+  def appendUnchangedToStringBuffer(
+      sb: StringBuffer,
+      inputString: String,
+      startOffset: Int,
+      numChars: Int): Int = {
+    var offset = startOffset
+    (1 to numChars) foreach { _ =>
+      val codePoint = inputString.codePointAt(offset)
+      sb.appendCodePoint(codePoint)
+      offset += Character.charCount(codePoint)
+    }
+    offset
+  }
+}
+
+trait MaskLikeWithN extends MaskLike {
+  def n: Int
+  protected lazy val charCount: Int = if (n < 0) 0 else n
+}
+
+/**
+ * Utils for mask op
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Merged build finished. Test PASSed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3195/ Test PASSed.
[GitHub] spark pull request #21246: [SPARK-23901][SQL] Add masking functions
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21246#discussion_r187940989
--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/MaskExpressionsUtils.java ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions;
+
+/**
+ * Contains all the Utils methods used in the masking expressions.
+ */
+public class MaskExpressionsUtils {
--- End diff --
Because I am also invoking it in the generated Java code, and I wanted to avoid using a Scala match clause (instead of the Java switch statement) for performance reasons.
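The codegen angle is the key point: a Java static with a plain switch can be called directly from generated code. As a rough Scala rendering of the same transform logic (a sketch only, not the PR's code; the -1 "leave unchanged" sentinel is an assumption):

```scala
// Sketch of the character transform as a Scala match; the PR keeps the real
// logic in Java (MaskExpressionsUtils.transformChar) so that generated Java
// code can call it as a cheap static switch.
def transformCharSketch(
    codePoint: Int,
    maskedUpper: Int,
    maskedLower: Int,
    maskedDigit: Int,
    maskedOther: Int): Int = {
  Character.getType(codePoint).toByte match {
    case Character.UPPERCASE_LETTER     => if (maskedUpper != -1) maskedUpper else codePoint
    case Character.LOWERCASE_LETTER     => if (maskedLower != -1) maskedLower else codePoint
    case Character.DECIMAL_DIGIT_NUMBER => if (maskedDigit != -1) maskedDigit else codePoint
    case _                              => if (maskedOther != -1) maskedOther else codePoint
  }
}
```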
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21317 Merged build finished. Test PASSed.
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21317 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90576/ Test PASSed.
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21317 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90578/ Test PASSed.
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21317
**[Test build #90576 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90576/testReport)** for PR 21317 at commit [`aa0ccc0`](https://github.com/apache/spark/commit/aa0ccc02bc080b067519834ebb2657dd54208c37).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21317 Merged build finished. Test PASSed.
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21317
**[Test build #90578 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90578/testReport)** for PR 21317 at commit [`aa0ccc0`](https://github.com/apache/spark/commit/aa0ccc02bc080b067519834ebb2657dd54208c37).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21317 Merged build finished. Test PASSed.
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21317 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3099/
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21317 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3194/ Test PASSed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Merged build finished. Test FAILed.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21266
**[Test build #90579 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90579/testReport)** for PR 21266 at commit [`fc96adb`](https://github.com/apache/spark/commit/fc96adb099380ffac09d445461435b613e08f9f3).
* This patch **fails to generate documentation**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21266 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90579/ Test FAILed.
[GitHub] spark issue #21165: [Spark-20087][CORE] Attach accumulators / metrics to 'Ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21165
**[Test build #90585 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90585/testReport)** for PR 21165 at commit [`05d1d9c`](https://github.com/apache/spark/commit/05d1d9cad761bb09e1131162458fecd5e34f02d2).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21165: [Spark-20087][CORE] Attach accumulators / metrics to 'Ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21165 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90585/ Test FAILed.
[GitHub] spark issue #21165: [Spark-20087][CORE] Attach accumulators / metrics to 'Ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21165 Merged build finished. Test FAILed.
[GitHub] spark issue #21321: [SPARK-24268][SQL] Use datatype.simpleString in error me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21321 **[Test build #90586 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90586/testReport)** for PR 21321 at commit [`ada7667`](https://github.com/apache/spark/commit/ada7667a9f8872f373bc789a7eb0c84987642314).
[GitHub] spark pull request #21321: [SPARK-24268][SQL] Use datatype.simpleString in e...
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/21321

[SPARK-24268][SQL] Use datatype.simpleString in error messages

## What changes were proposed in this pull request?

SPARK-22893 tried to unify the error messages about dataTypes. Unfortunately, many places were still missing the `simpleString` call needed to get the same representation everywhere. This PR unifies the messages by always using the `simpleString` representation of the dataTypes.

## How was this patch tested?

existing/modified UTs

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-24268

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21321.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21321

commit ada7667a9f8872f373bc789a7eb0c84987642314
Author: Marco Gaido
Date: 2018-05-05T15:19:45Z

    [SPARK-24268][SQL] Use datatype.simpleString in error messages
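For reference, `simpleString` is the compact SQL-style rendering of a `DataType`, which is what makes the messages uniform:

```scala
import org.apache.spark.sql.types._

IntegerType.simpleString                                  // "int"
ArrayType(IntegerType).simpleString                       // "array<int>"
StructType(Seq(StructField("a", LongType))).simpleString  // "struct<a:bigint>"
```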
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20894 Merged build finished. Test FAILed.
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20894
**[Test build #90581 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90581/testReport)** for PR 20894 at commit [`2bd2713`](https://github.com/apache/spark/commit/2bd27136ae9095beec429ff15a6a5f1be0464419).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20894 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90581/ Test FAILed.
[GitHub] spark issue #20894: [SPARK-23786][SQL] Checking column names of csv headers
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20894 **[Test build #90581 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90581/testReport)** for PR 20894 at commit [`2bd2713`](https://github.com/apache/spark/commit/2bd27136ae9095beec429ff15a6a5f1be0464419).
[GitHub] spark issue #21114: [SPARK-22371][CORE] Return None instead of throwing an e...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21114 **[Test build #90577 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90577/testReport)** for PR 21114 at commit [`8b30733`](https://github.com/apache/spark/commit/8b30733dba85d9881d0171414616bd0b0893f419).
[GitHub] spark issue #20933: [SPARK-23817][SQL]Migrate ORC file format read path to d...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20933 **[Test build #90584 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90584/testReport)** for PR 20933 at commit [`67b1748`](https://github.com/apache/spark/commit/67b1748c8b939a6b484bfc868fd311e381d7f8e0).
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21320 **[Test build #90582 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90582/testReport)** for PR 21320 at commit [`9e301b3`](https://github.com/apache/spark/commit/9e301b37bb5863ef5fad8ec15f1d8f3d4b5e6a6f).
[GitHub] spark issue #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate the new s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21311 **[Test build #90575 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90575/testReport)** for PR 21311 at commit [`d9d8e62`](https://github.com/apache/spark/commit/d9d8e62c2de7d9d04534396ab3bbf984ab16c7f5).
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21317 **[Test build #90578 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90578/testReport)** for PR 21317 at commit [`aa0ccc0`](https://github.com/apache/spark/commit/aa0ccc02bc080b067519834ebb2657dd54208c37).
[GitHub] spark issue #21165: [Spark-20087][CORE] Attach accumulators / metrics to 'Ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21165 **[Test build #90585 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90585/testReport)** for PR 21165 at commit [`05d1d9c`](https://github.com/apache/spark/commit/05d1d9cad761bb09e1131162458fecd5e34f02d2).
[GitHub] spark issue #19293: [SPARK-22079][SQL] Serializer in HiveOutputWriter miss l...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19293 **[Test build #90580 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90580/testReport)** for PR 19293 at commit [`45477fb`](https://github.com/apache/spark/commit/45477fbf00558066e3733a34e1d59ce22c192ee2).
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21317 **[Test build #90576 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90576/testReport)** for PR 21317 at commit [`aa0ccc0`](https://github.com/apache/spark/commit/aa0ccc02bc080b067519834ebb2657dd54208c37).
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21266 **[Test build #90579 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90579/testReport)** for PR 21266 at commit [`fc96adb`](https://github.com/apache/spark/commit/fc96adb099380ffac09d445461435b613e08f9f3).
[GitHub] spark issue #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate the new s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21311 **[Test build #90574 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90574/testReport)** for PR 21311 at commit [`22a2767`](https://github.com/apache/spark/commit/22a2767b98185edf32be3c36bb255f5837ad7466).
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21291 **[Test build #90583 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90583/testReport)** for PR 21291 at commit [`3a14bd6`](https://github.com/apache/spark/commit/3a14bd6eeb390dfe0940eb17b5c9988e12aba1bc).
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21257#discussion_r187935923
--- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
@@ -163,6 +169,12 @@ class HadoopMapReduceCommitProtocol(
   }

   override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
+    // first delete the should delete special file
+    val committerFs = jobContext.getWorkingDirectory.getFileSystem(jobContext.getConfiguration)
--- End diff --
can we change other places in this method to use the `fs` created here?
[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21319 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90573/ Test PASSed.
[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21319 Merged build finished. Test PASSed.
[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21319
**[Test build #90573 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90573/testReport)** for PR 21319 at commit [`ca6ccb2`](https://github.com/apache/spark/commit/ca6ccb24e8f2910a3ffc07a790f2ba7f57e79056).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user skonto commented on the issue: https://github.com/apache/spark/pull/21317 Jenkins, retest this please
[GitHub] spark pull request #21028: [SPARK-23922][SQL] Add arrays_overlap function
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21028#discussion_r187935251
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala ---
+    // null handling
+    checkEvaluation(ArraysOverlap(Literal.create(null, ArrayType(IntegerType)), a0), null)
+    checkEvaluation(ArraysOverlap(a0, Literal.create(null, ArrayType(IntegerType))), null)
+    checkEvaluation(ArraysOverlap(
+      Literal.create(Seq(null), ArrayType(IntegerType)),
+      Literal.create(Seq(null), ArrayType(IntegerType))), null)
--- End diff --
do we have a test case for it?
[GitHub] spark pull request #21028: [SPARK-23922][SQL] Add arrays_overlap function
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21028#discussion_r187934895
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -529,6 +567,239 @@ case class ArrayContains(left: Expression, right: Expression)
   override def prettyName: String = "array_contains"
 }

+/**
+ * Checks if the two arrays contain at least one common element.
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(a1, a2) - Returns true if a1 contains at least a non-null element present also in a2. If the arrays have no common element and they are both non-empty and either of them contains a null element null is returned, false otherwise.",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(array(1, 2, 3), array(3, 4, 5));
+       true
+  """, since = "2.4.0")
+// scalastyle:off line.size.limit
+case class ArraysOverlap(left: Expression, right: Expression)
+  extends BinaryArrayExpressionWithImplicitCast {
+
+  override def checkInputDataTypes(): TypeCheckResult = super.checkInputDataTypes() match {
+    case TypeCheckResult.TypeCheckSuccess =>
+      if (RowOrdering.isOrderable(elementType)) {
+        TypeCheckResult.TypeCheckSuccess
+      } else {
+        TypeCheckResult.TypeCheckFailure(s"${elementType.simpleString} cannot be used in comparison.")
+      }
+    case failure => failure
+  }
+
+  @transient private lazy val ordering: Ordering[Any] =
+    TypeUtils.getInterpretedOrdering(elementType)
+
+  @transient private lazy val elementTypeSupportEquals = elementType match {
+    case BinaryType => false
+    case _: AtomicType => true
+    case _ => false
+  }
+
+  @transient private lazy val doEvaluation = if (elementTypeSupportEquals) {
+    fastEval _
+  } else {
+    bruteForceEval _
+  }
+
+  override def dataType: DataType = BooleanType
+
+  override def nullable: Boolean = {
+    left.nullable || right.nullable || left.dataType.asInstanceOf[ArrayType].containsNull ||
+      right.dataType.asInstanceOf[ArrayType].containsNull
+  }
+
+  override def nullSafeEval(a1: Any, a2: Any): Any = {
+    doEvaluation(a1.asInstanceOf[ArrayData], a2.asInstanceOf[ArrayData])
+  }
+
+  /**
+   * A fast implementation which puts all the elements from the smaller array in a set
+   * and then performs a lookup on it for each element of the bigger one.
+   * This eval mode works only for data types which implements properly the equals method.
+   */
+  private def fastEval(arr1: ArrayData, arr2: ArrayData): Any = {
+    var hasNull = false
+    val (bigger, smaller, biggerDt) = if (arr1.numElements() > arr2.numElements()) {
+      (arr1, arr2, left.dataType.asInstanceOf[ArrayType])
+    } else {
+      (arr2, arr1, right.dataType.asInstanceOf[ArrayType])
+    }
+    if (smaller.numElements() > 0) {
+      val smallestSet = new mutable.HashSet[Any]
+      smaller.foreach(elementType, (_, v) =>
+        if (v == null) {
+          hasNull = true
+        } else {
+          smallestSet += v
+        })
+      bigger.foreach(elementType, (_, v1) =>
+        if (v1 == null) {
+          hasNull = true
+        } else if (smallestSet.contains(v1)) {
+          return true
+        }
+      )
+    }
+    if (hasNull) {
+      null
+    } else {
+      false
+    }
+  }
+
+  /**
+   * A slower evaluation which performs a nested loop and supports all the data types.
+   */
+  private def bruteForceEval(arr1: ArrayData, arr2: ArrayData): Any = {
+    var hasNull = false
+    if (arr1.numElements() > 0) {
+      arr1.foreach(elementType, (_, v1) =>
+        if (v1 == null) {
+          hasNull = true
+        } else {
+          arr2.foreach(elementType, (_, v2) =>
+            if (v1 == null) {
+              hasNull = true
+            } else if (ordering.equiv(v1, v2)) {
+              return true
+            }
+          )
+        })
+    }
+    if (hasNull) {
+      null
+    } else {
+      false
+    }
+  }
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    nullSafeCodeGen(ctx, ev, (a1, a2) => {
+      val smaller = ctx.freshName("smallerArray")
+      val bigger = ctx.freshName("biggerArray")
+      val comparisonCode = if (elementTypeSupportEquals) {
+        fastCodegen(ctx, ev, smaller, bigger)
+      } else {
+        bruteForceCodegen(ctx, ev, smaller, bigger)
+      }
+      s"""
+         |ArrayData $smaller;
+         |Arr
[GitHub] spark pull request #21028: [SPARK-23922][SQL] Add arrays_overlap function
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21028#discussion_r187934593
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -529,6 +567,239 @@ case class ArrayContains(left: Expression, right: Expression)
[GitHub] spark pull request #21028: [SPARK-23922][SQL] Add arrays_overlap function
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21028#discussion_r187932792
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -529,6 +567,239 @@ case class ArrayContains(left: Expression, right: Expression)
[GitHub] spark pull request #21028: [SPARK-23922][SQL] Add arrays_overlap function
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21028#discussion_r187931865
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
+  private def fastEval(arr1: ArrayData, arr2: ArrayData): Any = {
+    var hasNull = false
+    val (bigger, smaller, biggerDt) = if (arr1.numElements() > arr2.numElements()) {
--- End diff --
the `biggerDt` is not used
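A condensed sketch of the fast path under discussion: build a set from the smaller array, then probe it with the bigger one, tracking whether a null was seen so the result can degrade to SQL NULL. This uses plain Scala collections with `Option` standing in for NULL; the real code works on `ArrayData` and also has a codegen path:

```scala
import scala.collection.mutable

// Some(true) = overlap found; Some(false) = definite miss;
// None = undecidable because a null element could have matched.
def overlapFast[T](a: Seq[Option[T]], b: Seq[Option[T]]): Option[Boolean] = {
  val (bigger, smaller) = if (a.size > b.size) (a, b) else (b, a)
  var hasNull = false
  val seen = mutable.HashSet.empty[T]
  smaller.foreach {
    case None    => hasNull = true
    case Some(v) => seen += v
  }
  bigger.foreach {
    case None                        => hasNull = true
    case Some(v) if seen.contains(v) => return Some(true)
    case _                           => // this element matches nothing
  }
  if (hasNull) None else Some(false)
}
```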
[GitHub] spark pull request #21316: [SPARK-20538][SQL] Wrap Dataset.reduce with withN...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21316#discussion_r187931208
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1607,7 +1607,9 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def reduce(func: (T, T) => T): T = rdd.reduce(func)
+  def reduce(func: (T, T) => T): T = withNewExecutionId {
--- End diff --
`reduce`
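For context, a hedged sketch of what the wrapper buys: `reduce` is an action, and running it under a new execution id lets the SQL UI and query-execution listeners attribute the resulting job to the Dataset's query. The standalone helper below is hypothetical; the PR does this inside `Dataset.reduce` itself:

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.execution.SQLExecution

// Hypothetical helper mirroring the wrapped form under review.
def reduceTracked[T](ds: Dataset[T])(func: (T, T) => T): T =
  SQLExecution.withNewExecutionId(ds.sparkSession, ds.queryExecution) {
    ds.rdd.reduce(func)
  }
```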
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21288 Merged build finished. Test PASSed.
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21288 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90571/ Test PASSed.
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user zheh12 commented on a diff in the pull request: https://github.com/apache/spark/pull/21257#discussion_r187930560
--- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
@@ -163,6 +169,12 @@ class HadoopMapReduceCommitProtocol(
   }

   override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
+    // first delete the should delete special file
+    val committerFs = jobContext.getWorkingDirectory.getFileSystem(jobContext.getConfiguration)
--- End diff --
The stagingDir is not always a valid Hadoop path, but the JobContext working dir always is.
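Putting the two comments in this thread together, the pattern is to resolve one FileSystem from the job's working directory (which is always a valid Hadoop path) and reuse it across the method. A hedged sketch; the surrounding delete logic is hypothetical:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.mapreduce.JobContext

// Resolve the FileSystem once from the JobContext's working directory,
// then reuse it for every path operation in commitJob.
def deleteMarkedPaths(jobContext: JobContext, toDelete: Seq[Path]): Unit = {
  val fs: FileSystem =
    jobContext.getWorkingDirectory.getFileSystem(jobContext.getConfiguration)
  toDelete.foreach { p =>
    if (fs.exists(p)) fs.delete(p, true)  // recursive delete
  }
}
```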
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21288
**[Test build #90571 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90571/testReport)** for PR 21288 at commit [`4520044`](https://github.com/apache/spark/commit/4520044d3be40ba8bf963a151db2dd9769c0f59a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21299: [SPARK-24250][SQL] support accessing SQLConf inside task...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21299 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90570/ Test FAILed.
[GitHub] spark issue #21299: [SPARK-24250][SQL] support accessing SQLConf inside task...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21299 Merged build finished. Test FAILed.
[GitHub] spark issue #21299: [SPARK-24250][SQL] support accessing SQLConf inside task...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21299
**[Test build #90570 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90570/testReport)** for PR 21299 at commit [`a100dea`](https://github.com/apache/spark/commit/a100dea9573e9b43b993516c817e306d80f72d29).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user zheh12 commented on a diff in the pull request: https://github.com/apache/spark/pull/21257#discussion_r187929156
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -898,4 +898,12 @@ object DDLUtils {
         "Cannot overwrite a path that is also being read from.")
     }
   }
+
+  def verifyReadPath(query: LogicalPlan, outputPath: Path): Boolean = {
--- End diff --
Would isInReadPath, inReadPath, or isReadPath be better?
[GitHub] spark issue #20701: [SPARK-23528][ML] Add numIter to ClusteringSummary
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20701 kindly ping @holdenk --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21260: [SPARK-23529][K8s] Support mounting hostPath volumes
Github user andrusha commented on the issue: https://github.com/apache/spark/pull/21260 @liyinan926 sounds fair; this is also pending tests. I'll add those to the todo list for this PR and ping you once it's done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21290: [SPARK-24241][Submit]Do not fail fast when dynamic resou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21290 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90569/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21290: [SPARK-24241][Submit]Do not fail fast when dynamic resou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21290 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21290: [SPARK-24241][Submit]Do not fail fast when dynamic resou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21290 **[Test build #90569 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90569/testReport)** for PR 21290 at commit [`17fa3bc`](https://github.com/apache/spark/commit/17fa3bc1011db45c346dac4c5930258810fc28d1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate the new s...
Github user cxzl25 commented on the issue: https://github.com/apache/spark/pull/21311

Thanks for your review. @maropu @kiszk @cloud-fan I submitted a modification that includes the following:
1. splitting the append function into two parts: grow/append
2. doubling the size when growing
3. using sys.error instead of UnsupportedOperationException

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate th...
Github user cxzl25 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21311#discussion_r187907410

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala ---
@@ -568,13 +568,16 @@ private[execution] final class LongToUnsafeRowMap(val mm: TaskMemoryManager, cap
     }

     // There is 8 bytes for the pointer to next value
-    if (cursor + 8 + row.getSizeInBytes > page.length * 8L + Platform.LONG_ARRAY_OFFSET) {
+    val needSize = cursor + 8 + row.getSizeInBytes
+    val nowSize = page.length * 8L + Platform.LONG_ARRAY_OFFSET
+    if (needSize > nowSize) {
       val used = page.length
       if (used >= (1 << 30)) {
         sys.error("Can not build a HashedRelation that is larger than 8G")
--- End diff --

OK, I'll use sys.error instead of UnsupportedOperationException.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate th...
Github user cxzl25 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21311#discussion_r187907559

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala ---
@@ -568,13 +568,16 @@ private[execution] final class LongToUnsafeRowMap(val mm: TaskMemoryManager, cap
     }

     // There is 8 bytes for the pointer to next value
-    if (cursor + 8 + row.getSizeInBytes > page.length * 8L + Platform.LONG_ARRAY_OFFSET) {
+    val needSize = cursor + 8 + row.getSizeInBytes
+    val nowSize = page.length * 8L + Platform.LONG_ARRAY_OFFSET
+    if (needSize > nowSize) {
       val used = page.length
       if (used >= (1 << 30)) {
         sys.error("Can not build a HashedRelation that is larger than 8G")
       }
-      ensureAcquireMemory(used * 8L * 2)
-      val newPage = new Array[Long](used * 2)
+      val multiples = math.max(math.ceil(needSize.toDouble / (used * 8L)).toInt, 2)
+      ensureAcquireMemory(used * 8L * multiples)
--- End diff --

OK, I'll split the append function into two parts: grow and append.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21311: [SPARK-24257][SQL]LongToUnsafeRowMap calculate th...
Github user cxzl25 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21311#discussion_r187907473

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala ---
@@ -568,13 +568,16 @@ private[execution] final class LongToUnsafeRowMap(val mm: TaskMemoryManager, cap
     }

     // There is 8 bytes for the pointer to next value
-    if (cursor + 8 + row.getSizeInBytes > page.length * 8L + Platform.LONG_ARRAY_OFFSET) {
+    val needSize = cursor + 8 + row.getSizeInBytes
+    val nowSize = page.length * 8L + Platform.LONG_ARRAY_OFFSET
+    if (needSize > nowSize) {
       val used = page.length
       if (used >= (1 << 30)) {
         sys.error("Can not build a HashedRelation that is larger than 8G")
       }
-      ensureAcquireMemory(used * 8L * 2)
--- End diff --

OK, I'll double the size when growing.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
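Putting the three review comments together, a minimal sketch of the proposed grow/append split follows. It assumes the surrounding `LongToUnsafeRowMap` members from the quoted diff (`cursor`, `page`, `ensureAcquireMemory`); the elided bodies and exact signatures are assumptions, not the merged change:

```scala
// Ensure the page has room for `neededSize` more bytes, doubling on growth.
private def grow(neededSize: Int): Unit = {
  val needSize = cursor + neededSize
  val nowSize = page.length * 8L + Platform.LONG_ARRAY_OFFSET
  if (needSize > nowSize) {
    val used = page.length
    if (used >= (1 << 30)) {
      sys.error("Can not build a HashedRelation that is larger than 8G")
    }
    // Double the size when growing, per the review; a single row larger than
    // the doubled page would need further handling not sketched here.
    ensureAcquireMemory(used * 8L * 2)
    val newPage = new Array[Long](used * 2)
    // ... copy the old page into newPage, release the old allocation,
    // and point `page` at newPage ...
  }
}

private def append(key: Long, row: UnsafeRow): Unit = {
  // There are 8 bytes for the pointer to the next value.
  grow(8 + row.getSizeInBytes)
  // ... write the row at `cursor` and link it into the key's chain ...
}
```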
[GitHub] spark pull request #21165: [Spark-20087][CORE] Attach accumulators / metrics...
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21165#discussion_r187905421

--- Diff: docs/rdd-programming-guide.md ---
@@ -1548,6 +1548,9 @@ data.map(g)

+In new version of Spark(> 2.3), the semantic of Accumulator has been changed a bit: it now includes updates from
--- End diff --

it's not needed now

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21165: [Spark-20087][CORE] Attach accumulators / metrics to 'Ta...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21165 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21257#discussion_r187905010 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala --- @@ -898,4 +898,12 @@ object DDLUtils { "Cannot overwrite a path that is also being read from.") } } + + def verifyReadPath(query: LogicalPlan, outputPath: Path): Boolean = { --- End diff -- `isReadPath`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21257#discussion_r187904340

--- Diff: core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala ---
@@ -163,6 +169,12 @@ class HadoopMapReduceCommitProtocol(
   }

   override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
+    // first delete the should delete special file
+    val committerFs = jobContext.getWorkingDirectory.getFileSystem(jobContext.getConfiguration)
--- End diff --

will this be different from `stagingDir.getFileSystem(jobContext.getConfiguration)`?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21257: [SPARK-24194] [SQL]HadoopFsRelation cannot overwr...
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21257#discussion_r187903129

--- Diff: core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala ---
@@ -120,7 +120,8 @@ abstract class FileCommitProtocol {
    * Specifies that a file should be deleted with the commit of this job. The default
    * implementation deletes the file immediately.
    */
-  def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Boolean = {
+  def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean,
--- End diff --

It seems `recursive` is always passed as true; can we remove it?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
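If `recursive` really is always true at the call sites, the API could shrink to something like the sketch below. This illustrates the reviewer's suggestion, not the merged signature:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

abstract class FileCommitProtocol {
  // Default implementation deletes the file immediately, always recursively,
  // since `recursive` appears to be passed as true at every call site.
  def deleteWithJob(fs: FileSystem, path: Path): Boolean = fs.delete(path, true)
}
```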
[GitHub] spark pull request #21291: [SPARK-24242][SQL] RangeExec should have correct ...
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21291#discussion_r187900810

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala ---
@@ -621,6 +621,25 @@ class PlannerSuite extends SharedSQLContext {
       requiredOrdering = Seq(orderingA, orderingB),
       shouldHaveSort = true)
   }
+
+  test("SPARK-24242: RangeExec should have correct output ordering") {
+    val df = spark.range(10).orderBy("id")
--- End diff --

why do we put an `orderBy` in the query?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21291: [SPARK-24242][SQL] RangeExec should have correct ...
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21291#discussion_r187900467

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala ---
@@ -621,6 +621,25 @@ class PlannerSuite extends SharedSQLContext {
       requiredOrdering = Seq(orderingA, orderingB),
       shouldHaveSort = true)
   }
+
+  test("SPARK-24242: RangeExec should have correct output ordering") {
--- End diff --

ordering and partitioning

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
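For readers following the thread, a hedged sketch of what a test like the one in the diff can assert, written as if inside a suite with `SharedSQLContext` so `spark` is in scope: if `RangeExec` reports a correct `outputOrdering`, the planner should satisfy the `orderBy` without inserting a sort. The exact assertions are illustrative, not the PR's test body:

```scala
import org.apache.spark.sql.execution.{RangeExec, SortExec}

val df = spark.range(10).orderBy("id")
val plan = df.queryExecution.executedPlan

// If RangeExec reports that its output is already ordered by id,
// no SortExec should appear in the executed plan.
assert(plan.collect { case s: SortExec => s }.isEmpty)
assert(plan.collect { case r: RangeExec => r }.nonEmpty)
```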
[GitHub] spark issue #21291: [SPARK-24242][SQL] RangeExec should have correct outputO...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21291 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21316: [SPARK-20538][SQL] Wrap Dataset.reduce with withN...
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/21316#discussion_r187899299

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1607,7 +1607,9 @@ class Dataset[T] private[sql](
    */
   @Experimental
   @InterfaceStability.Evolving
-  def reduce(func: (T, T) => T): T = rdd.reduce(func)
+  def reduce(func: (T, T) => T): T = withNewExecutionId {
--- End diff --

@maropu When you asked about this API, did you refer to `reduce` or `withNewExecutionId`?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21106: [SPARK-23711][SQL][WIP] Add fallback logic for UnsafePro...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21106 mostly LGTM, though people may have better ideas about naming. cc @hvanhovell --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/21320 @gatorsmile I believe this is the PR you requested for review. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 I'm closing this PR in favor of #21320. That PR deals with simple projection and filter queries only. I will submit subsequent PRs for aggregation and join queries following the acceptance of #21320. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman closed the pull request at: https://github.com/apache/spark/pull/16578 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21106: [SPARK-23711][SQL][WIP] Add fallback logic for Un...
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21106#discussion_r187897203

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala ---
@@ -87,12 +87,11 @@ class InterpretedUnsafeProjection(expressions: Array[Expression]) extends Unsafe
 /**
  * Helper functions for creating an [[InterpretedUnsafeProjection]].
  */
-object InterpretedUnsafeProjection extends UnsafeProjectionCreator {
-
+object InterpretedUnsafeProjection {
   /**
    * Returns an [[UnsafeProjection]] for given sequence of bound Expressions.
    */
-  override protected def createProjection(exprs: Seq[Expression]): UnsafeProjection = {
+  protected[sql] def createProjection(exprs: Seq[Expression]): UnsafeProjection = {
--- End diff --

Did you change it?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/21320

[SPARK-4502][SQL] Parquet nested column pruning - foundation

(Link to Jira: https://issues.apache.org/jira/browse/SPARK-4502)

_N.B. This is a restart of PR #16578, which includes everything in that PR except the aggregation and join schema pruning rules. Relevant review comments from that PR should be considered incorporated by reference. Please avoid duplication in review by reviewing that PR first. The summary below is an edited copy of the summary of the previous PR._

## What changes were proposed in this pull request?

One of the hallmarks of a column-oriented data storage format is the ability to read data from a subset of columns, efficiently skipping reads from other columns. Spark has long had support for pruning unneeded top-level schema fields from the scan of a parquet file. For example, consider a table, `contacts`, backed by parquet with the following Spark SQL schema:

```
root
 |-- name: struct
 |    |-- first: string
 |    |-- last: string
 |-- address: string
```

Parquet stores this table's data in three physical columns: `name.first`, `name.last` and `address`. To answer the query

```SQL
select address from contacts
```

Spark will read only from the `address` column of parquet data. However, to answer the query

```SQL
select name.first from contacts
```

Spark will read `name.first` and `name.last` from parquet. This PR modifies Spark SQL to support a finer grain of schema pruning. With this patch, Spark reads only the `name.first` column to answer the previous query.

### Implementation

There are two main components of this patch. First, there is a `ParquetSchemaPruning` optimizer rule for gathering the required schema fields of a `PhysicalOperation` over a parquet file, constructing a new schema based on those required fields and rewriting the plan in terms of that pruned schema. The pruned schema fields are pushed down to the parquet requested read schema. `ParquetSchemaPruning` uses a new `ProjectionOverSchema` extractor for rewriting a catalyst expression in terms of a pruned schema.

Second, the `ParquetRowConverter` has been patched to ensure the ordinals of the parquet columns read are correct for the pruned schema. `ParquetReadSupport` has been patched to address a compatibility mismatch between Spark's built-in vectorized reader and the parquet-mr library's reader.

### Limitation

Among the complex Spark SQL data types, this patch supports parquet column pruning of nested sequences of struct fields only.

## How was this patch tested?

Care has been taken to ensure correctness and prevent regressions. A more advanced version of this patch incorporating optimizations for rewriting queries involving aggregations and joins has been running on a production Spark cluster at VideoAmp for several years. In that time, one bug was found and fixed early on, and we added a regression test for that bug. We forward-ported this patch to Spark master in June 2016 and have been running this patch against Spark 2.x branches on ad-hoc clusters since then.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VideoAmp/spark-public spark-4502-parquet_column_pruning-foundation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21320.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21320

commit 9e301b37bb5863ef5fad8ec15f1d8f3d4b5e6a6f
Author: Michael Allman
Date: 2016-06-24T17:21:24Z

    [SPARK-4502][SQL] Parquet nested column pruning

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
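To make the summary above concrete, a small hedged example of the behavior it describes; the path and data are made up, and the claims about which physical columns get read restate the PR's summary rather than verified measurements:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("pruning-demo").getOrCreate()
import spark.implicits._

case class Name(first: String, last: String)
case class Contact(name: Name, address: String)

// Write a tiny `contacts` table with the schema from the PR description.
Seq(Contact(Name("Jane", "Doe"), "123 Main St"))
  .toDF()
  .write.mode("overwrite").parquet("/tmp/contacts")

// Top-level pruning has long worked: this reads only the physical column `address`.
spark.read.parquet("/tmp/contacts").select("address").show()

// With this patch, this reads only `name.first`; before it, Spark also read `name.last`.
spark.read.parquet("/tmp/contacts").select("name.first").show()
```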
[GitHub] spark pull request #21106: [SPARK-23711][SQL][WIP] Add fallback logic for Un...
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21106#discussion_r187897086

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CodeGeneratorWithInterpretedFallback.scala ---
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.codehaus.commons.compiler.CompileException
+import org.codehaus.janino.InternalCompilerException
+
+import org.apache.spark.TaskContext
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.Utils
+
+/**
+ * Catches compile error during code generation.
+ */
+object CodegenError {
+  def unapply(throwable: Throwable): Option[Exception] = throwable match {
+    case e: InternalCompilerException => Some(e)
+    case e: CompileException => Some(e)
+    case _ => None
+  }
+}
+
+/**
+ * Defines values for `SQLConf` config of fallback mode. Use for test only.
+ */
+object CodegenObjectFactoryMode extends Enumeration {
+  val AUTO, CODEGEN_ONLY, NO_CODEGEN = Value
+
+  def currentMode: CodegenObjectFactoryMode.Value = {
+    // If we weren't on task execution, accesses that config.
+    if (TaskContext.get == null) {
+      val config = SQLConf.get.getConf(SQLConf.CODEGEN_FACTORY_MODE)
+      CodegenObjectFactoryMode.withName(config)
+    } else {
+      CodegenObjectFactoryMode.AUTO
+    }
+  }
+}
+
+/**
+ * A factory which can be used to create objects that have both codegen and interpreted
--- End diff --

the comment also needs update.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
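For context, the factory pattern the quoted file builds toward can be sketched as follows, reusing `CodegenError` and `CodegenObjectFactoryMode` from the diff above. The abstract method names are assumptions based on the thread, not necessarily the merged API:

```scala
abstract class CodeGeneratorWithInterpretedFallback[IN, OUT] {

  protected def createCodeGeneratedObject(in: IN): OUT
  protected def createInterpretedObject(in: IN): OUT

  def createObject(in: IN): OUT = CodegenObjectFactoryMode.currentMode match {
    case CodegenObjectFactoryMode.CODEGEN_ONLY => createCodeGeneratedObject(in)
    case CodegenObjectFactoryMode.NO_CODEGEN => createInterpretedObject(in)
    case _ =>
      // AUTO: try codegen first and fall back to the interpreted
      // implementation when Janino fails to compile the generated code.
      try {
        createCodeGeneratedObject(in)
      } catch {
        case CodegenError(_) => createInterpretedObject(in)
      }
  }
}
```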
[GitHub] spark issue #19789: [SPARK-22562][Streaming] CachedKafkaConsumer unsafe evic...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19789 cc @marmbrus @zsxwing --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19293: [SPARK-22079][SQL] Serializer in HiveOutputWriter miss l...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19293 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21114: [SPARK-22371][CORE] Return None instead of throwing an e...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21114 This behavior change LGTM, but the test is overcomplicated and seems to have limited value. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21114: [SPARK-22371][CORE] Return None instead of throwi...
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21114#discussion_r187891772

--- Diff: core/src/test/scala/org/apache/spark/AccumulatorSuite.scala ---
@@ -237,6 +236,65 @@ class AccumulatorSuite extends SparkFunSuite with Matchers with LocalSparkContex
     acc.merge("kindness")
     assert(acc.value === "kindness")
   }
+
+  test("updating garbage collected accumulators") {
--- End diff --

What does this test do? Prove that an accumulator can be GCed even while it's still valid? The map in `AccumulatorContext` is fixed-size, so this can definitely happen and we don't need to prove it.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
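To illustrate the premise under discussion, a rough sketch of how a registered accumulator can be collected while still valid. `AccumulatorContext` and `register(sc)` are private[spark], so this only compiles inside Spark's own test tree, and `System.gc()` is only a hint, so the final assertion is best-effort:

```scala
import org.apache.spark.util.{AccumulatorContext, LongAccumulator}

// Inside a suite with a live SparkContext `sc`:
var acc: LongAccumulator = new LongAccumulator
acc.register(sc)              // assigns an id; the context holds only a weak reference
val id = acc.id

acc = null                    // drop the only strong reference
System.gc()                   // a hint; collection is not guaranteed

// After SPARK-22371, a cleared weak reference yields None instead of throwing.
assert(AccumulatorContext.get(id).isEmpty)
```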
[GitHub] spark issue #21266: [SPARK-24206][SQL] Improve DataSource read benchmark cod...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21266 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org