[GitHub] spark issue #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks when spi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21369 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3549/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19602: [SPARK-22384][SQL] Refine partition pruning when attribu...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19602 @cloud-fan Thanks a lot for looking into this. I updated the change and generalized `ExtractAttribute`
[GitHub] spark issue #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks when spi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21369 **[Test build #91101 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91101/testReport)** for PR 21369 at commit [`bc7dc11`](https://github.com/apache/spark/commit/bc7dc11383db8370f755a058f4b908588f93edc8).
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21246 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3548/ Test PASSed.
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21246 Merged build finished. Test PASSed.
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21246 **[Test build #91100 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91100/testReport)** for PR 21246 at commit [`6fd8f2f`](https://github.com/apache/spark/commit/6fd8f2fbd37e5193f0ffb1a25a8f4a8c71ab55bd).
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21246 retest this please
[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21420 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91090/ Test FAILed.
[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21420 Merged build finished. Test FAILed.
[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21420 **[Test build #91090 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91090/testReport)** for PR 21420 at commit [`a41c99b`](https://github.com/apache/spark/commit/a41c99bf311aa8f4e0c2e07c1288f5a11e057ea4). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks w...
Github user eyalfa commented on a diff in the pull request: https://github.com/apache/spark/pull/21369#discussion_r190542635 --- Diff: core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala --- @@ -414,6 +415,99 @@ class ExternalAppendOnlyMapSuite extends SparkFunSuite with LocalSparkContext { sc.stop() } + test("spill during iteration") { +val size = 1000 +val conf = createSparkConf(loadDefaults = true) +sc = new SparkContext("local-cluster[1,1,1024]", "test", conf) +val map = createExternalMap[Int] + +map.insertAll((0 until size).iterator.map(i => (i / 10, i))) +assert(map.numSpills == 0, "map was not supposed to spill") + +val it = map.iterator +assert( it.isInstanceOf[CompletionIterator[_, _]]) +val underlyingIt = map.readingIterator +assert( underlyingIt != null ) +val underlyingMapIterator = underlyingIt.upstream +assert(underlyingMapIterator != null) +val underlyingMapIteratorClass = underlyingMapIterator.getClass +assert(underlyingMapIteratorClass.getEnclosingClass == classOf[AppendOnlyMap[_, _]]) + +val underlyingMap = map.currentMap +assert(underlyingMap != null) + +val first50Keys = for ( _ <- 0 until 50) yield { + val (k, vs) = it.next + val sortedVs = vs.sorted + assert(sortedVs.seq == (0 until 10).map(10 * k + _)) + k +} +assert( map.numSpills == 0 ) +map.spill(Long.MaxValue, null) +// these asserts try to show that we're no longer holding references to the underlying map. +// it'd be nice to use something like +// https://github.com/scala/scala/blob/2.13.x/test/junit/scala/tools/testing/AssertUtil.scala +// (lines 69-89) +assert(map.currentMap == null) +assert(underlyingIt.upstream ne underlyingMapIterator) +assert(underlyingIt.upstream.getClass != underlyingMapIteratorClass) +assert(underlyingIt.upstream.getClass.getEnclosingClass != classOf[AppendOnlyMap[_, _]]) --- End diff -- hmm, we can in line 508 but not in this test. 
In this test we look at the iterator immediately after a spill; at this point `upstream` is supposed to be replaced by a `DiskMapIterator`. I guess we can check for this directly (after relaxing its visibility to package private). In line 508, we can simply compare with `Iterator.empty`.
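The behavior discussed above can be illustrated outside Spark. The following is a minimal, hypothetical model (none of these names are Spark's actual classes): an iterator wrapper whose `upstream` is swapped for a replacement iterator when a spill occurs, releasing the reference to the in-memory data so it can be garbage collected, which is what the suite's assertions look for.

```scala
// Sketch only: `SpillableIterator` here simulates the pattern under test.
// In real Spark, the replacement would be a DiskMapIterator reading spilled data.
class SpillableIterator[A](private var inMemory: Iterator[A]) extends Iterator[A] {
  // Starts as the in-memory iterator; exposed so a test can assert it was replaced.
  var upstream: Iterator[A] = inMemory

  def hasNext: Boolean = upstream.hasNext
  def next(): A = upstream.next()

  // Simulate a spill: materialize the remaining elements (standing in for
  // writing them to disk), replace `upstream`, and drop the in-memory reference.
  def spill(): Unit = {
    upstream = upstream.toList.iterator
    inMemory = null
  }
}

val it = new SpillableIterator(Iterator(1, 2, 3))
val before = it.upstream
it.next()
it.spill()
assert(it.upstream ne before)     // upstream was replaced, as the suite checks
assert(it.toList == List(2, 3))   // remaining elements still come back
```

The key point, mirrored in the suite's reference-identity assertions, is that after the spill the wrapper no longer holds the iterator backed by the in-memory map.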
[GitHub] spark pull request #21422: [Spark-24376][doc]Summary:compiling spark with sc...
Github user gentlewangyu commented on a diff in the pull request: https://github.com/apache/spark/pull/21422#discussion_r190541918 --- Diff: docs/building-spark.md --- @@ -92,10 +92,10 @@ like ZooKeeper and Hadoop itself. ./build/mvn -Pmesos -DskipTests clean package ## Building for Scala 2.10 -To produce a Spark package compiled with Scala 2.10, use the `-Dscala-2.10` property: +To produce a Spark package compiled with Scala 2.10, use the `-Pscala-2.10` property: ./dev/change-scala-version.sh 2.10 -./build/mvn -Pyarn -Dscala-2.10 -DskipTests clean package +./build/mvn -Pyarn -scala-2.10 -DskipTestsP clean package --- End diff -- Sorry, it's `-Pscala-2.10`.
[GitHub] spark issue #21391: [SPARK-24343][SQL] Avoid shuffle for the bucketed table ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21391 Merged build finished. Test FAILed.
[GitHub] spark issue #21391: [SPARK-24343][SQL] Avoid shuffle for the bucketed table ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21391 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91093/ Test FAILed.
[GitHub] spark issue #21391: [SPARK-24343][SQL] Avoid shuffle for the bucketed table ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21391 **[Test build #91093 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91093/testReport)** for PR 21391 at commit [`d9a440a`](https://github.com/apache/spark/commit/d9a440a9814913827fcfcff644c741a43332b02d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21246 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91095/ Test FAILed.
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21246 Merged build finished. Test FAILed.
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21246 **[Test build #91095 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91095/testReport)** for PR 21246 at commit [`6fd8f2f`](https://github.com/apache/spark/commit/6fd8f2fbd37e5193f0ffb1a25a8f4a8c71ab55bd). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21422: [Spark-24376][doc]Summary:compiling spark with scala-2.1...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21422 A minor fix doesn't usually need a JIRA. Let's avoid it next time.
[GitHub] spark pull request #21422: [Spark-24376][doc]Summary:compiling spark with sc...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21422#discussion_r190525438 --- Diff: docs/building-spark.md --- @@ -92,10 +92,10 @@ like ZooKeeper and Hadoop itself. ./build/mvn -Pmesos -DskipTests clean package ## Building for Scala 2.10 -To produce a Spark package compiled with Scala 2.10, use the `-Dscala-2.10` property: +To produce a Spark package compiled with Scala 2.10, use the `-Pscala-2.10` property: ./dev/change-scala-version.sh 2.10 -./build/mvn -Pyarn -Dscala-2.10 -DskipTests clean package +./build/mvn -Pyarn -scala-2.10 -DskipTestsP clean package --- End diff -- DskipTestsP?
[GitHub] spark pull request #21398: [SPARK-24338][SQL] Fixed Hive CREATETABLE error i...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/21398#discussion_r190521452 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -230,11 +232,29 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat // specify location for managed table. And in [[CreateDataSourceTableAsSelectCommand]] we have // to create the table directory and write out data before we create this table, to avoid // exposing a partial written table. -val needDefaultTableLocation = tableDefinition.tableType == MANAGED && - tableDefinition.storage.locationUri.isEmpty - -val tableLocation = if (needDefaultTableLocation) { - Some(CatalogUtils.stringToURI(defaultTablePath(tableDefinition.identifier))) +// +// When using a remote metastore, and if a managed table is being created with its +// location explicitly set to the location where it would be created anyway, then do +// not set its location explicitly. This avoids an issue with Sentry in secure clusters. +// Otherwise, the above comment applies. --- End diff -- > The question is whether we want this workaround in Spark while Sentry is not fixed. We could open a bug for Sentry and, since Spark uses Hive, which often comes with Sentry in enterprise environments, it would be good to point to that bug in the code, so when things get fixed we could change the code here.
[GitHub] spark issue #21331: [SPARK-24276][SQL] Order of literals in IN should not af...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21331 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3547/ Test PASSed.
[GitHub] spark issue #21331: [SPARK-24276][SQL] Order of literals in IN should not af...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21331 Merged build finished. Test PASSed.
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21317 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3425/
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21416 Merged build finished. Test FAILed.
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21317 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3425/
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21317 Merged build finished. Test PASSed.
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21317 **[Test build #91098 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91098/testReport)** for PR 21317 at commit [`ac544f8`](https://github.com/apache/spark/commit/ac544f836a867da2c709fbfe6e0a7336e1d026c1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21416 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91091/ Test FAILed.
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21317 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91098/ Test PASSed.
[GitHub] spark issue #21331: [SPARK-24276][SQL] Order of literals in IN should not af...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21331 **[Test build #91099 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91099/testReport)** for PR 21331 at commit [`a0af525`](https://github.com/apache/spark/commit/a0af52524e30a9ace9d9a6239de79a7251a2499c).
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21317 Merged build finished. Test PASSed.
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21416 **[Test build #91091 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91091/testReport)** for PR 21416 at commit [`de3296c`](https://github.com/apache/spark/commit/de3296c116b07d6560b1f89ca97c2014d36d10a8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21317 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3546/ Test PASSed.
[GitHub] spark pull request #21331: [SPARK-24276][SQL] Order of literals in IN should...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21331#discussion_r190518402 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala --- @@ -85,6 +87,10 @@ object Canonicalize { case Not(GreaterThanOrEqual(l, r)) => LessThan(l, r) case Not(LessThanOrEqual(l, r)) => GreaterThan(l, r) +// order the list in the In operator +case In(value, list) => + In(value, list.sortBy(_.semanticHash())) --- End diff -- Thanks for your great comment. This is very helpful and I'll keep it in mind for the future. I think I addressed all the points now. Thanks.
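The idea in the diff above, canonicalizing `In` by sorting its list with a deterministic key so that literal order does not affect semantic equality, can be sketched in a few lines. This is a toy model with hypothetical names (`Expr`, `Lit`, `Attr`), not Spark's catalyst classes:

```scala
// Toy expression tree; `semanticHash` stands in for catalyst's semantic hash.
sealed trait Expr { def semanticHash: Int = hashCode }
case class Lit(v: Int) extends Expr
case class Attr(name: String) extends Expr
case class In(value: Expr, list: Seq[Expr]) extends Expr

// Canonicalize In by ordering its list deterministically, as the diff does.
def canonicalize(e: Expr): Expr = e match {
  case In(v, list) => In(v, list.sortBy(_.semanticHash))
  case other       => other
}

val a = Attr("a")
// `a IN (2, 1)` and `a IN (1, 2)` canonicalize to the same expression.
assert(canonicalize(In(a, Seq(Lit(2), Lit(1)))) ==
       canonicalize(In(a, Seq(Lit(1), Lit(2)))))
```

After this rewrite, two `In` expressions that differ only in literal order compare equal, which is what lets downstream passes recognize them as semantically identical.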
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user skonto commented on the issue: https://github.com/apache/spark/pull/21317 @liyinan926 ready for another round. I think it's ready for merge.
[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21317 **[Test build #91098 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91098/testReport)** for PR 21317 at commit [`ac544f8`](https://github.com/apache/spark/commit/ac544f836a867da2c709fbfe6e0a7336e1d026c1).
[GitHub] spark pull request #21317: [SPARK-24232][k8s] Add support for secret env var...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/21317#discussion_r190514132 --- Diff: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/EnvSecretsFeatureStepSuite.scala --- @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.spark.deploy.k8s.features + +import io.fabric8.kubernetes.api.model.PodBuilder + +import org.apache.spark.{SparkConf, SparkFunSuite} +import org.apache.spark.deploy.k8s._ + +class EnvSecretsFeatureStepSuite extends SparkFunSuite{ + private val KEY_REF_NAME_FOO = "foo" + private val KEY_REF_NAME_BAR = "bar" + private val KEY_REF_KEY_FOO = "key_foo" + private val KEY_REF_KEY_BAR = "key_bar" + private val ENV_NAME_FOO = "MY_FOO" + private val ENV_NAME_BAR = "MY_bar" + + test("sets up all keyRefs") { +val baseDriverPod = SparkPod.initialPod() +val envVarsToKeys = Map( + ENV_NAME_BAR -> s"${KEY_REF_NAME_BAR}:${KEY_REF_KEY_BAR}", + ENV_NAME_FOO -> s"${KEY_REF_NAME_FOO}:${KEY_REF_KEY_FOO}") +val sparkConf = new SparkConf(false) +val kubernetesConf = KubernetesConf( + sparkConf, + KubernetesExecutorSpecificConf("1", new PodBuilder().build()), + "resource-name-prefix", + "app-id", + Map.empty, + Map.empty, + Map.empty, + envVarsToKeys, + Map.empty) + +val step = new EnvSecretsFeatureStep(kubernetesConf) +val driverContainerWithEnvSecrets = step.configurePod(baseDriverPod).container + +val expectedVars = + Seq(s"${ENV_NAME_BAR}", s"${ENV_NAME_FOO}") + +expectedVars.foreach { envName => + assert(SecretEnvUtils.containerHasEnvVar(driverContainerWithEnvSecrets, envName)) --- End diff -- As we discussed, this is K8s's responsibility; we shouldn't verify the values.
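The helper the suite leans on, checking that a container spec carries an env var of a given name without asserting on its value, amounts to a simple existence check. This is a hedged sketch with plain case classes standing in for fabric8's Kubernetes model (the real `SecretEnvUtils.containerHasEnvVar` operates on fabric8 types):

```scala
// Simplified stand-ins for the Kubernetes container/env-var model.
case class EnvVar(name: String, valueFrom: String)
case class Container(env: Seq[EnvVar])

// Check by name only; the secret-to-value resolution is Kubernetes's job,
// which is why the suite does not verify values.
def containerHasEnvVar(container: Container, envName: String): Boolean =
  container.env.exists(_.name == envName)

val c = Container(Seq(
  EnvVar("MY_FOO", "foo:key_foo"),
  EnvVar("MY_bar", "bar:key_bar")))
assert(containerHasEnvVar(c, "MY_FOO"))
assert(!containerHasEnvVar(c, "MISSING"))
```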
[GitHub] spark issue #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21295 Merged build finished. Test PASSed.
[GitHub] spark issue #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21295 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3545/ Test PASSed.
[GitHub] spark issue #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase w...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21295 **[Test build #91097 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91097/testReport)** for PR 21295 at commit [`497bdd8`](https://github.com/apache/spark/commit/497bdd8fc581f3c40ae97eb56d0a5f65e7d42405).
[GitHub] spark issue #21415: [SPARK-24244][SPARK-24368][SQL] Passing only required co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21415 **[Test build #91096 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91096/testReport)** for PR 21415 at commit [`0aef16b`](https://github.com/apache/spark/commit/0aef16b5e9017fb398e0df2f3694a1db1f4d7cb8).
[GitHub] spark issue #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase w...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21295 retest this please
[GitHub] spark issue #21415: [SPARK-24244][SPARK-24368][SQL] Passing only required co...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/21415 jenkins, retest this, please
[GitHub] spark issue #21260: [SPARK-23529][K8s] Support mounting volumes
Github user andrusha commented on the issue: https://github.com/apache/spark/pull/21260 @liyinan926 yeap, ready for review. If you think it makes sense to make the refactors I'm speaking of, I can add those too.
[GitHub] spark pull request #21405: [SPARK-24361][SQL] Polish code block manipulation...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21405#discussion_r190502747 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala --- @@ -149,6 +152,34 @@ trait Block extends JavaCode { // Concatenates this block with other block. def + (other: Block): Block + + /** + * Apply a map function to each java expression codes present in this java code, and return a new + * java code based on the mapped java expression codes. + */ + def transformExprValues(f: PartialFunction[ExprValue, ExprValue]): this.type = { --- End diff -- Ok. Will add them in next commit. Thanks.
[GitHub] spark issue #21422: [Spark-24376][doc]Summary:compiling spark with scala-2.1...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21422 Can one of the admins verify this patch?
[GitHub] spark issue #21422: [Spark-24376][doc]Summary:compiling spark with scala-2.1...
Github user gentlewangyu commented on the issue: https://github.com/apache/spark/pull/21422 I have verified this patch
[GitHub] spark pull request #21405: [SPARK-24361][SQL] Polish code block manipulation...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21405#discussion_r190502024 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala --- @@ -149,6 +152,34 @@ trait Block extends JavaCode { // Concatenates this block with other block. def + (other: Block): Block + + /** + * Apply a map function to each java expression codes present in this java code, and return a new + * java code based on the mapped java expression codes. + */ + def transformExprValues(f: PartialFunction[ExprValue, ExprValue]): this.type = { --- End diff -- I see, can we add some tests for this?
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21246 Merged build finished. Test PASSed.
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21246 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3544/ Test PASSed.
[GitHub] spark pull request #21422: [Spark-24376][doc]Summary:compiling spark with sc...
GitHub user gentlewangyu opened a pull request: https://github.com/apache/spark/pull/21422 [Spark-24376][doc]Summary:compiling spark with scala-2.10 should use the -P parameter instead of -D What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/gentlewangyu/spark SPARK-24376 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21422.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21422 commit d5167ed3dd3ce87a8904c2444eaf6c839ea541d3 Author: wangyu Date: 2018-05-24T08:06:47Z git config --global user.email "wan...@cmss.chinamobile.com" git config --global user.name "gentlewangyu"
[GitHub] spark issue #21422: [Spark-24376][doc]Summary:compiling spark with scala-2.1...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21422 Can one of the admins verify this patch?
[GitHub] spark pull request #21405: [SPARK-24361][SQL] Polish code block manipulation...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21405#discussion_r190501136 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala --- @@ -149,6 +152,34 @@ trait Block extends JavaCode { // Concatenates this block with other block. def + (other: Block): Block + + /** + * Apply a map function to each java expression codes present in this java code, and return a new + * java code based on the mapped java expression codes. + */ + def transformExprValues(f: PartialFunction[ExprValue, ExprValue]): this.type = { --- End diff -- `transformExprValues` will create a new `Block` instance because a block is immutable. So whenever there is a change, the new block should take the new `exprValues`.
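The immutability point in this comment can be illustrated with a plain-Scala sketch. `SimpleBlock` and `transformValues` below are stand-in names, not Spark's actual `Block` API: the transform returns a fresh instance and never mutates the receiver.

```scala
// Minimal sketch of the immutable-transform pattern discussed above.
// `SimpleBlock` and `transformValues` are illustrative names, not Spark's API.
case class SimpleBlock(exprValues: Seq[String], code: String) {
  // Returns a NEW block; the receiver is never mutated.
  def transformValues(f: PartialFunction[String, String]): SimpleBlock = {
    val mapped = exprValues.map(v => if (f.isDefinedAt(v)) f(v) else v)
    SimpleBlock(mapped, code)
  }
}

object BlockDemo {
  def main(args: Array[String]): Unit = {
    val b1 = SimpleBlock(Seq("value_1", "isNull_1"), "int value_1 = 0;")
    val b2 = b1.transformValues { case "value_1" => "value_2" }
    println(b2.exprValues.mkString(","))  // value_2,isNull_1
    println(b1.exprValues.mkString(","))  // original untouched: value_1,isNull_1
  }
}
```

Because the case class is immutable, the mapped `exprValues` can only live in the new instance, which is why the transform must return a new block rather than `this`.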
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21246 **[Test build #91095 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91095/testReport)** for PR 21246 at commit [`6fd8f2f`](https://github.com/apache/spark/commit/6fd8f2fbd37e5193f0ffb1a25a8f4a8c71ab55bd).
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21246 **[Test build #91094 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91094/testReport)** for PR 21246 at commit [`fca3af8`](https://github.com/apache/spark/commit/fca3af8d7ed57b347bec046ab1ca7b2227286fec). * This patch **fails Java style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21246 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91094/ Test FAILed.
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21246 Merged build finished. Test FAILed.
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21246 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3543/ Test PASSed.
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21246 Merged build finished. Test PASSed.
[GitHub] spark pull request #21405: [SPARK-24361][SQL] Polish code block manipulation...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/21405#discussion_r190497595 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala --- @@ -149,6 +152,34 @@ trait Block extends JavaCode { // Concatenates this block with other block. def + (other: Block): Block + + /** + * Apply a map function to each java expression codes present in this java code, and return a new + * java code based on the mapped java expression codes. + */ + def transformExprValues(f: PartialFunction[ExprValue, ExprValue]): this.type = { --- End diff -- am I wrong or we are not updating the `exprValues` here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21246 **[Test build #91094 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91094/testReport)** for PR 21246 at commit [`fca3af8`](https://github.com/apache/spark/commit/fca3af8d7ed57b347bec046ab1ca7b2227286fec).
[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21246 retest this please
[GitHub] spark pull request #21398: [SPARK-24338][SQL] Fixed Hive CREATETABLE error i...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21398#discussion_r190494840 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -230,11 +232,29 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat // specify location for managed table. And in [[CreateDataSourceTableAsSelectCommand]] we have // to create the table directory and write out data before we create this table, to avoid // exposing a partial written table. -val needDefaultTableLocation = tableDefinition.tableType == MANAGED && - tableDefinition.storage.locationUri.isEmpty - -val tableLocation = if (needDefaultTableLocation) { - Some(CatalogUtils.stringToURI(defaultTablePath(tableDefinition.identifier))) +// +// When using a remote metastore, and if a managed table is being created with its +// location explicitly set to the location where it would be created anyway, then do +// not set its location explicitly. This avoids an issue with Sentry in secure clusters. +// Otherwise, the above comment applies. --- End diff -- > IIRC you can't change the location of the default database in Hive good to know that, maybe we should apply the same rule to Spark. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
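The location-handling rule under discussion can be sketched outside Hive. Everything here (`LocationNormalizer`, `defaultTablePath`) is a hypothetical illustration of "drop an explicitly requested location when it equals the default the metastore would pick anyway", not the actual `HiveExternalCatalog` code:

```scala
import java.net.URI

// Hypothetical sketch: if a managed table's requested location equals the
// default location the metastore would derive anyway, omit it so the metastore
// (and Sentry, per the comment above) treats the table as default-located.
object LocationNormalizer {
  def defaultTablePath(warehouse: URI, db: String, table: String): URI =
    new URI(s"${warehouse.toString.stripSuffix("/")}/$db.db/$table")

  def normalize(requested: Option[URI], warehouse: URI, db: String, table: String): Option[URI] =
    requested.filterNot(_ == defaultTablePath(warehouse, db, table))
}
```

A requested location that differs from the default is kept as-is; only the redundant explicit default is dropped.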
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r190494243 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/planning/SelectedFieldSuite.scala --- @@ -0,0 +1,432 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.planning + +import org.scalatest.BeforeAndAfterAll +import org.scalatest.exceptions.TestFailedException + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.catalyst.dsl.plans._ +import org.apache.spark.sql.catalyst.expressions.NamedExpression +import org.apache.spark.sql.catalyst.parser.CatalystSqlParser +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation +import org.apache.spark.sql.types._ + +// scalastyle:off line.size.limit +class SelectedFieldSuite extends SparkFunSuite with BeforeAndAfterAll { + // The test schema as a tree string, i.e. 
`schema.treeString` + // root + // |-- col1: string (nullable = false) + // |-- col2: struct (nullable = true) + // ||-- field1: integer (nullable = true) + // ||-- field2: array (nullable = true) + // |||-- element: integer (containsNull = false) + // ||-- field3: array (nullable = false) + // |||-- element: struct (containsNull = true) + // ||||-- subfield1: integer (nullable = true) + // ||||-- subfield2: integer (nullable = true) + // ||||-- subfield3: array (nullable = true) + // |||||-- element: integer (containsNull = true) + // ||-- field4: map (nullable = true) + // |||-- key: string + // |||-- value: struct (valueContainsNull = false) + // ||||-- subfield1: integer (nullable = true) + // ||||-- subfield2: array (nullable = true) + // |||||-- element: integer (containsNull = false) + // ||-- field5: array (nullable = false) + // |||-- element: struct (containsNull = true) + // ||||-- subfield1: struct (nullable = false) + // |||||-- subsubfield1: integer (nullable = true) + // |||||-- subsubfield2: integer (nullable = true) + // ||||-- subfield2: struct (nullable = true) + // |||||-- subsubfield1: struct (nullable = true) + // ||||||-- subsubsubfield1: string (nullable = true) + // |||||-- subsubfield2: integer (nullable = true) + // ||-- field6: struct (nullable = true) + // |||-- subfield1: string (nullable = false) + // |||-- subfield2: string (nullable = true) + // ||-- field7: struct (nullable = true) + // |||-- subfield1: struct (nullable = true) + // ||||-- subsubfield1: integer (nullable = true) + // ||||-- subsubfield2: integer (nullable = true) + // ||-- field8: map (nullable = true) + // |||-- key: string + // |||-- value: array (valueContainsNull = false) + // ||||-- element: struct (containsNull = true) + // |||||-- subfield1: integer (nullable = true) + // |||||-- subfield2: array (nullable = true) + // ||||||-- element: integer (containsNull = false) + // ||-- field9: map (nullable = true) + // |||-- key: string + // |||-- value: integer 
(valueContainsNull = false) + // |-- col3: array (nullable = false) + // ||-- element: struct (containsNull = false) + // |||-- field1: struct (nullable = true) + // ||||-- subfield1: integer (nullable = false) + // ||||-- subfield2: integer (nullable = true) + // |||-- field2: map (nullable = true) + // ||||-- key: string + // ||||-- value: integer (valueContainsN
[GitHub] spark pull request #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks w...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21369#discussion_r190494153 --- Diff: core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala --- @@ -414,6 +415,99 @@ class ExternalAppendOnlyMapSuite extends SparkFunSuite with LocalSparkContext { sc.stop() } + test("spill during iteration") { +val size = 1000 +val conf = createSparkConf(loadDefaults = true) +sc = new SparkContext("local-cluster[1,1,1024]", "test", conf) +val map = createExternalMap[Int] + +map.insertAll((0 until size).iterator.map(i => (i / 10, i))) +assert(map.numSpills == 0, "map was not supposed to spill") + +val it = map.iterator +assert( it.isInstanceOf[CompletionIterator[_, _]]) +val underlyingIt = map.readingIterator +assert( underlyingIt != null ) +val underlyingMapIterator = underlyingIt.upstream +assert(underlyingMapIterator != null) +val underlyingMapIteratorClass = underlyingMapIterator.getClass +assert(underlyingMapIteratorClass.getEnclosingClass == classOf[AppendOnlyMap[_, _]]) + +val underlyingMap = map.currentMap +assert(underlyingMap != null) + +val first50Keys = for ( _ <- 0 until 50) yield { + val (k, vs) = it.next + val sortedVs = vs.sorted + assert(sortedVs.seq == (0 until 10).map(10 * k + _)) + k +} +assert( map.numSpills == 0 ) +map.spill(Long.MaxValue, null) +// these asserts try to show that we're no longer holding references to the underlying map. +// it'd be nice to use something like +// https://github.com/scala/scala/blob/2.13.x/test/junit/scala/tools/testing/AssertUtil.scala +// (lines 69-89) +assert(map.currentMap == null) +assert(underlyingIt.upstream ne underlyingMapIterator) +assert(underlyingIt.upstream.getClass != underlyingMapIteratorClass) +assert(underlyingIt.upstream.getClass.getEnclosingClass != classOf[AppendOnlyMap[_, _]]) --- End diff -- can we simply check `assert(underlyingIt.upstream eq Iterator.empty)`? 
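The leak this test guards against amounts to an iterator retaining a reference to its (large) upstream after completion. A minimal sketch of the release pattern, with illustrative names rather than Spark's actual `CompletionIterator`:

```scala
// Illustrative sketch: a wrapper that drops its reference to the upstream
// iterator once it is exhausted, so the (potentially large) upstream can be GC'd.
class ReleasingIterator[A](private var upstream: Iterator[A]) extends Iterator[A] {
  def hasNext: Boolean = {
    if (upstream != null && !upstream.hasNext) {
      upstream = null  // release the reference as soon as we are done
    }
    upstream != null
  }

  def next(): A = {
    if (!hasNext) throw new NoSuchElementException("next on exhausted iterator")
    upstream.next()
  }

  // Exposed so the release can be asserted on, as the Spark test does
  // by inspecting the upstream field.
  def upstreamReleased: Boolean = upstream == null
}
```

After the last element is consumed, `upstreamReleased` is true, which is the property the asserts in the test above are checking on the real map iterator.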
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r190493689 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -0,0 +1,154 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.parquet + +import org.apache.spark.sql.catalyst.expressions.{And, Attribute, Expression, NamedExpression} +import org.apache.spark.sql.catalyst.planning.{PhysicalOperation, ProjectionOverSchema, SelectedField} +import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructField, StructType} + +/** + * Prunes unnecessary Parquet columns given a [[PhysicalOperation]] over a + * [[ParquetRelation]]. By "Parquet column", we mean a column as defined in the + * Parquet format. 
In Spark SQL, a root-level Parquet column corresponds to a + * SQL column, and a nested Parquet column corresponds to a [[StructField]]. + */ +private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = +if (SQLConf.get.nestedSchemaPruningEnabled) { + apply0(plan) +} else { + plan +} + + private def apply0(plan: LogicalPlan): LogicalPlan = +plan transformDown { + case op @ PhysicalOperation(projects, filters, + l @ LogicalRelation(hadoopFsRelation @ HadoopFsRelation(_, partitionSchema, +dataSchema, _, parquetFormat: ParquetFileFormat, _), _, _, _)) => +val projectionFields = projects.flatMap(getFields) +val filterFields = filters.flatMap(getFields) +val requestedFields = (projectionFields ++ filterFields).distinct + +// If [[requestedFields]] includes a nested field, continue. Otherwise, +// return [[op]] +if (requestedFields.exists { case (_, optAtt) => optAtt.isEmpty }) { + val prunedSchema = requestedFields +.map { case (field, _) => StructType(Array(field)) } +.reduceLeft(_ merge _) + val dataSchemaFieldNames = dataSchema.fieldNames.toSet + val prunedDataSchema = +StructType(prunedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) + + // If the data schema is different from the pruned data schema, continue. Otherwise, + // return [[op]]. We effect this comparison by counting the number of "leaf" fields in + // each schemata, assuming the fields in [[prunedDataSchema]] are a subset of the fields + // in [[dataSchema]]. 
+ if (countLeaves(dataSchema) > countLeaves(prunedDataSchema)) { +val prunedParquetRelation = + hadoopFsRelation.copy(dataSchema = prunedDataSchema)(hadoopFsRelation.sparkSession) + +// We need to replace the expression ids of the pruned relation output attributes +// with the expression ids of the original relation output attributes so that +// references to the original relation's output are not broken +val outputIdMap = l.output.map(att => (att.name, att.exprId)).toMap +val prunedRelationOutput = + prunedParquetRelation +.schema +.toAttributes +.map { + case att if outputIdMap.contains(att.name) => +att.withExprId(outputIdMap(att.name)) + case att => att +} +val prunedRelation = + l.copy(relation = prunedParquetRelation, output = prunedRelationOutput) + +
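The leaf-counting comparison described in the comment can be sketched on a toy schema type. The real rule walks Spark's `StructType`/`DataType` tree; `DType` below is a stand-in:

```scala
// Toy stand-in for Spark's DataType tree. countLeaves mirrors the idea in the
// comment above: a pruned schema is strictly smaller iff it has fewer leaf fields.
sealed trait DType
case object Leaf extends DType                       // e.g. IntegerType, StringType
case class Struct(fields: Seq[DType]) extends DType  // e.g. StructType
case class Arr(element: DType) extends DType         // e.g. ArrayType
case class MapT(key: DType, value: DType) extends DType

object SchemaLeaves {
  def countLeaves(t: DType): Int = t match {
    case Leaf         => 1
    case Struct(fs)   => fs.map(countLeaves).sum
    case Arr(e)       => countLeaves(e)
    case MapT(k, v)   => countLeaves(k) + countLeaves(v)
  }
}
```

Under the stated assumption that the pruned schema's fields are a subset of the original's, `countLeaves(pruned) < countLeaves(full)` exactly when pruning removed something, so the count comparison is a cheap way to decide whether to rewrite the relation.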
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r190492424 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -0,0 +1,154 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources.parquet + +import org.apache.spark.sql.catalyst.expressions.{And, Attribute, Expression, NamedExpression} +import org.apache.spark.sql.catalyst.planning.{PhysicalOperation, ProjectionOverSchema, SelectedField} +import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project} +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructField, StructType} + +/** + * Prunes unnecessary Parquet columns given a [[PhysicalOperation]] over a + * [[ParquetRelation]]. By "Parquet column", we mean a column as defined in the + * Parquet format. 
In Spark SQL, a root-level Parquet column corresponds to a + * SQL column, and a nested Parquet column corresponds to a [[StructField]]. + */ +private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = +if (SQLConf.get.nestedSchemaPruningEnabled) { + apply0(plan) +} else { + plan +} + + private def apply0(plan: LogicalPlan): LogicalPlan = +plan transformDown { + case op @ PhysicalOperation(projects, filters, + l @ LogicalRelation(hadoopFsRelation @ HadoopFsRelation(_, partitionSchema, +dataSchema, _, parquetFormat: ParquetFileFormat, _), _, _, _)) => +val projectionFields = projects.flatMap(getFields) +val filterFields = filters.flatMap(getFields) +val requestedFields = (projectionFields ++ filterFields).distinct + +// If [[requestedFields]] includes a nested field, continue. Otherwise, +// return [[op]] +if (requestedFields.exists { case (_, optAtt) => optAtt.isEmpty }) { + val prunedSchema = requestedFields +.map { case (field, _) => StructType(Array(field)) } +.reduceLeft(_ merge _) + val dataSchemaFieldNames = dataSchema.fieldNames.toSet + val prunedDataSchema = +StructType(prunedSchema.filter(f => dataSchemaFieldNames.contains(f.name))) + + // If the data schema is different from the pruned data schema, continue. Otherwise, + // return [[op]]. We effect this comparison by counting the number of "leaf" fields in + // each schemata, assuming the fields in [[prunedDataSchema]] are a subset of the fields + // in [[dataSchema]]. 
+ if (countLeaves(dataSchema) > countLeaves(prunedDataSchema)) { +val prunedParquetRelation = + hadoopFsRelation.copy(dataSchema = prunedDataSchema)(hadoopFsRelation.sparkSession) + +// We need to replace the expression ids of the pruned relation output attributes +// with the expression ids of the original relation output attributes so that +// references to the original relation's output are not broken +val outputIdMap = l.output.map(att => (att.name, att.exprId)).toMap +val prunedRelationOutput = + prunedParquetRelation +.schema +.toAttributes +.map { + case att if outputIdMap.contains(att.name) => +att.withExprId(outputIdMap(att.name)) + case att => att +} +val prunedRelation = + l.copy(relation = prunedParquetRelation, output = prunedRelationOutput) + +
[GitHub] spark pull request #21421: Myfeature
Github user gentlewangyu closed the pull request at: https://github.com/apache/spark/pull/21421
[GitHub] spark pull request #21421: Myfeature
GitHub user gentlewangyu opened a pull request: https://github.com/apache/spark/pull/21421 Myfeature ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/gentlewangyu/spark myfeature Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21421.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21421 commit 9949fed1c45865b6e5e8ebe610789c5fb9546052 Author: Corey Woodfield Date: 2017-07-19T22:21:38Z [SPARK-21333][DOCS] Removed invalid joinTypes from javadoc of Dataset#joinWith ## What changes were proposed in this pull request? Two invalid join types were mistakenly listed in the javadoc for joinWith, in the Dataset class. I presume these were copied from the javadoc of join, but since joinWith returns a Dataset\, left_semi and left_anti are invalid, as they only return values from one of the datasets, instead of from both ## How was this patch tested? 
I ran the following code:
```
public static void main(String[] args) {
  SparkSession spark = new SparkSession(new SparkContext("local[*]", "Test"));
  Dataset one = spark.createDataFrame(Arrays.asList(new Bean(1), new Bean(2), new Bean(3), new Bean(4), new Bean(5)), Bean.class);
  Dataset two = spark.createDataFrame(Arrays.asList(new Bean(4), new Bean(5), new Bean(6), new Bean(7), new Bean(8), new Bean(9)), Bean.class);
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "inner").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "cross").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "outer").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full_outer").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_outer").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right_outer").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_semi").show();} catch (Exception e) {e.printStackTrace();}
  try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_anti").show();} catch (Exception e) {e.printStackTrace();}
}
```
which tests all the different join types, and the last two (left_semi and left_anti) threw exceptions. The same code using join instead of joinWith did fine. The Bean class was just a java bean with a single int field, x.
Author: Corey Woodfield Closes #18462 from coreywoodfield/master. (cherry picked from commit 8cd9cdf17a7a4ad6f2eecd7c4b388ca363c20982) Signed-off-by: gatorsmile commit 88dccda393bc79dc6032f71b6acf8eb2b4b152be Author: Dhruve Ashar Date: 2017-07-21T19:03:46Z [SPARK-21243][CORE] Limit no. of map outputs in a shuffle fetch For configurations with external shuffle enabled, we have observed that if a very large no. of blocks are being fetched from a remote host, it puts the NM under extra pressure and can crash it. This change introduces a configuration `spark.reducer.maxBlocksInFlightPerAddress` , to limit the no. of map outputs being fetched from a given remote address. The changes applied here are applicable for both the scenarios - when external shuffle is enabled as well as disabled. Ran the job with the default configuration which does not change the existing behavior and ran it with few configurations of lower values -10,20,50,100. The job ran fine and there is no change in the output. (I will update the metrics related to NM in some time.) Author: Dhruve Ashar Closes #18487 from dhruve/impr/SPARK-21243. Author: Dhruve Ashar Closes #18691 from dhruve
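The idea behind `spark.reducer.maxBlocksInFlightPerAddress` in the commit above can be sketched as per-address chunking of block requests. The types and names below are illustrative, not the shuffle fetcher's actual structures:

```scala
// Illustrative sketch: split block requests so that no more than
// maxBlocksPerAddress blocks are requested from one remote address at a time,
// keeping any single NodeManager from being overwhelmed.
object FetchPlanner {
  def plan(blocksByAddress: Map[String, Seq[String]],
           maxBlocksPerAddress: Int): Seq[(String, Seq[String])] =
    blocksByAddress.toSeq.flatMap { case (addr, blocks) =>
      blocks.grouped(maxBlocksPerAddress).toSeq.map(group => (addr, group))
    }
}
```

Every block is still fetched; the limit only controls how many are requested from a given address in one batch, which is why the job output is unchanged across the tested settings (10, 20, 50, 100).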
[GitHub] spark issue #21391: [SPARK-24343][SQL] Avoid shuffle for the bucketed table ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21391 **[Test build #91093 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91093/testReport)** for PR 21391 at commit [`d9a440a`](https://github.com/apache/spark/commit/d9a440a9814913827fcfcff644c741a43332b02d).
[GitHub] spark issue #21391: [SPARK-24343][SQL] Avoid shuffle for the bucketed table ...
Github user yucai commented on the issue: https://github.com/apache/spark/pull/21391 retest this please
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r190491050 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala --- @@ -879,6 +879,15 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext } } } + + test("select function over nested data") { --- End diff -- I did some forensics to understand the origin and reason for this test. It was part of the first commit of my original PR almost two years ago. Even then, this test passes without the rest of the commit. So I can't say why I added this test except perhaps that I felt it was useful code coverage. In any case, if you don't think it's a valuable contribution in this PR I'd rather just remove it entirely.
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21416 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3542/ Test PASSed.
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21416 Merged build finished. Test PASSed.
[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19560 **[Test build #91092 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91092/testReport)** for PR 19560 at commit [`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).
[GitHub] spark pull request #21419: Branch 2.2
Github user gentlewangyu closed the pull request at: https://github.com/apache/spark/pull/21419
[GitHub] spark issue #21419: Branch 2.2
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21419 @gentlewangyu, please close this and read https://spark.apache.org/contributing.html. Questions should go to mailing list and issues should be filed in JIRA.
[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21420 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3541/ Test PASSed.
[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21420 Merged build finished. Test PASSed.
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21416 **[Test build #91091 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91091/testReport)** for PR 21416 at commit [`de3296c`](https://github.com/apache/spark/commit/de3296c116b07d6560b1f89ca97c2014d36d10a8).
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r190486220

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1256,8 +1256,18 @@ object SQLConf {
       "issues. Turn on this config to insert a local sort before actually doing repartition " +
       "to generate consistent repartition results. The performance of repartition() may go " +
       "down since we insert extra local sort before it.")
+      .booleanConf
+      .createWithDefault(true)
+
+  val NESTED_SCHEMA_PRUNING_ENABLED =
+    buildConf("spark.sql.nestedSchemaPruning.enabled")
+      .internal()
+      .doc("Prune nested fields from a logical relation's output which are unnecessary in " +
+        "satisfying a query. This optimization allows columnar file format readers to avoid " +
+        "reading unnecessary nested column data. Currently Parquet is the only data source that " +
--- End diff --

ORC should be able to support this capability as well, but this PR does not address that.
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/21416 retest this please
[GitHub] spark pull request #21069: [SPARK-23920][SQL]add array_remove to remove all ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21069#discussion_r190485891

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -1882,3 +1882,123 @@ case class ArrayRepeat(left: Expression, right: Expression)
   }
 }
+
+/**
+ * Remove all elements that equal to element from the given array
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(array, element) - Remove all elements that equal to element from array.",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(array(1, 2, 3, null, 3), 3);
+       [1,2,null]
+  """, since = "2.4.0")
+case class ArrayRemove(left: Expression, right: Expression)
+  extends BinaryExpression with ImplicitCastInputTypes {
+
+  override def dataType: DataType = left.dataType
+
+  override def inputTypes: Seq[AbstractDataType] =
+    Seq(ArrayType, left.dataType.asInstanceOf[ArrayType].elementType)
+
+  lazy val elementType: DataType = left.dataType.asInstanceOf[ArrayType].elementType
+
+  @transient private lazy val ordering: Ordering[Any] =
+    TypeUtils.getInterpretedOrdering(right.dataType)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (!left.dataType.isInstanceOf[ArrayType]
+        || left.dataType.asInstanceOf[ArrayType].elementType != right.dataType) {
+      TypeCheckResult.TypeCheckFailure(
+        "Arguments must be an array followed by a value of same type as the array members")
+    } else {
+      TypeUtils.checkForOrderingExpr(right.dataType, s"function $prettyName")
+    }
+  }
+
+  override def nullSafeEval(arr: Any, value: Any): Any = {
+    val newArray = new Array[Any](arr.asInstanceOf[ArrayData].numElements())
+    var pos = 0
+    arr.asInstanceOf[ArrayData].foreach(right.dataType, (i, v) =>
+      if (v == null) {
+        if (value != null) {
--- End diff --

nit: Do we need this check since we are in `nullSafeEval`?
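For intuition, the semantics of the `ArrayRemove` expression under review can be sketched in plain Python (a hypothetical illustration, not Spark code): every element equal to the given value is dropped, null elements in the array are kept, and a null input yields null. It also shows why the reviewer's nit holds: once the null-safe wrapper has ruled out a null `value`, an inner null check on it is redundant.

```python
def array_remove(arr, value):
    """Sketch of SQL array_remove semantics: drop elements equal to value.

    None models SQL NULL: a NULL input array or NULL value yields NULL,
    and a NULL element never equals a non-NULL value, so it is kept.
    """
    if arr is None or value is None:
        return None  # mirrors the null-safe evaluation wrapper
    # No further null check on `value` is needed below -- it is non-null here.
    return [v for v in arr if v is None or v != value]

print(array_remove([1, 2, 3, None, 3], 3))  # -> [1, 2, None]
```

This matches the SQL example in the patch's doc string: `SELECT array_remove(array(1, 2, 3, null, 3), 3)` returns `[1,2,null]`.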
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r190485768

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -286,7 +286,19 @@ case class FileSourceScanExec(
       }
     } getOrElse {
       metadata
     }
-    withOptPartitionCount
+    val withOptColumnCount = relation.fileFormat match {
+      case columnar: ColumnarFileFormat =>
+        SparkSession
+          .getActiveSession
+          .map { sparkSession =>
+            val columnCount = columnar.columnCountForSchema(sparkSession, requiredSchema)
+            withOptPartitionCount + ("ColumnCount" -> columnCount.toString)
--- End diff --

Replied above.
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/21416 test this again
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r190485713

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ColumnarFileFormat.scala ---
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.types.StructType
+
+/**
+ * An optional mix-in for columnar [[FileFormat]]s. This trait provides some helpful metadata when
+ * debugging a physical query plan.
+ */
+private[sql] trait ColumnarFileFormat {
--- End diff --

I'm okay with adding this as a separate PR as requested, but I wouldn't want to ship this PR in a release without being able to see the physical column count associated with a query in a query plan. That is invaluable functionality in validating that physical column pruning is occurring as expected.
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r190485347

--- Diff: python/pyspark/sql/tests.py ---
@@ -3040,6 +3040,50 @@ def test_csv_sampling_ratio(self):
             .csv(rdd, samplingRatio=0.5).schema
         self.assertEquals(schema, StructType([StructField("_c0", IntegerType(), True)]))

+    def _get_content(self, content):
+        """
+        Strips leading spaces from content up to the first '|' in each line.
+        """
+        import re
+        pattern = re.compile(r'^ *\|', re.MULTILINE)
--- End diff --

We don't have to compile the pattern each time here since it's not going to be reused. You could just put this into `re.sub` directly, I believe.
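The suggestion above can be sketched as follows: when a regex is used only once, pass the pattern string straight to `re.sub` instead of pre-compiling it (a minimal standalone illustration; the sample text and variable names are not from the PR):

```python
import re

sample = "    |line one\n  |line two"

# One-off use: no explicit compiled pattern object needed. The re module
# also caches recently compiled patterns internally, so repeated calls
# with the same pattern string do not recompile it each time.
stripped = re.sub(r'^ *\|', '', sample, flags=re.MULTILINE)
print(stripped)  # prints "line one" then "line two"
```

Pre-compiling with `re.compile` mainly pays off when the same pattern object is reused across many calls in a hot loop, or when you want to name and document the pattern once.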
[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21420 **[Test build #91090 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91090/testReport)** for PR 21420 at commit [`a41c99b`](https://github.com/apache/spark/commit/a41c99bf311aa8f4e0c2e07c1288f5a11e057ea4).
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r190484386

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala ---
@@ -99,27 +100,28 @@ trait ConstraintHelper {
   }

   /**
-   * Infer the Attribute-specific IsNotNull constraints from the null intolerant child expressions
-   * of constraints.
+   * Infer the Attribute and ExtractValue-specific IsNotNull constraints from the null intolerant
+   * child expressions of constraints.
    */
   private def inferIsNotNullConstraints(constraint: Expression): Seq[Expression] = constraint match {
     // When the root is IsNotNull, we can push IsNotNull through the child null intolerant
     // expressions
-    case IsNotNull(expr) => scanNullIntolerantAttribute(expr).map(IsNotNull(_))
+    case IsNotNull(expr) => scanNullIntolerantField(expr).map(IsNotNull(_))
     // Constraints always return true for all the inputs. That means, null will never be returned.
     // Thus, we can infer `IsNotNull(constraint)`, and also push IsNotNull through the child
     // null intolerant expressions.
-    case _ => scanNullIntolerantAttribute(constraint).map(IsNotNull(_))
+    case _ => scanNullIntolerantField(constraint).map(IsNotNull(_))
   }

   /**
-   * Recursively explores the expressions which are null intolerant and returns all attributes
-   * in these expressions.
+   * Recursively explores the expressions which are null intolerant and returns all attributes and
+   * complex type extractors in these expressions.
    */
-  private def scanNullIntolerantAttribute(expr: Expression): Seq[Attribute] = expr match {
+  private def scanNullIntolerantField(expr: Expression): Seq[Expression] = expr match {
+    case ev: ExtractValue => Seq(ev)
--- End diff --

I agree that adding more direct and independent test coverage for this change is a good idea. However, omitting this change would weaken the capabilities of this PR. It would also imply the removal of the failing test case in `ParquetSchemaPruningSuite`, which would in turn imply two follow-on PRs: the first to add this specific change plus the right test coverage, and the next to restore the test case removed from `ParquetSchemaPruningSuite`. Let me suggest an alternative. As this change is a valuable enhancement for this PR, let me try adding an appropriate test case in `InferFiltersFromConstraintsSuite` as part of this PR. That will eliminate the need for two more follow-on PRs.
[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/21420 CC @HyukjinKwon @vanzin please help to review, thanks!
[GitHub] spark pull request #21420: [SPARK-24377][Spark Submit] make --py-files work ...
GitHub user jerryshao opened a pull request: https://github.com/apache/spark/pull/21420

[SPARK-24377][Spark Submit] make --py-files work in non pyspark application

## What changes were proposed in this pull request?

Some Spark applications, though they are Java programs, require not only jar dependencies but also Python dependencies. One example is the Livy remote SparkContext application: it is actually an embedded REPL for Scala/Python/R, so it loads not only jar dependencies but also Python and R dependencies, and we should be able to specify not only "--jars" but also "--py-files".

Currently, --py-files only works for a PySpark application, so it does not help in the above case. Here we propose to remove that restriction. We also found that "spark.submit.pyFiles" supports only a quite limited scenario (client mode with local deps), so this change also expands the usage of "spark.submit.pyFiles" to serve as an alternative to --py-files.

## How was this patch tested?

UT added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jerryshao/apache-spark SPARK-24377

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21420.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21420

commit a41c99bf311aa8f4e0c2e07c1288f5a11e057ea4
Author: jerryshao
Date: 2018-05-24T06:53:23Z

    make --py-files work in non pyspark application
[GitHub] spark issue #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21295 Merged build finished. Test FAILed.
[GitHub] spark issue #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase w...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21295 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91086/ Test FAILed.
[GitHub] spark issue #21069: [SPARK-23920][SQL]add array_remove to remove all element...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21069 Merged build finished. Test FAILed.
[GitHub] spark issue #21069: [SPARK-23920][SQL]add array_remove to remove all element...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21069 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91081/ Test FAILed.
[GitHub] spark issue #21391: [SPARK-24343][SQL] Avoid shuffle for the bucketed table ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21391 Merged build finished. Test FAILed.
[GitHub] spark issue #21391: [SPARK-24343][SQL] Avoid shuffle for the bucketed table ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21391 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91088/ Test FAILed.
[GitHub] spark issue #21399: [SPARK-22269][BUILD] Run Java linter via SBT for Jenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21399 Merged build finished. Test FAILed.
[GitHub] spark issue #21390: [SPARK-24340][Core] Clean up non-shuffle disk block mana...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21390 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91083/ Test FAILed.