[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19861 **[Test build #84433 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84433/testReport)** for PR 19861 at commit [`ec5723c`](https://github.com/apache/spark/commit/ec5723c194474c85af0c4bd6265c6f7b0781881e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19865: [SPARK-22668][SQL] Do not pass global variables to argum...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19865 Thank you. I think so for this case. In general, to make global `ev.value` or to pass a global variable to `consume()` may potentially cause this problem. As @cloud-fan suggested, I identified these patterns and am fixing them. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19878 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84431/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19881: [SPARK-22683][CORE] Add tasksPerExecutorSlot parameter
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19881 Please see JIRA. I don't think this is worth doing. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19878 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19878 **[Test build #84431 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84431/testReport)** for PR 19878 at commit [`0e9998e`](https://github.com/apache/spark/commit/0e9998e0704b54d8f1352a1936c9b6367ebee15e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19881: [SPARK-22683][CORE] Add tasksPerExecutorSlot parameter
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19881 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19877: [SPARK-22681]Accumulator should only be updated o...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19877#discussion_r154729236 --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala --- @@ -1832,6 +1832,27 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi assertDataStructuresEmpty() } + test("accumulator not calculated for resubmitted task in result stage") { +// just for register --- End diff -- nit: unnecessary (and confusing?) comment. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19881: [SPARK-22683][CORE] Add tasksPerExecutorSlot para...
GitHub user jcuquemelle opened a pull request: https://github.com/apache/spark/pull/19881 [SPARK-22683][CORE] Add tasksPerExecutorSlot parameter ## What changes were proposed in this pull request? let's say an executor has spark.executor.cores / spark.task.cpus taskSlots The current dynamic allocation policy allocates enough executors to have each taskSlot execute a single task, which wastes resources when tasks are small regarding executor allocation overhead. By adding the tasksPerExecutorSlot, it is made possible to specify how many tasks a single slot should ideally execute to mitigate the overhead of executor allocation. ## How was this patch tested? Units tests and runs on various actual workloads on a Yarn Cluster You can merge this pull request into a Git repository by running: $ git pull https://github.com/jcuquemelle/spark AddTaskPerExecutorSlot Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19881.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19881 commit 895882feebc53f44a70278e0b475b2fb937d331a Author: Julien CuquemelleDate: 2017-11-30T16:28:06Z [SPARK-22683][CORE] Allow tuning the number of dynamically allocated executors let's say an executor has spark.executor.cores / spark.task.cpus taskSlots The current dynamic allocation policy allocates enough executors to have each taskSlot execute a single task, which wastes resources when tasks are small regarding executor allocation overhead. By adding the tasksPerExecutorSlot, it is made possible to specify how many tasks a single slot should ideally execute to mitigate the overhead of executor allocation. commit fce3b976d0b22c4d01ef4fdd5339835bc6d6fcb1 Author: Julien Cuquemelle Date: 2017-11-30T16:28:06Z [SPARK-22683][DOC] document tasksPerExecutorSlot parameter --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19717 **[Test build #84438 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84438/testReport)** for PR 19717 at commit [`cfcf2a7`](https://github.com/apache/spark/commit/cfcf2a7622f2bd4a21ea2d5197ec652363645aca). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19717 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19717 **[Test build #84437 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84437/testReport)** for PR 19717 at commit [`05f528a`](https://github.com/apache/spark/commit/05f528a0178fa0f567b8904780d837a4f95d888b). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19717 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84437/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19717 **[Test build #84437 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84437/testReport)** for PR 19717 at commit [`05f528a`](https://github.com/apache/spark/commit/05f528a0178fa0f567b8904780d837a4f95d888b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19880: [SPARK-22626][SQL][FOLLOWUP] improve documentation and s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19880 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84434/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19880: [SPARK-22626][SQL][FOLLOWUP] improve documentation and s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19880 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19880: [SPARK-22626][SQL][FOLLOWUP] improve documentation and s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19880 **[Test build #84434 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84434/testReport)** for PR 19880 at commit [`9be829d`](https://github.com/apache/spark/commit/9be829d208f7e2d6a88b9d2008fc04eec4a4ad8e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154726920 --- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile --- @@ -0,0 +1,43 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +FROM openjdk:8-alpine --- End diff -- Seems openjdk is under GPL2. Should they be listed following `This product optionally depends on 'Webbit'` in a similar format and with the license files included under `license`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user rezasafi commented on the issue: https://github.com/apache/spark/pull/19848 Thank you very much @vanzin, @mridulm and @jiangxb1987. I really appreciate it. I will create PR for branch 2.2 ASAP. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154724747 --- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile --- @@ -0,0 +1,43 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +FROM openjdk:8-alpine + +# If this docker file is being used in the context of building your images from a Spark distribution, the docker build +# command should be invoked from the top level directory of the Spark distribution. E.g.: +# docker build -t spark-base:latest -f dockerfiles/spark-base/Dockerfile . + +RUN apk upgrade --no-cache && \ --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154721032 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala --- @@ -0,0 +1,182 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s.submit + +import java.util.concurrent.{CountDownLatch, TimeUnit} + +import scala.collection.JavaConverters._ + +import io.fabric8.kubernetes.api.model.{ContainerStateRunning, ContainerStateTerminated, ContainerStateWaiting, ContainerStatus, Pod, Time} +import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher} +import io.fabric8.kubernetes.client.Watcher.Action + +import org.apache.spark.SparkException +import org.apache.spark.internal.Logging +import org.apache.spark.util.ThreadUtils + +private[k8s] trait LoggingPodStatusWatcher extends Watcher[Pod] { + def awaitCompletion(): Unit +} + +/** + * A monitor for the running Kubernetes pod of a Spark application. Status logging occurs on + * every state change and also at an interval for liveness. + * + * @param appId application ID. + * @param maybeLoggingInterval ms between each state request. If provided, must be a positive + * number. + */ +private[k8s] class LoggingPodStatusWatcherImpl( +appId: String, +maybeLoggingInterval: Option[Long]) + extends LoggingPodStatusWatcher with Logging { + + private val podCompletedFuture = new CountDownLatch(1) + // start timer for periodic logging + private val scheduler = + ThreadUtils.newDaemonSingleThreadScheduledExecutor("logging-pod-status-watcher") + private val logRunnable: Runnable = new Runnable { +override def run() = logShortStatus() + } + + private var pod = Option.empty[Pod] + + private def phase: String = pod.map(_.getStatus.getPhase).getOrElse("unknown") + + def start(): Unit = { +maybeLoggingInterval.foreach { interval => + require(interval > 0, s"Logging interval must be a positive time value, got: $interval ms.") + scheduler.scheduleAtFixedRate(logRunnable, 0, interval, TimeUnit.MILLISECONDS) +} + } + + override def eventReceived(action: Action, pod: Pod): Unit = { +this.pod = Option(pod) +action match { + case Action.DELETED | Action.ERROR => +closeWatch() + + case _ => +logLongStatus() +if (hasCompleted()) { + closeWatch() +} +} + } + + override def onClose(e: KubernetesClientException): Unit = { +logDebug(s"Stopping watching application $appId with last-observed phase $phase") +closeWatch() + } + + private def logShortStatus() = { +logInfo(s"Application status for $appId (phase: $phase)") + } + + private def logLongStatus() = { +logInfo("State changed, new state: " + pod.map(formatPodState).getOrElse("unknown")) + } + + private def hasCompleted(): Boolean = { +phase == "Succeeded" || phase == "Failed" + } + + private def closeWatch(): Unit = { +podCompletedFuture.countDown() +scheduler.shutdown() + } + + private def formatPodState(pod: Pod): String = { +// TODO include specific container state --- End diff -- Actually it already includes the `containerStatuses`. Removed this TODO. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19840 I'm trying to understand what is https://github.com/apache/spark/blob/master/python/pyspark/context.py#L191 really achieving. It seems pretty broken to me and feels like the whole `pythonExec` tracking in the various places should be removed. It causes this problem because it forces the executor to use the driver's python even if it's been set to a different path by the user. It uses `python` instead of `sys.executable` as the default value. And it ignores the `spark.pyspark.python` config value if it's set. Instead, shouldn't the logic at https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L304 be used in `PythonRunner` (except for the driver python config) to find out the executor's python to use? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154719522 --- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile --- @@ -0,0 +1,43 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +FROM openjdk:8-alpine + +# If this docker file is being used in the context of building your images from a Spark distribution, the docker build +# command should be invoked from the top level directory of the Spark distribution. E.g.: +# docker build -t spark-base:latest -f dockerfiles/spark-base/Dockerfile . --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19870: [SPARK-22665][SQL] Avoid repartitioning with empty list ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19870 To ensure no external behavior change, we should use RoundRobinPartitioning when the expression list is empty, instead of issuing an exception. It will be also consistent with the existing behavior of `df.repartition(numPartitions)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19848: [SPARK-22162] Executors and the driver should use...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19848 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19848 (There was a conflict in 2.2, open a new PR if you want it there.) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/19848 Merging to master / 2.2. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19880: [SPARK-22626][SQL][FOLLOWUP] improve documentation and s...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19880 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19873 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19873 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84430/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19879: [SPARK-20706][SPARK-SHELL] Spark-shell not overriding me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19879 **[Test build #4003 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4003/testReport)** for PR 19879 at commit [`f265daa`](https://github.com/apache/spark/commit/f265daa68ba38f00aa274d1d431fe74e974f). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19873 **[Test build #84430 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84430/testReport)** for PR 19873 at commit [`9f5a0e4`](https://github.com/apache/spark/commit/9f5a0e458fa0cb42d6850e16d74994af1b1a3752). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class AnalysisBarrier(child: LogicalPlan) extends LeafNode ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19879: [SPARK-20706][SPARK-SHELL] Spark-shell not overriding me...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19879 **[Test build #4003 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4003/testReport)** for PR 19879 at commit [`f265daa`](https://github.com/apache/spark/commit/f265daa68ba38f00aa274d1d431fe74e974f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19879: [SPARK-20706][SPARK-SHELL] Spark-shell not overriding me...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19879 OK, so another back-port to fix the 2.11 REPL, which should be resolved in 2.12 anyway? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated D...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19875 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19854: SPARK-22660:Use position() and limit() to fix ambiguity ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19854 Ping @kellyzly --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19871 **[Test build #84436 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84436/testReport)** for PR 19871 at commit [`2393e1d`](https://github.com/apache/spark/commit/2393e1de729441b27bc5cdd83804071f14d77a4b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date fun...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19875 Merged to master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19871 @HyukjinKwon , for enabling the following test, I'm restructuring ORC tests now. I'll make a PR today for that. ``` ignore("LZO compression options for writing to an ORC file not supported in Hive 1.2.1") { ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #10621: [SPARK-12617][PySpark]Move Py4jCallbackConnectionCleaner...
Github user cpalomaressbd commented on the issue: https://github.com/apache/spark/pull/10621 Hi, One question, maybe is stupid question I dont know and I ask sorry in advance. We are workin with hortonworks, in the beginning of the project we were working with HDP 2.4.0 which one has the spark 1.6.0 and wich one has a bug in pyspark, this bug wich one you say resolved in: **zsxwing commented on 22 Feb 2016** The first solution we have done was upgrade to HDP 2.4.3 wich one has spark 1.6.2 and wich one teorically must to have the patch to this problem, but Surprise we have the same bug without the patch. Ok, Maybe the problem is with hortonworks and I should write in other forum, but I was in the oficial website of spark: https://spark.apache.org/downloads.html If you choose the oficial version of spark 1.6.2 and you download this version, wich one was released June 25 2016, if I go to pyspark file, the bug continue in this installation I am sorry but I dont understadn and We are desperated with this situation. Should I appply the patch in source and recompile the code? How can I do it step to step? Thanks in Advance. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154709256 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -568,8 +574,12 @@ object DataSource extends Logging { "org.apache.spark.Logging") /** Given a provider name, look up the data source class definition. */ - def lookupDataSource(provider: String): Class[_] = { -val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) + def lookupDataSource(conf: SQLConf, provider: String): Class[_] = { --- End diff -- So, are you suggesting `lookupDataSource(provider, useNewOrc=true)`, @jiangxb1987 ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19869 **[Test build #84435 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84435/testReport)** for PR 19869 at commit [`f0320d5`](https://github.com/apache/spark/commit/f0320d5f7b3406334a5250ecc2d9c5248142f34d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154707683 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -85,7 +87,8 @@ case class DataSource( case class SourceInfo(name: String, schema: StructType, partitionColumns: Seq[String]) - lazy val providingClass: Class[_] = DataSource.lookupDataSource(className) + lazy val providingClass: Class[_] = +DataSource.lookupDataSource(sparkSession.sessionState.conf, className) --- End diff -- Sure! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154707798 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala --- @@ -0,0 +1,160 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s + +import java.util.concurrent.TimeUnit + +import org.apache.spark.{SPARK_VERSION => sparkVersion} +import org.apache.spark.internal.Logging +import org.apache.spark.internal.config.ConfigBuilder +import org.apache.spark.network.util.ByteUnit + +private[spark] object Config extends Logging { + + val KUBERNETES_NAMESPACE = +ConfigBuilder("spark.kubernetes.namespace") + .doc("The namespace that will be used for running the driver and executor pods. When using " + +"spark-submit in cluster mode, this can also be passed to spark-submit via the " + +"--kubernetes-namespace command line argument.") + .stringConf + .createWithDefault("default") + + val DRIVER_DOCKER_IMAGE = +ConfigBuilder("spark.kubernetes.driver.docker.image") + .doc("Docker image to use for the driver. Specify this using the standard Docker tag format.") + .stringConf + .createWithDefault(s"spark-driver:$sparkVersion") + + val EXECUTOR_DOCKER_IMAGE = +ConfigBuilder("spark.kubernetes.executor.docker.image") + .doc("Docker image to use for the executors. Specify this using the standard Docker tag " + +"format.") + .stringConf + .createWithDefault(s"spark-executor:$sparkVersion") + + val DOCKER_IMAGE_PULL_POLICY = +ConfigBuilder("spark.kubernetes.docker.image.pullPolicy") + .doc("Docker image pull policy when pulling any docker image in Kubernetes integration") + .stringConf + .createWithDefault("IfNotPresent") + + + val KUBERNETES_AUTH_DRIVER_CONF_PREFIX = + "spark.kubernetes.authenticate.driver" + val KUBERNETES_AUTH_DRIVER_MOUNTED_CONF_PREFIX = + "spark.kubernetes.authenticate.driver.mounted" + val OAUTH_TOKEN_CONF_SUFFIX = "oauthToken" + val OAUTH_TOKEN_FILE_CONF_SUFFIX = "oauthTokenFile" + val CLIENT_KEY_FILE_CONF_SUFFIX = "clientKeyFile" + val CLIENT_CERT_FILE_CONF_SUFFIX = "clientCertFile" + val CA_CERT_FILE_CONF_SUFFIX = "caCertFile" + + val KUBERNETES_SERVICE_ACCOUNT_NAME = + ConfigBuilder(s"$KUBERNETES_AUTH_DRIVER_CONF_PREFIX.serviceAccountName") + .doc("Service account that is used when running the driver pod. The driver pod uses " + +"this service account when requesting executor pods from the API server. If specific " + +"credentials are given for the driver pod to use, the driver will favor " + +"using those credentials instead.") + .stringConf + .createOptional + + val KUBERNETES_DRIVER_LIMIT_CORES = +ConfigBuilder("spark.kubernetes.driver.limit.cores") + .doc("Specify the hard cpu limit for the driver pod") + .stringConf + .createOptional + + val KUBERNETES_EXECUTOR_LIMIT_CORES = +ConfigBuilder("spark.kubernetes.executor.limit.cores") + .doc("Specify the hard cpu limit for a single executor pod") + .stringConf + .createOptional + + val KUBERNETES_DRIVER_MEMORY_OVERHEAD = +ConfigBuilder("spark.kubernetes.driver.memoryOverhead") + .doc("The amount of off-heap memory (in megabytes) to be allocated for the driver and the " + +"driver submission server. This is memory that accounts for things like VM overheads, " + +"interned strings, other native overheads, etc. This tends to grow with the driver's " + +"memory size (typically 6-10%).") + .bytesConf(ByteUnit.MiB) + .createOptional + + // Note that while we set a default for this when we start up the + // scheduler, the specific default value is
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154707557 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -363,6 +363,14 @@ object SQLConf { .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo")) .createWithDefault("snappy") + val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion") +.doc("When true, use new OrcFileFormat in sql/core module instead of the one in sql/hive. " + + "Since new OrcFileFormat uses Apache ORC library instead of ORC library Hive 1.2.1, it is " + + "more stable and faster.") --- End diff -- Thank you for review, @HyukjinKwon . Do you mean `Apache ORC library is more stable, but new OrcFileFormat is not` because it's introduced newly? Actually, that's true in the Spark's viewpoint, but new OrcFileFormat contains more bug fixes and new features too. If you allow, I want to keep this. :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154707001 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala --- @@ -119,5 +139,60 @@ private[spark] object Config extends Logging { "must be a positive integer") .createWithDefault(10) + val WAIT_FOR_APP_COMPLETION = +ConfigBuilder("spark.kubernetes.submission.waitAppCompletion") + .doc("In cluster mode, whether to wait for the application to finish before exiting the " + +"launcher process.") + .booleanConf + .createWithDefault(true) + + val REPORT_INTERVAL = +ConfigBuilder("spark.kubernetes.report.interval") + .doc("Interval between reports of the current app status in cluster mode.") + .timeConf(TimeUnit.MILLISECONDS) + .createWithDefaultString("1s") + + private[spark] val JARS_DOWNLOAD_LOCATION = +ConfigBuilder("spark.kubernetes.mountDependencies.jarsDownloadDir") + .doc("Location to download jars to in the driver and executors. When using" + +" spark-submit, this directory must be empty and will be mounted as an empty directory" + +" volume on the driver and executor pod.") + .stringConf + .createWithDefault("/var/spark-data/spark-jars") + + private[spark] val FILES_DOWNLOAD_LOCATION = +ConfigBuilder("spark.kubernetes.mountDependencies.filesDownloadDir") + .doc("Location to download files to in the driver and executors. When using" + +" spark-submit, this directory must be empty and will be mounted as an empty directory" + +" volume on the driver and executor pods.") + .stringConf + .createWithDefault("/var/spark-data/spark-files") + + val KUBERNETES_AUTH_SUBMISSION_CONF_PREFIX = +"spark.kubernetes.authenticate.submission" + val KUBERNETES_NODE_SELECTOR_PREFIX = "spark.kubernetes.node.selector." + + val KUBERNETES_DRIVER_LABEL_PREFIX = "spark.kubernetes.driver.label." + val KUBERNETES_DRIVER_ANNOTATION_PREFIX = "spark.kubernetes.driver.annotation." + + val KUBERNETES_EXECUTOR_LABEL_PREFIX = "spark.kubernetes.executor.label." + val KUBERNETES_EXECUTOR_ANNOTATION_PREFIX = "spark.kubernetes.executor.annotation." + + val KUBERNETES_DRIVER_ENV_KEY = "spark.kubernetes.driverEnv." + + def getK8sMasterUrl(rawMasterString: String): String = { +require(rawMasterString.startsWith("k8s://"), + "Master URL should start with k8s:// in Kubernetes mode.") +val masterWithoutK8sPrefix = rawMasterString.substring("k8s://".length) +if (masterWithoutK8sPrefix.startsWith("http://;) --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19870: [SPARK-22665][SQL] Avoid repartitioning with empt...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19870#discussion_r154705369 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala --- @@ -838,6 +838,8 @@ case class RepartitionByExpression( numPartitions: Int) extends RepartitionOperation { require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.") + require(partitionExpressions.nonEmpty, s"${getClass.getSimpleName} requires a non empty set of " + --- End diff -- what was the behavior if `partitionExpressions` is empty? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19870: [SPARK-22665][SQL] Avoid repartitioning with empty list ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19870 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19870: [SPARK-22665][SQL] Avoid repartitioning with empty list ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19870 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84428/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19870: [SPARK-22665][SQL] Avoid repartitioning with empty list ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19870 **[Test build #84428 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84428/testReport)** for PR 19870 at commit [`b4f985f`](https://github.com/apache/spark/commit/b4f985fc81f4664de538e7bcae0e134bce8ae735). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19869 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19869 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84432/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19869 **[Test build #84432 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84432/testReport)** for PR 19869 at commit [`b1acd35`](https://github.com/apache/spark/commit/b1acd35d24ea449a97434b54759cf8ed441661f2). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19717#discussion_r154694277 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/DriverServiceBootstrapStep.scala --- @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.k8s.submit.steps + +import scala.collection.JavaConverters._ + +import io.fabric8.kubernetes.api.model.ServiceBuilder + +import org.apache.spark.SparkConf +import org.apache.spark.deploy.k8s.Config._ +import org.apache.spark.deploy.k8s.Constants._ +import org.apache.spark.deploy.k8s.submit.KubernetesDriverSpec +import org.apache.spark.internal.Logging +import org.apache.spark.util.Clock + +/** + * Allows the driver to be reachable by executor pods through a headless service. The service's + * ports should correspond to the ports that the executor will reach the pod at for RPC. + */ +private[spark] class DriverServiceBootstrapStep( +kubernetesResourceNamePrefix: String, +driverLabels: Map[String, String], +submissionSparkConf: SparkConf, +clock: Clock) extends DriverConfigurationStep with Logging { + import DriverServiceBootstrapStep._ + + override def configureDriver(driverSpec: KubernetesDriverSpec): KubernetesDriverSpec = { +require(submissionSparkConf.getOption(DRIVER_BIND_ADDRESS_KEY).isEmpty, + s"$DRIVER_BIND_ADDRESS_KEY is not supported in Kubernetes mode, as the driver's bind " + + "address is managed and set to the driver pod's IP address.") +require(submissionSparkConf.getOption(DRIVER_HOST_KEY).isEmpty, + s"$DRIVER_HOST_KEY is not supported in Kubernetes mode, as the driver's hostname will be " + + "managed via a Kubernetes service.") + +val preferredServiceName = s"$kubernetesResourceNamePrefix$DRIVER_SVC_POSTFIX" +val resolvedServiceName = if (preferredServiceName.length <= MAX_SERVICE_NAME_LENGTH) { + preferredServiceName +} else { + val randomServiceId = clock.getTimeMillis() + val shorterServiceName = s"spark-$randomServiceId$DRIVER_SVC_POSTFIX" + logWarning(s"Driver's hostname would preferably be $preferredServiceName, but this is " + +s"too long (must be <= $MAX_SERVICE_NAME_LENGTH characters). Falling back to use " + +s"$shorterServiceName as the driver service's name.") + shorterServiceName +} + +val driverPort = submissionSparkConf.getInt("spark.driver.port", DEFAULT_DRIVER_PORT) +val driverBlockManagerPort = submissionSparkConf.getInt( +org.apache.spark.internal.config.DRIVER_BLOCK_MANAGER_PORT.key, DEFAULT_BLOCKMANAGER_PORT) +val driverService = new ServiceBuilder() + .withNewMetadata() +.withName(resolvedServiceName) +.endMetadata() + .withNewSpec() +.withClusterIP("None") +.withSelector(driverLabels.asJava) +.addNewPort() + .withName(DRIVER_PORT_NAME) + .withPort(driverPort) + .withNewTargetPort(driverPort) + .endPort() +.addNewPort() + .withName(BLOCK_MANAGER_PORT_NAME) + .withPort(driverBlockManagerPort) + .withNewTargetPort(driverBlockManagerPort) + .endPort() +.endSpec() + .build() + +val namespace = submissionSparkConf.get(KUBERNETES_NAMESPACE) +val driverHostname = s"${driverService.getMetadata.getName}.$namespace.svc.cluster.local" +val resolvedSparkConf = driverSpec.driverSparkConf.clone() + .set(org.apache.spark.internal.config.DRIVER_HOST_ADDRESS, driverHostname) --- End diff -- Ah, yeah, done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19861#discussion_r154694199 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/ConfigSupport.java --- @@ -0,0 +1,37 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources.v2; + +import org.apache.spark.annotation.InterfaceStability; +import org.apache.spark.sql.sources.v2.reader.DataSourceV2Reader; + +import java.util.List; + +/** + * A mix-in interface for {@link DataSourceV2}. Data sources can implement this interface to + * propagate session configs with chosen key-prefixes to the particular data source. + */ +@InterfaceStability.Evolving +public interface ConfigSupport { + +/** + * Create a list of key-prefixes, all session configs that match at least one of the prefixes + * will be propagated to the data source options. + */ +List getConfigPrefixes(); --- End diff -- we need to think about the current use cases and validate this API. E.g. CSV data source and JSON data source both accept an option `columnNameOfCorruptRecord`, or session config `spark.sql.columnNameOfCorruptRecord`. We get the following information: 1. mostly session config maps to an existing option. 2. session configs are always prefixed with `spark.sql`, we should not ask the data source to always specify it. 3. do we really need to support more than one prefixes? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19874 cc @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19861#discussion_r154692737 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -184,9 +188,16 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { val cls = DataSource.lookupDataSource(source) if (classOf[DataSourceV2].isAssignableFrom(cls)) { - val options = new DataSourceV2Options(extraOptions.asJava) + val dataSource = cls.newInstance() + val options = dataSource match { +case cs: ConfigSupport => + val confs = withSessionConfig(cs, sparkSession.sessionState.conf) + new DataSourceV2Options((confs ++ extraOptions).asJava) --- End diff -- yea `extraOptions` needs higher priority. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19878#discussion_r154691937 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala --- @@ -389,13 +408,21 @@ abstract class HashExpression[E] extends Expression { input: String, result: String, fields: Array[StructField]): String = { -val hashes = fields.zipWithIndex.map { case (field, index) => +val fieldsHash = fields.zipWithIndex.map { case (field, index) => nullSafeElementHash(input, index.toString, field.nullable, field.dataType, result, ctx) } +val hashResultType = ctx.javaType(dataType) --- End diff -- `ctx` is only available inside `doGenCode` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19880: [SPARK-22626][SQL][FOLLOWUP] improve documentation and s...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19880 cc @cloud-fan @wangyum --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19880: [SPARK-22626][SQL][FOLLOWUP] improve documentation and s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19880 **[Test build #84434 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84434/testReport)** for PR 19880 at commit [`9be829d`](https://github.com/apache/spark/commit/9be829d208f7e2d6a88b9d2008fc04eec4a4ad8e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19880: [SPARK-22626][SQL][FOLLOWUP] improve documentatio...
GitHub user wzhfy opened a pull request: https://github.com/apache/spark/pull/19880 [SPARK-22626][SQL][FOLLOWUP] improve documentation and simplify test case ## What changes were proposed in this pull request? The reason why some Hive tables have `numRows` statistics is that, in Hive, when stats gathering is disabled, `numRows` is always zero after INSERT command: ``` hive> create table src (key int, value string) stored as orc; hive> desc formatted src; Table Parameters: COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"} numFiles0 numRows 0 rawDataSize 0 totalSize 0 transient_lastDdlTime 1512399590 hive> set hive.stats.autogather=false; hive> insert into src select 1, 'a'; hive> desc formatted src; Table Parameters: numFiles1 numRows 0 rawDataSize 0 totalSize 275 transient_lastDdlTime 1512399647 hive> insert into src select 1, 'b'; hive> desc formatted src; Table Parameters: numFiles2 numRows 0 rawDataSize 0 totalSize 550 transient_lastDdlTime 1512399687 ``` ## How was this patch tested? Modified existing test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wzhfy/spark doc_zero_rowCount Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19880.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19880 commit 9be829d208f7e2d6a88b9d2008fc04eec4a4ad8e Author: Zhenhua WangDate: 2017-12-04T15:53:49Z improve doc --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19861 **[Test build #84433 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84433/testReport)** for PR 19861 at commit [`ec5723c`](https://github.com/apache/spark/commit/ec5723c194474c85af0c4bd6265c6f7b0781881e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19877: [SPARK-22681]Accumulator should only be updated once for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19877 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19877: [SPARK-22681]Accumulator should only be updated once for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19877 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84427/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19877: [SPARK-22681]Accumulator should only be updated once for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19877 **[Test build #84427 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84427/testReport)** for PR 19877 at commit [`882126c`](https://github.com/apache/spark/commit/882126c2671e1733d572350af9749e9f8bdca1c2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19878 left only one very minor comment, it LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19861#discussion_r154687413 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -184,9 +188,16 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging { val cls = DataSource.lookupDataSource(source) if (classOf[DataSourceV2].isAssignableFrom(cls)) { - val options = new DataSourceV2Options(extraOptions.asJava) + val dataSource = cls.newInstance() + val options = dataSource match { +case cs: ConfigSupport => + val confs = withSessionConfig(cs, sparkSession.sessionState.conf) + new DataSourceV2Options((confs ++ extraOptions).asJava) --- End diff -- Good catch! Should the confs in the `extraOptions` have a higher priority? WDYT @cloud-fan ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19878#discussion_r154687417 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala --- @@ -389,13 +408,21 @@ abstract class HashExpression[E] extends Expression { input: String, result: String, fields: Array[StructField]): String = { -val hashes = fields.zipWithIndex.map { case (field, index) => +val fieldsHash = fields.zipWithIndex.map { case (field, index) => nullSafeElementHash(input, index.toString, field.nullable, field.dataType, result, ctx) } +val hashResultType = ctx.javaType(dataType) --- End diff -- nit: this is done also in line 281. Can we do this only once? maybe with a `lazy val`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19878#discussion_r154683889 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala --- @@ -730,23 +776,29 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] { input: String, result: String, fields: Array[StructField]): String = { -val localResult = ctx.freshName("localResult") val childResult = ctx.freshName("childResult") -fields.zipWithIndex.map { case (field, index) => +val fieldsHash = fields.zipWithIndex.map { case (field, index) => + val computeFieldHash = nullSafeElementHash( +input, index.toString, field.nullable, field.dataType, childResult, ctx) s""" - $childResult = 0; - ${nullSafeElementHash(input, index.toString, field.nullable, field.dataType, - childResult, ctx)} - $localResult = (31 * $localResult) + $childResult; - """ -}.mkString( --- End diff -- We forgot to split the code for computing hive hash of struct, it's fixed now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19878#discussion_r154683716 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala --- @@ -270,17 +270,36 @@ abstract class HashExpression[E] extends Expression { override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { ev.isNull = "false" -val childrenHash = ctx.splitExpressions(children.map { child => + +val childrenHash = children.map { child => val childGen = child.genCode(ctx) childGen.code + ctx.nullSafeExec(child.nullable, childGen.isNull) { computeHash(childGen.value, child.dataType, ev.value, ctx) } -}) +} + +val hashResultType = ctx.javaType(dataType) +val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) { --- End diff -- I think @kiszk is doing this --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19869 **[Test build #84432 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84432/testReport)** for PR 19869 at commit [`b1acd35`](https://github.com/apache/spark/commit/b1acd35d24ea449a97434b54759cf8ed441661f2). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19878#discussion_r154682872 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala --- @@ -270,17 +270,36 @@ abstract class HashExpression[E] extends Expression { override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { ev.isNull = "false" -val childrenHash = ctx.splitExpressions(children.map { child => + +val childrenHash = children.map { child => val childGen = child.genCode(ctx) childGen.code + ctx.nullSafeExec(child.nullable, childGen.isNull) { computeHash(childGen.value, child.dataType, ev.value, ctx) } -}) +} + +val hashResultType = ctx.javaType(dataType) +val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) { --- End diff -- This pattern appears many times in the code base, we may need to create a `ctx.splitExpressionsWithCurrentInput` for it later. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19869: [SPARK-22677][SQL] cleanup whole stage codegen fo...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19869#discussion_r154682180 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala --- @@ -596,7 +596,7 @@ case class HashAggregateExec( ctx.addMutableState(fastHashMapClassName, fastHashMapTerm, s"$fastHashMapTerm = new $fastHashMapClassName();") ctx.addMutableState( - classOf[java.util.Iterator[ColumnarRow]].getName, + s"java.util.Iterator<${classOf[ColumnarRow]}>", --- End diff -- damn... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19878 **[Test build #84431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84431/testReport)** for PR 19878 at commit [`0e9998e`](https://github.com/apache/spark/commit/0e9998e0704b54d8f1352a1936c9b6367ebee15e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19879: [SPARK-20706][SPARK-SHELL] Spark-shell not overriding me...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19879 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19865: [SPARK-22668][SQL] Do not pass global variables to argum...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19865 hopefully https://github.com/apache/spark/pull/19878 can fix the problem. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19879: [SPARK-20706][SPARK-SHELL] Spark-shell not overri...
GitHub user mpetruska opened a pull request: https://github.com/apache/spark/pull/19879 [SPARK-20706][SPARK-SHELL] Spark-shell not overriding method/variable definition ## What changes were proposed in this pull request? [SPARK-20706](https://issues.apache.org/jira/browse/SPARK-20706): Spark-shell not overriding method/variable definition This is a Scala repl bug ( [SI-9740](https://github.com/scala/bug/issues/9740) ), was fixed in version 2.11.9 ( [see the original PR](https://github.com/scala/scala/pull/5090) ) ## How was this patch tested? Added a new test case in `ReplSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mpetruska/spark SPARK-20706 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19879.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19879 commit 70c35fd4f7d7646586980fd6eac0d4039be1daa6 Author: Mark PetruskaDate: 2017-12-04T15:27:15Z fixes spark shell import issue (scala bug: SI-9740 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19878 cc @kiszk @mgaido91 @viirya @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/19878 [SPARK-22682][SQL] HashExpression does not need to create global variables ## What changes were proposed in this pull request? It turns out that `HashExpression` can pass around some values via parameter when splitting codes into methods, to save some global variable slots. This can also prevent a weird case that global variable appears in parameter list, which is discovered by https://github.com/apache/spark/pull/19865 ## How was this patch tested? existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark minor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19878.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19878 commit 0e9998e0704b54d8f1352a1936c9b6367ebee15e Author: Wenchen FanDate: 2017-12-04T15:24:46Z HashExpression does not need to create global variables --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date fun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19875 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date fun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19875 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84425/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date fun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19875 **[Test build #84425 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84425/testReport)** for PR 19875 at commit [`424e471`](https://github.com/apache/spark/commit/424e47175387e063a60fe06287f77703cf400045). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19873 **[Test build #84430 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84430/testReport)** for PR 19873 at commit [`9f5a0e4`](https://github.com/apache/spark/commit/9f5a0e458fa0cb42d6850e16d74994af1b1a3752). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19792 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84429/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19792 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19792 **[Test build #84429 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84429/testReport)** for PR 19792 at commit [`41766fa`](https://github.com/apache/spark/commit/41766fa875b987fecf910b7fa8bd9429e27ce88e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19789: [SPARK-22562][Streaming] CachedKafkaConsumer unsafe evic...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19789 I think we should wait for @zsxwing's feedback. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19792 **[Test build #84429 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84429/testReport)** for PR 19792 at commit [`41766fa`](https://github.com/apache/spark/commit/41766fa875b987fecf910b7fa8bd9429e27ce88e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...
Github user gberger commented on the issue: https://github.com/apache/spark/pull/19792 Fixed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19870: [SPARK-22665][SQL] Avoid repartitioning with empty list ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19870 **[Test build #84428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84428/testReport)** for PR 19870 at commit [`b4f985f`](https://github.com/apache/spark/commit/b4f985fc81f4664de538e7bcae0e134bce8ae735). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18424: [SPARK-17091] Add rule to convert IN predicate to equiva...
Github user ptkool commented on the issue: https://github.com/apache/spark/pull/18424 @a10y Yes, I'm still tracking this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date fun...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19875 yes @HyukjinKwon , you are 100% right, sorry for this error. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r154644084 --- Diff: python/pyspark/sql/udf.py --- @@ -56,6 +56,10 @@ def _create_udf(f, returnType, evalType): return udf_obj._wrapped() +class UDFColumn(Column): --- End diff -- BTW, what do you think about adding an attribute instead? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r154642230 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.python + +import java.io.File + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.{SparkEnv, TaskContext} +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, AttributeSet, Expression, JoinedRow, SortOrder, UnsafeProjection, UnsafeRow} +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning} +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.Utils + +case class AggregateInPandasExec( +groupingAttributes: Seq[Attribute], +func: Seq[Expression], +output: Seq[Attribute], +child: SparkPlan) + extends UnaryExecNode { + private val udfs = func.map(expr => expr.asInstanceOf[PythonUDF]) + + override def outputPartitioning: Partitioning = child.outputPartitioning + + override def producedAttributes: AttributeSet = AttributeSet(output) + + override def requiredChildDistribution: Seq[Distribution] = { +if (groupingAttributes.isEmpty) { + AllTuples :: Nil +} else { + ClusteredDistribution(groupingAttributes) :: Nil +} + } + + private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = { +udf.children match { + case Seq(u: PythonUDF) => +val (chained, children) = collectFunctions(u) +(ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children) + case children => +// There should not be any other UDFs, or the children can't be evaluated directly. +assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty)) +(ChainedPythonFunctions(Seq(udf.func)), udf.children) +} + } + + override def requiredChildOrdering: Seq[Seq[SortOrder]] = +Seq(groupingAttributes.map(SortOrder(_, Ascending))) + + override protected def doExecute(): RDD[InternalRow] = { +val inputRDD = child.execute() + +val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536) +val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true) +// val argOffsets = Array((0 until (child.output.length - groupingAttributes.length)).toArray) +val schema = StructType(child.schema.drop(groupingAttributes.length)) +val sessionLocalTimeZone = conf.sessionLocalTimeZone +val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone + +val (pyFuncs, inputs) = udfs.map(collectFunctions).unzip + +val allInputs = new ArrayBuffer[Expression] + +val argOffsets = inputs.map { input => + input.map { e => + allInputs += e --- End diff -- indentation nit --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r154642902 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/pythonLogicalOperators.scala --- @@ -38,3 +38,13 @@ case class FlatMapGroupsInPandas( */ override val producedAttributes = AttributeSet(output) } + +case class AggregateInPandas( +groupingAttributes: Seq[Attribute], +functionExprs: Seq[Expression], +output: Seq[Attribute], +child: LogicalPlan +) extends UnaryNode { --- End diff -- nit: ``` child: LogicalPlan) extends UnaryNode { ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r154644620 --- Diff: python/pyspark/sql/tests.py --- @@ -4016,6 +4016,89 @@ def test_unsupported_types(self): with self.assertRaisesRegexp(Exception, 'Unsupported data type'): df.groupby('id').apply(f).collect() +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed") +class GroupbyAggTests(ReusedSQLTestCase): +def assertFramesEqual(self, expected, result): +msg = ("DataFrames are not equal: " + + ("\n\nExpected:\n%s\n%s" % (expected, expected.dtypes)) + --- End diff -- indentation nit --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r154644235 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -437,6 +437,37 @@ class RelationalGroupedDataset protected[sql]( df.logicalPlan)) } + + private[sql] def aggInPandas(columns: Seq[Column]): DataFrame = { +val exprs = columns.map(column => column.expr.asInstanceOf[PythonUDF]) + +val groupingNamedExpressions = groupingExprs.map { + case ne: NamedExpression => ne + case other => Alias(other, other.toString)() +} + +val groupingAttributes = groupingNamedExpressions.map(_.toAttribute) + +val child = df.logicalPlan + +val childrenExpressions = exprs.flatMap(expr => + expr.children.map { + case ne: NamedExpression => ne + case other => Alias(other, other.toString)() --- End diff -- indentation nit --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r154644340 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala --- @@ -437,6 +437,37 @@ class RelationalGroupedDataset protected[sql]( df.logicalPlan)) } + + private[sql] def aggInPandas(columns: Seq[Column]): DataFrame = { +val exprs = columns.map(column => column.expr.asInstanceOf[PythonUDF]) + +val groupingNamedExpressions = groupingExprs.map { + case ne: NamedExpression => ne + case other => Alias(other, other.toString)() +} + +val groupingAttributes = groupingNamedExpressions.map(_.toAttribute) + +val child = df.logicalPlan + +val childrenExpressions = exprs.flatMap(expr => + expr.children.map { + case ne: NamedExpression => ne + case other => Alias(other, other.toString)() +}) + +val project = Project(groupingNamedExpressions ++ childrenExpressions, child) + +val udfOutputs = exprs.flatMap(expr => + Seq(AttributeReference(expr.name, expr.dataType)()) +) --- End diff -- I think this could be inlined. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #895: [SPARK-1940] Enabling rolling of executor logs, and automa...
Github user wbowditch commented on the issue: https://github.com/apache/spark/pull/895 Can these configuration additions be added to Spark Documentation (https://spark.apache.org/docs/latest/configuration.html) ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154640805 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -568,8 +574,12 @@ object DataSource extends Logging { "org.apache.spark.Logging") /** Given a provider name, look up the data source class definition. */ - def lookupDataSource(provider: String): Class[_] = { -val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) + def lookupDataSource(conf: SQLConf, provider: String): Class[_] = { --- End diff -- After more thinking, I think it don't worth to pass the whole SQLConf into this function, we just need to know whether `SQLConf.ORC_USE_NEW_VERSION` is enabled. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org