[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19861
  
**[Test build #84433 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84433/testReport)**
 for PR 19861 at commit 
[`ec5723c`](https://github.com/apache/spark/commit/ec5723c194474c85af0c4bd6265c6f7b0781881e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19865: [SPARK-22668][SQL] Do not pass global variables to argum...

2017-12-04 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19865
  
Thank you. I think so for this case.
In general, making `ev.value` a global variable, or passing a global variable to
`consume()`, may potentially cause this problem.
As @cloud-fan suggested, I identified these patterns and am fixing them.
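
(A minimal self-contained sketch of the hazard, with toy names rather than the actual codegen API: a consumer that reads a shared mutable field can observe a clobbered value, while a fresh local passed along explicitly cannot.)

```scala
object GlobalStateHazard {
  var globalValue: Int = 0  // analogue of a global `ev.value`

  def main(args: Array[String]): Unit = {
    // Each "operator" stores its result in the shared global and hands the
    // consumer a reference to it; by consumption time it has been clobbered.
    val viaGlobal = (1 to 3).map { row =>
      globalValue = row * 2
      () => globalValue
    }
    println(viaGlobal.map(_()))  // Vector(6, 6, 6): stale reads

    // Passing a fresh local to the consumer keeps each value intact.
    val viaLocal = (1 to 3).map { row =>
      val localValue = row * 2
      () => localValue
    }
    println(viaLocal.map(_()))   // Vector(2, 4, 6)
  }
}
```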


---




[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19878
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84431/
Test PASSed.


---




[GitHub] spark issue #19881: [SPARK-22683][CORE] Add tasksPerExecutorSlot parameter

2017-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19881
  
Please see JIRA. I don't think this is worth doing.


---




[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19878
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19878
  
**[Test build #84431 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84431/testReport)**
 for PR 19878 at commit 
[`0e9998e`](https://github.com/apache/spark/commit/0e9998e0704b54d8f1352a1936c9b6367ebee15e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19881: [SPARK-22683][CORE] Add tasksPerExecutorSlot parameter

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19881
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #19877: [SPARK-22681]Accumulator should only be updated o...

2017-12-04 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19877#discussion_r154729236
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -1832,6 +1832,27 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with TimeLi
 assertDataStructuresEmpty()
   }
 
+  test("accumulator not calculated for resubmitted task in result stage") {
+// just for register
--- End diff --

nit: unnecessary (and confusing?) comment.


---




[GitHub] spark pull request #19881: [SPARK-22683][CORE] Add tasksPerExecutorSlot para...

2017-12-04 Thread jcuquemelle
GitHub user jcuquemelle opened a pull request:

https://github.com/apache/spark/pull/19881

[SPARK-22683][CORE] Add tasksPerExecutorSlot parameter

## What changes were proposed in this pull request?

An executor has spark.executor.cores / spark.task.cpus task slots.
The current dynamic allocation policy allocates enough executors
for each task slot to execute a single task, which wastes resources when
tasks are small relative to the executor allocation overhead. Adding the
tasksPerExecutorSlot parameter makes it possible to specify how many tasks
a single slot should ideally execute, mitigating the overhead of executor
allocation.
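
A back-of-the-envelope sketch of the sizing rule (one reading of the description above, with illustrative names, not the actual patch):

```scala
// Target executor count for a backlog of pending tasks under the proposed
// tasksPerExecutorSlot knob; the current policy is the tasksPerSlot = 1 case.
def targetExecutors(
    pendingTasks: Int,
    executorCores: Int,  // spark.executor.cores
    taskCpus: Int,       // spark.task.cpus
    tasksPerSlot: Int    // proposed tasksPerExecutorSlot
  ): Int = {
  val slotsPerExecutor = executorCores / taskCpus
  math.ceil(pendingTasks.toDouble / (slotsPerExecutor * tasksPerSlot)).toInt
}

targetExecutors(1000, 4, 1, tasksPerSlot = 1)  // 250 executors (current policy)
targetExecutors(1000, 4, 1, tasksPerSlot = 2)  // 125 executors
```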

## How was this patch tested?
Unit tests, plus runs on various real workloads on a YARN cluster.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jcuquemelle/spark AddTaskPerExecutorSlot

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19881.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19881


commit 895882feebc53f44a70278e0b475b2fb937d331a
Author: Julien Cuquemelle 
Date:   2017-11-30T16:28:06Z

[SPARK-22683][CORE] Allow tuning the number of dynamically allocated 
executors

An executor has spark.executor.cores / spark.task.cpus task slots.

The current dynamic allocation policy allocates enough executors
for each task slot to execute a single task, which wastes resources when
tasks are small relative to the executor allocation overhead. Adding the
tasksPerExecutorSlot parameter makes it possible to specify how many tasks
a single slot should ideally execute, mitigating the overhead of executor
allocation.

commit fce3b976d0b22c4d01ef4fdd5339835bc6d6fcb1
Author: Julien Cuquemelle 
Date:   2017-11-30T16:28:06Z

[SPARK-22683][DOC] document tasksPerExecutorSlot parameter




---




[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19717
  
**[Test build #84438 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84438/testReport)**
 for PR 19717 at commit 
[`cfcf2a7`](https://github.com/apache/spark/commit/cfcf2a7622f2bd4a21ea2d5197ec652363645aca).


---




[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19717
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19717
  
**[Test build #84437 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84437/testReport)**
 for PR 19717 at commit 
[`05f528a`](https://github.com/apache/spark/commit/05f528a0178fa0f567b8904780d837a4f95d888b).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19717
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84437/
Test FAILed.


---




[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19717
  
**[Test build #84437 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84437/testReport)**
 for PR 19717 at commit 
[`05f528a`](https://github.com/apache/spark/commit/05f528a0178fa0f567b8904780d837a4f95d888b).


---




[GitHub] spark issue #19880: [SPARK-22626][SQL][FOLLOWUP] improve documentation and s...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19880
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84434/
Test PASSed.


---




[GitHub] spark issue #19880: [SPARK-22626][SQL][FOLLOWUP] improve documentation and s...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19880
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19880: [SPARK-22626][SQL][FOLLOWUP] improve documentation and s...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19880
  
**[Test build #84434 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84434/testReport)**
 for PR 19880 at commit 
[`9be829d`](https://github.com/apache/spark/commit/9be829d208f7e2d6a88b9d2008fc04eec4a4ad8e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...

2017-12-04 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19717#discussion_r154726920
  
--- Diff: 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile 
---
@@ -0,0 +1,43 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FROM openjdk:8-alpine
--- End diff --

It seems openjdk is under GPLv2. Should it be listed following `This product 
optionally depends on 'Webbit'` in a similar format, with the license files 
included under `license`?


---




[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...

2017-12-04 Thread rezasafi
Github user rezasafi commented on the issue:

https://github.com/apache/spark/pull/19848
  
Thank you very much @vanzin, @mridulm and @jiangxb1987. I really appreciate 
it. I will create a PR for branch 2.2 ASAP.


---




[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...

2017-12-04 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19717#discussion_r154724747
  
--- Diff: 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile 
---
@@ -0,0 +1,43 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FROM openjdk:8-alpine
+
+# If this docker file is being used in the context of building your images 
from a Spark distribution, the docker build
+# command should be invoked from the top level directory of the Spark 
distribution. E.g.:
+# docker build -t spark-base:latest -f dockerfiles/spark-base/Dockerfile .
+
+RUN apk upgrade --no-cache && \
--- End diff --

Done.


---




[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...

2017-12-04 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19717#discussion_r154721032
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala
 ---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.k8s.submit
+
+import java.util.concurrent.{CountDownLatch, TimeUnit}
+
+import scala.collection.JavaConverters._
+
+import io.fabric8.kubernetes.api.model.{ContainerStateRunning, 
ContainerStateTerminated, ContainerStateWaiting, ContainerStatus, Pod, Time}
+import io.fabric8.kubernetes.client.{KubernetesClientException, Watcher}
+import io.fabric8.kubernetes.client.Watcher.Action
+
+import org.apache.spark.SparkException
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.ThreadUtils
+
+private[k8s] trait LoggingPodStatusWatcher extends Watcher[Pod] {
+  def awaitCompletion(): Unit
+}
+
+/**
+ * A monitor for the running Kubernetes pod of a Spark application. Status 
logging occurs on
+ * every state change and also at an interval for liveness.
+ *
+ * @param appId application ID.
+ * @param maybeLoggingInterval ms between each state request. If provided, 
must be a positive
+ * number.
+ */
+private[k8s] class LoggingPodStatusWatcherImpl(
+appId: String,
+maybeLoggingInterval: Option[Long])
+  extends LoggingPodStatusWatcher with Logging {
+
+  private val podCompletedFuture = new CountDownLatch(1)
+  // start timer for periodic logging
+  private val scheduler =
+
ThreadUtils.newDaemonSingleThreadScheduledExecutor("logging-pod-status-watcher")
+  private val logRunnable: Runnable = new Runnable {
+override def run() = logShortStatus()
+  }
+
+  private var pod = Option.empty[Pod]
+
+  private def phase: String = 
pod.map(_.getStatus.getPhase).getOrElse("unknown")
+
+  def start(): Unit = {
+maybeLoggingInterval.foreach { interval =>
+  require(interval > 0, s"Logging interval must be a positive time 
value, got: $interval ms.")
+  scheduler.scheduleAtFixedRate(logRunnable, 0, interval, 
TimeUnit.MILLISECONDS)
+}
+  }
+
+  override def eventReceived(action: Action, pod: Pod): Unit = {
+this.pod = Option(pod)
+action match {
+  case Action.DELETED | Action.ERROR =>
+closeWatch()
+
+  case _ =>
+logLongStatus()
+if (hasCompleted()) {
+  closeWatch()
+}
+}
+  }
+
+  override def onClose(e: KubernetesClientException): Unit = {
+logDebug(s"Stopping watching application $appId with last-observed 
phase $phase")
+closeWatch()
+  }
+
+  private def logShortStatus() = {
+logInfo(s"Application status for $appId (phase: $phase)")
+  }
+
+  private def logLongStatus() = {
+logInfo("State changed, new state: " + 
pod.map(formatPodState).getOrElse("unknown"))
+  }
+
+  private def hasCompleted(): Boolean = {
+phase == "Succeeded" || phase == "Failed"
+  }
+
+  private def closeWatch(): Unit = {
+podCompletedFuture.countDown()
+scheduler.shutdown()
+  }
+
+  private def formatPodState(pod: Pod): String = {
+// TODO include specific container state
--- End diff --

Actually it already includes the `containerStatuses`. Removed this TODO.


---




[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...

2017-12-04 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/19840
  
I'm trying to understand what 
https://github.com/apache/spark/blob/master/python/pyspark/context.py#L191 
is really achieving. It seems pretty broken to me, and it feels like the whole 
`pythonExec` tracking in the various places should be removed.

It causes this problem because it forces the executor to use the driver's 
python even if it's been set to a different path by the user.

It uses `python` instead of `sys.executable` as the default value.

And it ignores the `spark.pyspark.python` config value if it's set.

Instead, shouldn't the logic at 
https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L304
 be used in `PythonRunner` (except for the driver python config) to find out 
the executor's python to use?
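
(As a sketch, that resolution order would look something like this; a paraphrase under stand-in names, not the actual SparkSubmitCommandBuilder code.)

```scala
// Resolve the executor's python binary: explicit config first, then the
// environment, then a bare default.
def resolveExecutorPython(
    sparkConf: Map[String, String],
    env: Map[String, String]): String =
  sparkConf.get("spark.pyspark.python")  // ignored by the context.py code above
    .orElse(env.get("PYSPARK_PYTHON"))
    .getOrElse("python")                 // context.py's default, not sys.executable
```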


---




[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...

2017-12-04 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19717#discussion_r154719522
  
--- Diff: 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark-base/Dockerfile 
---
@@ -0,0 +1,43 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FROM openjdk:8-alpine
+
+# If this docker file is being used in the context of building your images 
from a Spark distribution, the docker build
+# command should be invoked from the top level directory of the Spark 
distribution. E.g.:
+# docker build -t spark-base:latest -f dockerfiles/spark-base/Dockerfile .
--- End diff --

Done.


---




[GitHub] spark issue #19870: [SPARK-22665][SQL] Avoid repartitioning with empty list ...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19870
  
To ensure no external behavior change, we should use RoundRobinPartitioning 
when the expression list is empty, instead of throwing an exception. That would 
also be consistent with the existing behavior of `df.repartition(numPartitions)`.
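
(A self-contained sketch of that fallback, with toy types standing in for catalyst's partitioning classes, not the merged change.)

```scala
sealed trait Partitioning
case class RoundRobinPartitioning(numPartitions: Int) extends Partitioning
case class HashPartitioning(exprs: Seq[String], numPartitions: Int) extends Partitioning

// An empty expression list degenerates to round-robin instead of throwing,
// so df.repartition(numPartitions) keeps its existing behavior.
def partitioningFor(exprs: Seq[String], numPartitions: Int): Partitioning =
  if (exprs.isEmpty) RoundRobinPartitioning(numPartitions)
  else HashPartitioning(exprs, numPartitions)
```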


---




[GitHub] spark pull request #19848: [SPARK-22162] Executors and the driver should use...

2017-12-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19848


---




[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...

2017-12-04 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/19848
  
(There was a conflict in 2.2, open a new PR if you want it there.)


---




[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...

2017-12-04 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/19848
  
Merging to master / 2.2.


---




[GitHub] spark issue #19880: [SPARK-22626][SQL][FOLLOWUP] improve documentation and s...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19880
  
LGTM


---




[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19873
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19873
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84430/
Test PASSed.


---




[GitHub] spark issue #19879: [SPARK-20706][SPARK-SHELL] Spark-shell not overriding me...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19879
  
**[Test build #4003 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4003/testReport)**
 for PR 19879 at commit 
[`f265daa`](https://github.com/apache/spark/commit/f265daa68ba38f00aa274d1d431fe74e974f).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19873
  
**[Test build #84430 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84430/testReport)**
 for PR 19873 at commit 
[`9f5a0e4`](https://github.com/apache/spark/commit/9f5a0e458fa0cb42d6850e16d74994af1b1a3752).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class AnalysisBarrier(child: LogicalPlan) extends LeafNode `


---




[GitHub] spark issue #19879: [SPARK-20706][SPARK-SHELL] Spark-shell not overriding me...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19879
  
**[Test build #4003 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4003/testReport)**
 for PR 19879 at commit 
[`f265daa`](https://github.com/apache/spark/commit/f265daa68ba38f00aa274d1d431fe74e974f).


---




[GitHub] spark issue #19879: [SPARK-20706][SPARK-SHELL] Spark-shell not overriding me...

2017-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19879
  
OK, so another back-port to fix the 2.11 REPL, which should be resolved in 
2.12 anyway?


---




[GitHub] spark pull request #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated D...

2017-12-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19875


---




[GitHub] spark issue #19854: SPARK-22660:Use position() and limit() to fix ambiguity ...

2017-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19854
  
Ping @kellyzly 


---




[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19871
  
**[Test build #84436 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84436/testReport)**
 for PR 19871 at commit 
[`2393e1d`](https://github.com/apache/spark/commit/2393e1de729441b27bc5cdd83804071f14d77a4b).


---




[GitHub] spark issue #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date fun...

2017-12-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19875
  
Merged to master


---




[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19871
  
@HyukjinKwon, to enable the following test, I'm restructuring the ORC tests 
now. I'll make a PR for that today.
```
ignore("LZO compression options for writing to an ORC file not supported in 
Hive 1.2.1") {
```



---




[GitHub] spark issue #10621: [SPARK-12617][PySpark]Move Py4jCallbackConnectionCleaner...

2017-12-04 Thread cpalomaressbd
Github user cpalomaressbd commented on the issue:

https://github.com/apache/spark/pull/10621
  
Hi,

One question, and maybe it's a stupid question, I don't know, so I apologize in 
advance. We are working with Hortonworks. At the beginning of the project we 
were on HDP 2.4.0, which ships Spark 1.6.0 and has a bug in PySpark, the bug 
which you say was resolved in:

**zsxwing commented on 22 Feb 2016**

Our first solution was to upgrade to HDP 2.4.3, which ships Spark 1.6.2 and 
which theoretically should include the patch for this problem, but to our 
surprise we hit the same bug, without the patch.

OK, maybe the problem is with Hortonworks and I should write in another forum, 
but I also went to the official Spark website:

https://spark.apache.org/downloads.html

If you download the official Spark 1.6.2 release, which was released June 25, 
2016, and look at the pyspark file, the bug is still present in that 
installation. I am sorry, but I don't understand, and we are desperate about 
this situation.

Should I apply the patch to the source and recompile the code? How can I do 
that step by step?

Thanks in advance.




---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154709256
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +574,12 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
--- End diff --

So, are you suggesting `lookupDataSource(provider, useNewOrc=true)`, 
@jiangxb1987 ?
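
(The two signature shapes being weighed, sketched with placeholder bodies; neither is necessarily the merged one.)

```scala
// Option A: pass the whole session conf through and read the flag inside.
def lookupDataSource(conf: SQLConf, provider: String): Class[_] = ???

// Option B: resolve the flag at the call site and pass only what is needed.
def lookupDataSource(provider: String, useNewOrc: Boolean): Class[_] = ???
```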


---




[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19869
  
**[Test build #84435 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84435/testReport)**
 for PR 19869 at commit 
[`f0320d5`](https://github.com/apache/spark/commit/f0320d5f7b3406334a5250ecc2d9c5248142f34d).


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154707683
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -85,7 +87,8 @@ case class DataSource(
 
   case class SourceInfo(name: String, schema: StructType, 
partitionColumns: Seq[String])
 
-  lazy val providingClass: Class[_] = 
DataSource.lookupDataSource(className)
+  lazy val providingClass: Class[_] =
+DataSource.lookupDataSource(sparkSession.sessionState.conf, className)
--- End diff --

Sure!


---




[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...

2017-12-04 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19717#discussion_r154707798
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
 ---
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.k8s
+
+import java.util.concurrent.TimeUnit
+
+import org.apache.spark.{SPARK_VERSION => sparkVersion}
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config.ConfigBuilder
+import org.apache.spark.network.util.ByteUnit
+
+private[spark] object Config extends Logging {
+
+  val KUBERNETES_NAMESPACE =
+ConfigBuilder("spark.kubernetes.namespace")
+  .doc("The namespace that will be used for running the driver and 
executor pods. When using " +
+"spark-submit in cluster mode, this can also be passed to 
spark-submit via the " +
+"--kubernetes-namespace command line argument.")
+  .stringConf
+  .createWithDefault("default")
+
+  val DRIVER_DOCKER_IMAGE =
+ConfigBuilder("spark.kubernetes.driver.docker.image")
+  .doc("Docker image to use for the driver. Specify this using the 
standard Docker tag format.")
+  .stringConf
+  .createWithDefault(s"spark-driver:$sparkVersion")
+
+  val EXECUTOR_DOCKER_IMAGE =
+ConfigBuilder("spark.kubernetes.executor.docker.image")
+  .doc("Docker image to use for the executors. Specify this using the 
standard Docker tag " +
+"format.")
+  .stringConf
+  .createWithDefault(s"spark-executor:$sparkVersion")
+
+  val DOCKER_IMAGE_PULL_POLICY =
+ConfigBuilder("spark.kubernetes.docker.image.pullPolicy")
+  .doc("Docker image pull policy when pulling any docker image in 
Kubernetes integration")
+  .stringConf
+  .createWithDefault("IfNotPresent")
+
+
+  val KUBERNETES_AUTH_DRIVER_CONF_PREFIX =
+  "spark.kubernetes.authenticate.driver"
+  val KUBERNETES_AUTH_DRIVER_MOUNTED_CONF_PREFIX =
+  "spark.kubernetes.authenticate.driver.mounted"
+  val OAUTH_TOKEN_CONF_SUFFIX = "oauthToken"
+  val OAUTH_TOKEN_FILE_CONF_SUFFIX = "oauthTokenFile"
+  val CLIENT_KEY_FILE_CONF_SUFFIX = "clientKeyFile"
+  val CLIENT_CERT_FILE_CONF_SUFFIX = "clientCertFile"
+  val CA_CERT_FILE_CONF_SUFFIX = "caCertFile"
+
+  val KUBERNETES_SERVICE_ACCOUNT_NAME =
+
ConfigBuilder(s"$KUBERNETES_AUTH_DRIVER_CONF_PREFIX.serviceAccountName")
+  .doc("Service account that is used when running the driver pod. The 
driver pod uses " +
+"this service account when requesting executor pods from the API 
server. If specific " +
+"credentials are given for the driver pod to use, the driver will 
favor " +
+"using those credentials instead.")
+  .stringConf
+  .createOptional
+
+  val KUBERNETES_DRIVER_LIMIT_CORES =
+ConfigBuilder("spark.kubernetes.driver.limit.cores")
+  .doc("Specify the hard cpu limit for the driver pod")
+  .stringConf
+  .createOptional
+
+  val KUBERNETES_EXECUTOR_LIMIT_CORES =
+ConfigBuilder("spark.kubernetes.executor.limit.cores")
+  .doc("Specify the hard cpu limit for a single executor pod")
+  .stringConf
+  .createOptional
+
+  val KUBERNETES_DRIVER_MEMORY_OVERHEAD =
+ConfigBuilder("spark.kubernetes.driver.memoryOverhead")
+  .doc("The amount of off-heap memory (in megabytes) to be allocated 
for the driver and the " +
+"driver submission server. This is memory that accounts for things 
like VM overheads, " +
+"interned strings, other native overheads, etc. This tends to grow 
with the driver's " +
+"memory size (typically 6-10%).")
+  .bytesConf(ByteUnit.MiB)
+  .createOptional
+
+  // Note that while we set a default for this when we start up the
+  // scheduler, the specific default value is 

[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154707557
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -363,6 +363,14 @@ object SQLConf {
 .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
 .createWithDefault("snappy")
 
+  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
+.doc("When true, use new OrcFileFormat in sql/core module instead of 
the one in sql/hive. " +
+  "Since new OrcFileFormat uses Apache ORC library instead of ORC 
library Hive 1.2.1, it is " +
+  "more stable and faster.")
--- End diff --

Thank you for the review, @HyukjinKwon. 
Do you mean `Apache ORC library is more stable, but new OrcFileFormat is 
not` because it is newly introduced?
Actually, that's true from Spark's viewpoint, but the new OrcFileFormat 
contains more bug fixes and new features too. If you don't mind, I'd like to 
keep this. :)


---




[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...

2017-12-04 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19717#discussion_r154707001
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
 ---
@@ -119,5 +139,60 @@ private[spark] object Config extends Logging {
 "must be a positive integer")
   .createWithDefault(10)
 
+  val WAIT_FOR_APP_COMPLETION =
+ConfigBuilder("spark.kubernetes.submission.waitAppCompletion")
+  .doc("In cluster mode, whether to wait for the application to finish 
before exiting the " +
+"launcher process.")
+  .booleanConf
+  .createWithDefault(true)
+
+  val REPORT_INTERVAL =
+ConfigBuilder("spark.kubernetes.report.interval")
+  .doc("Interval between reports of the current app status in cluster 
mode.")
+  .timeConf(TimeUnit.MILLISECONDS)
+  .createWithDefaultString("1s")
+
+  private[spark] val JARS_DOWNLOAD_LOCATION =
+ConfigBuilder("spark.kubernetes.mountDependencies.jarsDownloadDir")
+  .doc("Location to download jars to in the driver and executors. When 
using" +
+" spark-submit, this directory must be empty and will be mounted 
as an empty directory" +
+" volume on the driver and executor pod.")
+  .stringConf
+  .createWithDefault("/var/spark-data/spark-jars")
+
+  private[spark] val FILES_DOWNLOAD_LOCATION =
+ConfigBuilder("spark.kubernetes.mountDependencies.filesDownloadDir")
+  .doc("Location to download files to in the driver and executors. 
When using" +
+" spark-submit, this directory must be empty and will be mounted 
as an empty directory" +
+" volume on the driver and executor pods.")
+  .stringConf
+  .createWithDefault("/var/spark-data/spark-files")
+
+  val KUBERNETES_AUTH_SUBMISSION_CONF_PREFIX =
+"spark.kubernetes.authenticate.submission"
+
   val KUBERNETES_NODE_SELECTOR_PREFIX = "spark.kubernetes.node.selector."
+
+  val KUBERNETES_DRIVER_LABEL_PREFIX = "spark.kubernetes.driver.label."
+  val KUBERNETES_DRIVER_ANNOTATION_PREFIX = 
"spark.kubernetes.driver.annotation."
+
+  val KUBERNETES_EXECUTOR_LABEL_PREFIX = "spark.kubernetes.executor.label."
+  val KUBERNETES_EXECUTOR_ANNOTATION_PREFIX = 
"spark.kubernetes.executor.annotation."
+
+  val KUBERNETES_DRIVER_ENV_KEY = "spark.kubernetes.driverEnv."
+
+  def getK8sMasterUrl(rawMasterString: String): String = {
+require(rawMasterString.startsWith("k8s://"),
+  "Master URL should start with k8s:// in Kubernetes mode.")
+val masterWithoutK8sPrefix = rawMasterString.substring("k8s://".length)
+if (masterWithoutK8sPrefix.startsWith("http://")
--- End diff --

Done.


---




[GitHub] spark pull request #19870: [SPARK-22665][SQL] Avoid repartitioning with empt...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19870#discussion_r154705369
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -838,6 +838,8 @@ case class RepartitionByExpression(
 numPartitions: Int) extends RepartitionOperation {
 
   require(numPartitions > 0, s"Number of partitions ($numPartitions) must 
be positive.")
+  require(partitionExpressions.nonEmpty, s"${getClass.getSimpleName} 
requires a non empty set of " +
--- End diff --

what was the behavior if `partitionExpressions` is empty?


---




[GitHub] spark issue #19870: [SPARK-22665][SQL] Avoid repartitioning with empty list ...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19870
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19870: [SPARK-22665][SQL] Avoid repartitioning with empty list ...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19870
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84428/
Test PASSed.


---




[GitHub] spark issue #19870: [SPARK-22665][SQL] Avoid repartitioning with empty list ...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19870
  
**[Test build #84428 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84428/testReport)**
 for PR 19870 at commit 
[`b4f985f`](https://github.com/apache/spark/commit/b4f985fc81f4664de538e7bcae0e134bce8ae735).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19869
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19869
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84432/
Test FAILed.


---




[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19869
  
**[Test build #84432 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84432/testReport)**
 for PR 19869 at commit 
[`b1acd35`](https://github.com/apache/spark/commit/b1acd35d24ea449a97434b54759cf8ed441661f2).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...

2017-12-04 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19717#discussion_r154694277
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/DriverServiceBootstrapStep.scala
 ---
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.k8s.submit.steps
+
+import scala.collection.JavaConverters._
+
+import io.fabric8.kubernetes.api.model.ServiceBuilder
+
+import org.apache.spark.SparkConf
+import org.apache.spark.deploy.k8s.Config._
+import org.apache.spark.deploy.k8s.Constants._
+import org.apache.spark.deploy.k8s.submit.KubernetesDriverSpec
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Clock
+
+/**
+ * Allows the driver to be reachable by executor pods through a headless 
service. The service's
+ * ports should correspond to the ports that the executor will reach the 
pod at for RPC.
+ */
+private[spark] class DriverServiceBootstrapStep(
+kubernetesResourceNamePrefix: String,
+driverLabels: Map[String, String],
+submissionSparkConf: SparkConf,
+clock: Clock) extends DriverConfigurationStep with Logging {
+  import DriverServiceBootstrapStep._
+
+  override def configureDriver(driverSpec: KubernetesDriverSpec): 
KubernetesDriverSpec = {
+require(submissionSparkConf.getOption(DRIVER_BIND_ADDRESS_KEY).isEmpty,
+  s"$DRIVER_BIND_ADDRESS_KEY is not supported in Kubernetes mode, as 
the driver's bind " +
+  "address is managed and set to the driver pod's IP address.")
+require(submissionSparkConf.getOption(DRIVER_HOST_KEY).isEmpty,
+  s"$DRIVER_HOST_KEY is not supported in Kubernetes mode, as the 
driver's hostname will be " +
+  "managed via a Kubernetes service.")
+
+val preferredServiceName = 
s"$kubernetesResourceNamePrefix$DRIVER_SVC_POSTFIX"
+val resolvedServiceName = if (preferredServiceName.length <= 
MAX_SERVICE_NAME_LENGTH) {
+  preferredServiceName
+} else {
+  val randomServiceId = clock.getTimeMillis()
+  val shorterServiceName = s"spark-$randomServiceId$DRIVER_SVC_POSTFIX"
+  logWarning(s"Driver's hostname would preferably be 
$preferredServiceName, but this is " +
+s"too long (must be <= $MAX_SERVICE_NAME_LENGTH characters). 
Falling back to use " +
+s"$shorterServiceName as the driver service's name.")
+  shorterServiceName
+}
+
+val driverPort = submissionSparkConf.getInt("spark.driver.port", 
DEFAULT_DRIVER_PORT)
+val driverBlockManagerPort = submissionSparkConf.getInt(
+org.apache.spark.internal.config.DRIVER_BLOCK_MANAGER_PORT.key, 
DEFAULT_BLOCKMANAGER_PORT)
+val driverService = new ServiceBuilder()
+  .withNewMetadata()
+.withName(resolvedServiceName)
+.endMetadata()
+  .withNewSpec()
+.withClusterIP("None")
+.withSelector(driverLabels.asJava)
+.addNewPort()
+  .withName(DRIVER_PORT_NAME)
+  .withPort(driverPort)
+  .withNewTargetPort(driverPort)
+  .endPort()
+.addNewPort()
+  .withName(BLOCK_MANAGER_PORT_NAME)
+  .withPort(driverBlockManagerPort)
+  .withNewTargetPort(driverBlockManagerPort)
+  .endPort()
+.endSpec()
+  .build()
+
+val namespace = submissionSparkConf.get(KUBERNETES_NAMESPACE)
+val driverHostname = 
s"${driverService.getMetadata.getName}.$namespace.svc.cluster.local"
+val resolvedSparkConf = driverSpec.driverSparkConf.clone()
+  .set(org.apache.spark.internal.config.DRIVER_HOST_ADDRESS, 
driverHostname)
--- End diff --

Ah, yeah, done.


---




[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19861#discussion_r154694199
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/ConfigSupport.java ---
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2;
+
+import org.apache.spark.annotation.InterfaceStability;
+import org.apache.spark.sql.sources.v2.reader.DataSourceV2Reader;
+
+import java.util.List;
+
+/**
+ * A mix-in interface for {@link DataSourceV2}. Data sources can implement 
this interface to
+ * propagate session configs with chosen key-prefixes to the particular 
data source.
+ */
+@InterfaceStability.Evolving
+public interface ConfigSupport {
+
+/**
+ * Create a list of key-prefixes, all session configs that match at 
least one of the prefixes
+ * will be propagated to the data source options.
+ */
+List<String> getConfigPrefixes();
--- End diff --

we need to think about the current use cases and validate this API. E.g. the 
CSV and JSON data sources both accept an option `columnNameOfCorruptRecord`, 
or the session config `spark.sql.columnNameOfCorruptRecord`. We get the 
following information:

1. Mostly, a session config maps to an existing option.
2. Session configs are always prefixed with `spark.sql`; we should not ask 
the data source to always specify it.
3. Do we really need to support more than one prefix?
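
(For concreteness, the propagation under discussion amounts to something like the following sketch; the helper name mirrors `withSessionConfig` from the diff quoted elsewhere in this thread, but the body is illustrative, not the final API.)

```scala
// Keep only the session configs whose keys match a prefix the source asked
// for; those become data source options.
def withSessionConfig(
    prefixes: Seq[String],
    sessionConfs: Map[String, String]): Map[String, String] =
  sessionConfs.filter { case (key, _) => prefixes.exists(key.startsWith) }

withSessionConfig(
  prefixes = Seq("spark.sql.columnNameOfCorruptRecord"),
  sessionConfs = Map(
    "spark.sql.columnNameOfCorruptRecord" -> "_corrupt_record",
    "spark.sql.shuffle.partitions" -> "200"))
// Map("spark.sql.columnNameOfCorruptRecord" -> "_corrupt_record")
```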


---




[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19874
  
cc @cloud-fan 


---




[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19861#discussion_r154692737
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -184,9 +188,16 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
 
 val cls = DataSource.lookupDataSource(source)
 if (classOf[DataSourceV2].isAssignableFrom(cls)) {
-  val options = new DataSourceV2Options(extraOptions.asJava)
+  val dataSource = cls.newInstance()
+  val options = dataSource match {
+case cs: ConfigSupport =>
+  val confs = withSessionConfig(cs, sparkSession.sessionState.conf)
+  new DataSourceV2Options((confs ++ extraOptions).asJava)
--- End diff --

yea `extraOptions` needs higher priority.
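
(An illustration of the priority being agreed on, using plain Scala map semantics rather than the patch itself: the user-supplied options are merged last so they win.)

```scala
val sessionConfs = Map("columnNameOfCorruptRecord" -> "fromSessionConf")
val extraOptions = Map("columnNameOfCorruptRecord" -> "fromUserOption")

// `++` is right-biased, so the user's explicit option overrides the
// propagated session config.
val merged = sessionConfs ++ extraOptions
merged("columnNameOfCorruptRecord")  // "fromUserOption"
```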


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19878#discussion_r154691937
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
 ---
@@ -389,13 +408,21 @@ abstract class HashExpression[E] extends Expression {
   input: String,
   result: String,
   fields: Array[StructField]): String = {
-val hashes = fields.zipWithIndex.map { case (field, index) =>
+val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
   nullSafeElementHash(input, index.toString, field.nullable, 
field.dataType, result, ctx)
 }
+val hashResultType = ctx.javaType(dataType)
--- End diff --

`ctx` is only available inside `doGenCode`
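
(A self-contained toy of the constraint, with stand-in types rather than the real catalyst API: anything that needs `ctx` has to receive it as an argument, because it only exists inside `doGenCode`.)

```scala
class CodegenContext {
  def javaType(dataType: String): String = dataType  // toy stand-in
}

abstract class ToyExpression {
  // `ctx` exists only for the duration of this call; it cannot back a field
  // initialized at construction time.
  def doGenCode(ctx: CodegenContext): String

  // Helpers that need it must be handed the context explicitly.
  protected def hashResultType(ctx: CodegenContext, dataType: String): String =
    ctx.javaType(dataType)
}
```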


---




[GitHub] spark issue #19880: [SPARK-22626][SQL][FOLLOWUP] improve documentation and s...

2017-12-04 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19880
  
cc @cloud-fan @wangyum 


---




[GitHub] spark issue #19880: [SPARK-22626][SQL][FOLLOWUP] improve documentation and s...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19880
  
**[Test build #84434 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84434/testReport)**
 for PR 19880 at commit 
[`9be829d`](https://github.com/apache/spark/commit/9be829d208f7e2d6a88b9d2008fc04eec4a4ad8e).


---




[GitHub] spark pull request #19880: [SPARK-22626][SQL][FOLLOWUP] improve documentatio...

2017-12-04 Thread wzhfy
GitHub user wzhfy opened a pull request:

https://github.com/apache/spark/pull/19880

[SPARK-22626][SQL][FOLLOWUP] improve documentation and simplify test case

## What changes were proposed in this pull request?

The reason why some Hive tables have a `numRows` statistic of zero is that, in 
Hive, when stats gathering is disabled, `numRows` stays zero after an INSERT command:
```
hive> create table src (key int, value string) stored as orc;
hive> desc formatted src;
Table Parameters:
    COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
    numFiles                0
    numRows                 0
    rawDataSize             0
    totalSize               0
    transient_lastDdlTime   1512399590

hive> set hive.stats.autogather=false;
hive> insert into src select 1, 'a';
hive> desc formatted src;
Table Parameters:
    numFiles                1
    numRows                 0
    rawDataSize             0
    totalSize               275
    transient_lastDdlTime   1512399647

hive> insert into src select 1, 'b';
hive> desc formatted src;
Table Parameters:
    numFiles                2
    numRows                 0
    rawDataSize             0
    totalSize               550
    transient_lastDdlTime   1512399687
```
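
For completeness, the same parameters can be inspected from the Spark side; a minimal sketch, assuming a Hive-enabled session and the `src` table above:

```
// Sketch: inspect the Hive table parameters from Spark.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("numRows-check")
  .enableHiveSupport()
  .getOrCreate()

// DESC FORMATTED surfaces the same numFiles/numRows/totalSize parameters.
spark.sql("DESC FORMATTED src").show(100, truncate = false)
```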

## How was this patch tested?

Modified existing test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wzhfy/spark doc_zero_rowCount

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19880.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19880


commit 9be829d208f7e2d6a88b9d2008fc04eec4a4ad8e
Author: Zhenhua Wang 
Date:   2017-12-04T15:53:49Z

improve doc




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19861: [SPARK-22387][SQL] Propagate session configs to data sou...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19861
  
**[Test build #84433 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84433/testReport)**
 for PR 19861 at commit 
[`ec5723c`](https://github.com/apache/spark/commit/ec5723c194474c85af0c4bd6265c6f7b0781881e).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19877: [SPARK-22681]Accumulator should only be updated once for...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19877
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19877: [SPARK-22681]Accumulator should only be updated once for...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19877
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84427/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19877: [SPARK-22681]Accumulator should only be updated once for...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19877
  
**[Test build #84427 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84427/testReport)**
 for PR 19877 at commit 
[`882126c`](https://github.com/apache/spark/commit/882126c2671e1733d572350af9749e9f8bdca1c2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...

2017-12-04 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/19878
  
I left only one very minor comment; LGTM.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19861: [SPARK-22387][SQL] Propagate session configs to d...

2017-12-04 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19861#discussion_r154687413
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -184,9 +188,16 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
 
 val cls = DataSource.lookupDataSource(source)
 if (classOf[DataSourceV2].isAssignableFrom(cls)) {
-  val options = new DataSourceV2Options(extraOptions.asJava)
+  val dataSource = cls.newInstance()
+  val options = dataSource match {
+case cs: ConfigSupport =>
+  val confs = withSessionConfig(cs, sparkSession.sessionState.conf)
+  new DataSourceV2Options((confs ++ extraOptions).asJava)
--- End diff --

Good catch! Should the confs in `extraOptions` have a higher priority? 
WDYT @cloud-fan?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

2017-12-04 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19878#discussion_r154687417
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
 ---
@@ -389,13 +408,21 @@ abstract class HashExpression[E] extends Expression {
   input: String,
   result: String,
   fields: Array[StructField]): String = {
-val hashes = fields.zipWithIndex.map { case (field, index) =>
+val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
   nullSafeElementHash(input, index.toString, field.nullable, 
field.dataType, result, ctx)
 }
+val hashResultType = ctx.javaType(dataType)
--- End diff --

nit: this is also done at line 281. Can we do this only once, maybe with a 
`lazy val`?
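
For reference, a small standalone sketch of the `lazy val` suggestion (hypothetical class, not the PR's code): the body runs once on first access and the result is cached. (As noted elsewhere in the thread, this does not apply when the value depends on the method-scoped `ctx`.)

```
// Sketch: a lazy val memoizes the computation across accesses.
class HashCodegenLike {
  var computeCount = 0
  lazy val hashResultType: String = {
    computeCount += 1 // runs only on the first access
    "int"
  }
}

val h = new HashCodegenLike
h.hashResultType
h.hashResultType
assert(h.computeCount == 1)
```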


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19878#discussion_r154683889
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
 ---
@@ -730,23 +776,29 @@ case class HiveHash(children: Seq[Expression]) 
extends HashExpression[Int] {
   input: String,
   result: String,
   fields: Array[StructField]): String = {
-val localResult = ctx.freshName("localResult")
 val childResult = ctx.freshName("childResult")
-fields.zipWithIndex.map { case (field, index) =>
+val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
+  val computeFieldHash = nullSafeElementHash(
+input, index.toString, field.nullable, field.dataType, 
childResult, ctx)
   s"""
- $childResult = 0;
- ${nullSafeElementHash(input, index.toString, field.nullable, 
field.dataType,
-   childResult, ctx)}
- $localResult = (31 * $localResult) + $childResult;
-   """
-}.mkString(
--- End diff --

We forgot to split the code for computing the Hive hash of a struct; it's fixed 
now.
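
For reference, the accumulation the quoted codegen performs, as a plain-Scala sketch (the `31 * localResult + childResult` rule is taken from the diff; the starting value of zero is assumed):

```
// Sketch of the per-field accumulation in the generated code above.
def hiveStructHash(fieldHashes: Seq[Int]): Int =
  fieldHashes.foldLeft(0)((localResult, childResult) => 31 * localResult + childResult)

assert(hiveStructHash(Seq(1, 2)) == 31 * (31 * 0 + 1) + 2)
```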


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

2017-12-04 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19878#discussion_r154683716
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
 ---
@@ -270,17 +270,36 @@ abstract class HashExpression[E] extends Expression {
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
 ev.isNull = "false"
-val childrenHash = ctx.splitExpressions(children.map { child =>
+
+val childrenHash = children.map { child =>
   val childGen = child.genCode(ctx)
   childGen.code + ctx.nullSafeExec(child.nullable, childGen.isNull) {
 computeHash(childGen.value, child.dataType, ev.value, ctx)
   }
-})
+}
+
+val hashResultType = ctx.javaType(dataType)
+val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
--- End diff --

I think @kiszk is doing this


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19869
  
**[Test build #84432 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84432/testReport)**
 for PR 19869 at commit 
[`b1acd35`](https://github.com/apache/spark/commit/b1acd35d24ea449a97434b54759cf8ed441661f2).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19878#discussion_r154682872
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
 ---
@@ -270,17 +270,36 @@ abstract class HashExpression[E] extends Expression {
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
 ev.isNull = "false"
-val childrenHash = ctx.splitExpressions(children.map { child =>
+
+val childrenHash = children.map { child =>
   val childGen = child.genCode(ctx)
   childGen.code + ctx.nullSafeExec(child.nullable, childGen.isNull) {
 computeHash(childGen.value, child.dataType, ev.value, ctx)
   }
-})
+}
+
+val hashResultType = ctx.javaType(dataType)
+val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
--- End diff --

This pattern appears many times in the code base; we may need to create a 
`ctx.splitExpressionsWithCurrentInput` for it later.
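
A rough standalone sketch of what such a helper might do (the name comes from the comment above; everything else is an assumption, not the actual Spark API):

```
// Sketch: split snippets into methods only when an input row variable
// is available to pass along; otherwise keep them inline, since
// split-out methods would have no row to reference.
object SplitHelper {
  def splitExpressionsWithCurrentInput(
      inputRowAvailable: Boolean,
      snippets: Seq[String]): String = {
    if (!inputRowAvailable) {
      snippets.mkString("\n") // whole-stage codegen path: keep inline
    } else {
      snippets.zipWithIndex.map { case (body, i) =>
        s"private void apply_$i(InternalRow i) { $body }"
      }.mkString("\n") // row-based path: one method per chunk
    }
  }
}
```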


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19869: [SPARK-22677][SQL] cleanup whole stage codegen fo...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19869#discussion_r154682180
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
 ---
@@ -596,7 +596,7 @@ case class HashAggregateExec(
 ctx.addMutableState(fastHashMapClassName, fastHashMapTerm,
   s"$fastHashMapTerm = new $fastHashMapClassName();")
 ctx.addMutableState(
-  classOf[java.util.Iterator[ColumnarRow]].getName,
+  s"java.util.Iterator<${classOf[ColumnarRow]}>",
--- End diff --

damn...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19878
  
**[Test build #84431 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84431/testReport)**
 for PR 19878 at commit 
[`0e9998e`](https://github.com/apache/spark/commit/0e9998e0704b54d8f1352a1936c9b6367ebee15e).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19879: [SPARK-20706][SPARK-SHELL] Spark-shell not overriding me...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19879
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19865: [SPARK-22668][SQL] Do not pass global variables to argum...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19865
  
Hopefully https://github.com/apache/spark/pull/19878 fixes the problem.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19879: [SPARK-20706][SPARK-SHELL] Spark-shell not overri...

2017-12-04 Thread mpetruska
GitHub user mpetruska opened a pull request:

https://github.com/apache/spark/pull/19879

[SPARK-20706][SPARK-SHELL] Spark-shell not overriding method/variable 
definition

## What changes were proposed in this pull request?

[SPARK-20706](https://issues.apache.org/jira/browse/SPARK-20706): 
Spark-shell not overriding method/variable definition
This is a Scala REPL bug ([SI-9740](https://github.com/scala/bug/issues/9740)) 
that was fixed in Scala 2.11.9 
([see the original PR](https://github.com/scala/scala/pull/5090)).
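
For illustration only, the shape of the symptom in an affected shell (a hypothetical transcript, not the JIRA's exact reproduction):

```
scala> def showMsg = "first"
showMsg: String

scala> def showMsg = "second"   // redefinition
showMsg: String

scala> showMsg
res0: String = first            // buggy REPL still resolves the old definition
```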

## How was this patch tested?

Added a new test case in `ReplSuite`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpetruska/spark SPARK-20706

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19879.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19879


commit 70c35fd4f7d7646586980fd6eac0d4039be1daa6
Author: Mark Petruska 
Date:   2017-12-04T15:27:15Z

fixes spark shell import issue (scala bug: SI-9740)




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19878
  
cc @kiszk @mgaido91 @viirya @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

2017-12-04 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/19878

[SPARK-22682][SQL] HashExpression does not need to create global variables

## What changes were proposed in this pull request?

It turns out that `HashExpression` can pass some values around via parameters 
when splitting code into methods, saving some global variable slots.

This also prevents a weird case where a global variable appears in a 
parameter list, which was discovered by 
https://github.com/apache/spark/pull/19865
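
A standalone sketch of the idea (plain Scala, not the codegen framework): thread the running hash through parameters instead of a mutable field, so the split methods need no global slot.

```
// Before (sketch): split methods communicate through a field,
// costing one "global variable" slot in the generated class.
object GlobalStyle {
  var value: Int = 0
  def hashField(v: Int): Unit = { value = 31 * value + v }
  def hash(v1: Int, v2: Int): Int = { value = 42; hashField(v1); hashField(v2); value }
}

// After (sketch): the running hash travels through parameters.
object ParamStyle {
  def hashField(value: Int, v: Int): Int = 31 * value + v
  def hash(v1: Int, v2: Int): Int = hashField(hashField(42, v1), v2)
}

assert(GlobalStyle.hash(1, 2) == ParamStyle.hash(1, 2))
```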

## How was this patch tested?

existing tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark minor

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19878.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19878


commit 0e9998e0704b54d8f1352a1936c9b6367ebee15e
Author: Wenchen Fan 
Date:   2017-12-04T15:24:46Z

HashExpression does not need to create global variables




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date fun...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19875
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date fun...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19875
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84425/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date fun...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19875
  
**[Test build #84425 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84425/testReport)**
 for PR 19875 at commit 
[`424e471`](https://github.com/apache/spark/commit/424e47175387e063a60fe06287f77703cf400045).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19873
  
**[Test build #84430 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84430/testReport)**
 for PR 19873 at commit 
[`9f5a0e4`](https://github.com/apache/spark/commit/9f5a0e458fa0cb42d6850e16d74994af1b1a3752).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19792
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84429/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19792
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19792
  
**[Test build #84429 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84429/testReport)**
 for PR 19792 at commit 
[`41766fa`](https://github.com/apache/spark/commit/41766fa875b987fecf910b7fa8bd9429e27ce88e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19789: [SPARK-22562][Streaming] CachedKafkaConsumer unsafe evic...

2017-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19789
  
I think we should wait for @zsxwing's feedback.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19792
  
**[Test build #84429 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84429/testReport)**
 for PR 19792 at commit 
[`41766fa`](https://github.com/apache/spark/commit/41766fa875b987fecf910b7fa8bd9429e27ce88e).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...

2017-12-04 Thread gberger
Github user gberger commented on the issue:

https://github.com/apache/spark/pull/19792
  
Fixed



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19870: [SPARK-22665][SQL] Avoid repartitioning with empty list ...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19870
  
**[Test build #84428 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84428/testReport)**
 for PR 19870 at commit 
[`b4f985f`](https://github.com/apache/spark/commit/b4f985fc81f4664de538e7bcae0e134bce8ae735).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18424: [SPARK-17091] Add rule to convert IN predicate to equiva...

2017-12-04 Thread ptkool
Github user ptkool commented on the issue:

https://github.com/apache/spark/pull/18424
  
@a10y Yes, I'm still tracking this.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date fun...

2017-12-04 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/19875
  
Yes @HyukjinKwon, you are 100% right; sorry for this error.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

2017-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r154644084
  
--- Diff: python/pyspark/sql/udf.py ---
@@ -56,6 +56,10 @@ def _create_udf(f, returnType, evalType):
 return udf_obj._wrapped()
 
 
+class UDFColumn(Column):
--- End diff --

BTW, what do you think about adding an attribute instead?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

2017-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r154642230
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import java.io.File
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.{SparkEnv, TaskContext}
+import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, 
AttributeSet, Expression, JoinedRow, SortOrder, UnsafeProjection, UnsafeRow}
+import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, 
ClusteredDistribution, Distribution, Partitioning}
+import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, 
UnaryExecNode}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.Utils
+
+case class AggregateInPandasExec(
+groupingAttributes: Seq[Attribute],
+func: Seq[Expression],
+output: Seq[Attribute],
+child: SparkPlan)
+  extends UnaryExecNode {
+  private val udfs = func.map(expr => expr.asInstanceOf[PythonUDF])
+
+  override def outputPartitioning: Partitioning = child.outputPartitioning
+
+  override def producedAttributes: AttributeSet = AttributeSet(output)
+
+  override def requiredChildDistribution: Seq[Distribution] = {
+if (groupingAttributes.isEmpty) {
+  AllTuples :: Nil
+} else {
+  ClusteredDistribution(groupingAttributes) :: Nil
+}
+  }
+
+  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, 
Seq[Expression]) = {
+udf.children match {
+  case Seq(u: PythonUDF) =>
+val (chained, children) = collectFunctions(u)
+(ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
+  case children =>
+// There should not be any other UDFs, or the children can't be 
evaluated directly.
+assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
+(ChainedPythonFunctions(Seq(udf.func)), udf.children)
+}
+  }
+
+  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
+Seq(groupingAttributes.map(SortOrder(_, Ascending)))
+
+  override protected def doExecute(): RDD[InternalRow] = {
+val inputRDD = child.execute()
+
+val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
+val reuseWorker = 
inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
+// val argOffsets = Array((0 until (child.output.length - 
groupingAttributes.length)).toArray)
+val schema = StructType(child.schema.drop(groupingAttributes.length))
+val sessionLocalTimeZone = conf.sessionLocalTimeZone
+val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
+
+val (pyFuncs, inputs) = udfs.map(collectFunctions).unzip
+
+val allInputs = new ArrayBuffer[Expression]
+
+val argOffsets = inputs.map { input =>
+  input.map { e =>
+  allInputs += e
--- End diff --

indentation nit


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

2017-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r154642902
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/pythonLogicalOperators.scala
 ---
@@ -38,3 +38,13 @@ case class FlatMapGroupsInPandas(
*/
   override val producedAttributes = AttributeSet(output)
 }
+
+case class AggregateInPandas(
+groupingAttributes: Seq[Attribute],
+functionExprs: Seq[Expression],
+output: Seq[Attribute],
+child: LogicalPlan
+) extends UnaryNode {
--- End diff --

nit:

```
child: LogicalPlan) extends UnaryNode {
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

2017-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r154644620
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -4016,6 +4016,89 @@ def test_unsupported_types(self):
 with self.assertRaisesRegexp(Exception, 'Unsupported data 
type'):
 df.groupby('id').apply(f).collect()
 
+@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not 
installed")
+class GroupbyAggTests(ReusedSQLTestCase):
+def assertFramesEqual(self, expected, result):
+msg = ("DataFrames are not equal: " +
+   ("\n\nExpected:\n%s\n%s" % (expected, expected.dtypes)) +
--- End diff --

indentation nit


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

2017-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r154644235
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -437,6 +437,37 @@ class RelationalGroupedDataset protected[sql](
   df.logicalPlan))
   }
 
+
+  private[sql] def aggInPandas(columns: Seq[Column]): DataFrame = {
+val exprs = columns.map(column => column.expr.asInstanceOf[PythonUDF])
+
+val groupingNamedExpressions = groupingExprs.map {
+  case ne: NamedExpression => ne
+  case other => Alias(other, other.toString)()
+}
+
+val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
+
+val child = df.logicalPlan
+
+val childrenExpressions = exprs.flatMap(expr =>
+  expr.children.map {
+  case ne: NamedExpression => ne
+  case other => Alias(other, other.toString)()
--- End diff --

indentation nit


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...

2017-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19872#discussion_r154644340
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -437,6 +437,37 @@ class RelationalGroupedDataset protected[sql](
   df.logicalPlan))
   }
 
+
+  private[sql] def aggInPandas(columns: Seq[Column]): DataFrame = {
+val exprs = columns.map(column => column.expr.asInstanceOf[PythonUDF])
+
+val groupingNamedExpressions = groupingExprs.map {
+  case ne: NamedExpression => ne
+  case other => Alias(other, other.toString)()
+}
+
+val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
+
+val child = df.logicalPlan
+
+val childrenExpressions = exprs.flatMap(expr =>
+  expr.children.map {
+  case ne: NamedExpression => ne
+  case other => Alias(other, other.toString)()
+})
+
+val project = Project(groupingNamedExpressions ++ childrenExpressions, 
child)
+
+val udfOutputs = exprs.flatMap(expr =>
+  Seq(AttributeReference(expr.name, expr.dataType)())
+)
--- End diff --

I think this could be inlined.
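
A tiny standalone analogy of the suggestion: a `flatMap` over one-element `Seq`s is just a `map`, so the wrapper can be dropped and the expression inlined.

```
// Sketch: flatMap(e => Seq(f(e))) is equivalent to map(f).
case class Ref(name: String)
val exprs = Seq("a", "b", "c")

val viaFlatMap = exprs.flatMap(e => Seq(Ref(e)))
val viaMap = exprs.map(Ref(_))
assert(viaFlatMap == viaMap)
```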


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #895: [SPARK-1940] Enabling rolling of executor logs, and automa...

2017-12-04 Thread wbowditch
Github user wbowditch commented on the issue:

https://github.com/apache/spark/pull/895
  
Can these configuration additions be added to the Spark documentation 
(https://spark.apache.org/docs/latest/configuration.html)?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154640805
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +574,12 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
--- End diff --

After more thought, I don't think it's worth passing the whole SQLConf into 
this function; we just need to know whether `SQLConf.ORC_USE_NEW_VERSION` is 
enabled.
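
A minimal sketch of the suggested narrowing (hypothetical helper; the two class names are assumed from the PR's sql/core vs. sql/hive split):

```
// Sketch: depend on a single boolean flag instead of the whole SQLConf.
def orcImplementationClassName(useNewOrcImpl: Boolean): String =
  if (useNewOrcImpl) {
    "org.apache.spark.sql.execution.datasources.orc.OrcFileFormat" // sql/core reader
  } else {
    "org.apache.spark.sql.hive.orc.OrcFileFormat" // legacy sql/hive reader
  }
```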


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


