[GitHub] spark issue #22246: [SPARK-25235] [SHELL] Merge the REPL code in Scala 2.11 ...

2018-08-27 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/22246
  
Ping @srowen 


---




[GitHub] spark pull request #22246: [SPARK-25235] [SHELL] Merge the REPL code in Scal...

2018-08-27 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/22246

[SPARK-25235] [SHELL] Merge the REPL code in Scala 2.11 and 2.12 branches

## What changes were proposed in this pull request?

Using some reflection tricks to merge the Scala 2.11 and 2.12 codebases.
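
For a flavor of the trick, here is a minimal, hypothetical sketch (not the PR's actual code): invoking a method by name via Java reflection, so the same source can call REPL APIs that exist in only one of the two Scala versions.

```scala
// Hypothetical helper: look up a method by name at runtime and invoke it if
// present, avoiding compile-time references to version-specific REPL APIs.
def invokeIfPresent(target: AnyRef, name: String, args: AnyRef*): Option[AnyRef] = {
  target.getClass.getMethods.find(_.getName == name).map { m =>
    m.setAccessible(true)
    m.invoke(target, args: _*)
  }
}
```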

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark repl

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22246.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22246


commit 8669c21e1dc97e660a13b1cad598d1dbe8e44731
Author: DB Tsai 
Date:   2018-08-24T00:39:07Z

Consolidated Scala 2.11 and 2.12 branches

commit 3808f02fdc2d914f7a022d00884034d8d8ceb19f
Author: Liang-Chi Hsieh 
Date:   2018-08-27T11:26:29Z

Get static loader object and invoke method on it.

commit 075ca4a0c25503e4df4bc880f6ea58ead2eabcbe
Author: DB Tsai 
Date:   2018-08-27T17:54:50Z

Changed message




---




[GitHub] spark issue #22244: [WIP][SPARK-24721][SPARK-25213][SQL] extract python UDF ...

2018-08-27 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/22244
  
@cloud-fan Thanks! I will take a look later today and incorporate this with 
my patch. 


---




[GitHub] spark issue #22208: [SPARK-25216][SQL] Improve error message when a column c...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22208
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22208: [SPARK-25216][SQL] Improve error message when a column c...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22208
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2580/
Test PASSed.


---




[GitHub] spark pull request #22233: [SPARK-25240][SQL] Fix for a deadlock in RECOVER ...

2018-08-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22233#discussion_r213057049
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -671,7 +674,7 @@ case class AlterTableRecoverPartitionsCommand(
 val value = ExternalCatalogUtils.unescapePathName(ps(1))
 if (resolver(columnName, partitionNames.head)) {
   scanPartitions(spark, fs, filter, st.getPath, spec ++ 
Map(partitionNames.head -> value),
-partitionNames.drop(1), threshold, resolver)
+partitionNames.drop(1), threshold, resolver, 
listFilesInParallel = false)
--- End diff --

cc @zsxwing had a few offline comments about the original PR for `parmap`. 
He will post them soon. 


---




[GitHub] spark issue #22173: [SPARK-24335] Spark external shuffle server improvement ...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22173
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95300/
Test FAILed.


---




[GitHub] spark issue #22173: [SPARK-24335] Spark external shuffle server improvement ...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22173
  
**[Test build #95300 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95300/testReport)**
 for PR 22173 at commit 
[`d86503c`](https://github.com/apache/spark/commit/d86503cf34f66d7082df8677e78f5f793e1064a0).
 * This patch **fails Java style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22173: [SPARK-24335] Spark external shuffle server improvement ...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22173
  
Merged build finished. Test FAILed.


---




[GitHub] spark pull request #21976: [SPARK-24909][core] Always unregister pending par...

2018-08-27 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/21976#discussion_r213056190
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2474,19 +2478,21 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with TimeLi
 runEvent(makeCompletionEvent(
   taskSets(3).tasks(0), Success, makeMapStatus("hostB", 2)))
 
-// There should be no new attempt of stage submitted,
-// because task(stageId=1, stageAttempt=1, partitionId=1) is still 
running in
-// the current attempt (and hasn't completed successfully in any 
earlier attempts).
-assert(taskSets.size === 4)
+// At this point there should be no active task set for stageId=1 and 
we need
+// to resubmit because the output from (stageId=1, stageAttemptId=0, 
partitionId=1)
+// was ignored due to executor failure
+assert(taskSets.size === 5)
+assert(taskSets(4).stageId === 1 && taskSets(4).stageAttemptId === 2
+  && taskSets(4).tasks.size === 1)
 
-// Complete task(stageId=1, stageAttempt=1, partitionId=1) 
successfully.
+// Complete task(stageId=1, stageAttempt=2, partitionId=1) 
successfully.
 runEvent(makeCompletionEvent(
-  taskSets(3).tasks(1), Success, makeMapStatus("hostB", 2)))
+  taskSets(4).tasks(0), Success, makeMapStatus("hostB", 2)))
--- End diff --

Yes it will; marking either of these successful will work, but the assumption on line 2469 is that it was already marked completed there by the TaskSetManager. So we don't want to send success for taskSet(3).task(1), as it should already have been marked successful.

Unfortunately you can't test those interactions in this unit test, which is why I'm working on another scheduler integration test, but that was going to be done under a separate JIRA.


---




[GitHub] spark issue #22208: [SPARK-25216][SQL] Improve error message when a column c...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22208
  
**[Test build #95301 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95301/testReport)**
 for PR 22208 at commit 
[`01f9cd5`](https://github.com/apache/spark/commit/01f9cd5c0450ce35f7e91ebe7328cdee3e911441).


---




[GitHub] spark issue #22173: [SPARK-24335] Spark external shuffle server improvement ...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22173
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95299/
Test FAILed.


---




[GitHub] spark issue #22173: [SPARK-24335] Spark external shuffle server improvement ...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22173
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22245: [SPARK-24882][FOLLOWUP] Fix flaky synchronization in Kaf...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22245
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22173: [SPARK-24335] Spark external shuffle server improvement ...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22173
  
**[Test build #95299 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95299/testReport)**
 for PR 22173 at commit 
[`50258f7`](https://github.com/apache/spark/commit/50258f7595a49373d64d8831ff3ce410eef6e0cf).
 * This patch **fails Java style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22245: [SPARK-24882][FOLLOWUP] Fix flaky synchronization in Kaf...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22245
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95298/
Test PASSed.


---




[GitHub] spark issue #22245: [SPARK-24882][FOLLOWUP] Fix flaky synchronization in Kaf...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22245
  
**[Test build #95298 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95298/testReport)**
 for PR 22245 at commit 
[`93c7bd9`](https://github.com/apache/spark/commit/93c7bd93f5dbec41a0fd4d6b5ef0bfe0bfdc235c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22226: [SPARK-25252][SQL] Support arrays of any types by to_jso...

2018-08-27 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/22226
  
> Probably, you'd be better to file separate jira for each function.
> +1 for separate JIRA.

I created the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25252



---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22042
  
**[Test build #95297 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95297/testReport)**
 for PR 22042 at commit 
[`7a02921`](https://github.com/apache/spark/commit/7a02921950cda865e3cd45f1d1635212c2f707c0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22042
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22042
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95297/
Test PASSed.


---




[GitHub] spark issue #22173: [SPARK-24335] Spark external shuffle server improvement ...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22173
  
**[Test build #95300 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95300/testReport)**
 for PR 22173 at commit 
[`d86503c`](https://github.com/apache/spark/commit/d86503cf34f66d7082df8677e78f5f793e1064a0).


---




[GitHub] spark pull request #22233: [SPARK-25240][SQL] Fix for a deadlock in RECOVER ...

2018-08-27 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22233#discussion_r213050406
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -671,7 +674,7 @@ case class AlterTableRecoverPartitionsCommand(
 val value = ExternalCatalogUtils.unescapePathName(ps(1))
 if (resolver(columnName, partitionNames.head)) {
   scanPartitions(spark, fs, filter, st.getPath, spec ++ 
Map(partitionNames.head -> value),
-partitionNames.drop(1), threshold, resolver)
+partitionNames.drop(1), threshold, resolver, 
listFilesInParallel = false)
--- End diff --

I think the root cause is clear: a fixed thread pool plus submitting and waiting on a future from inside another future on the same thread pool. @gatorsmile I will revert back to a parallel collection here if you don't mind, since there is no reason for `parmap` in this place.
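
For reference, a minimal sketch of the deadlock pattern described above (pool size 1 for clarity; illustrative, not Spark's code):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// One worker thread: the outer future occupies it while blocking on the
// inner future, which can therefore never be scheduled. Deadlock.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(1))

val outer = Future {
  val inner = Future(42)            // queued behind the still-running outer task
  Await.result(inner, Duration.Inf) // blocks the only thread forever
}
```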


---




[GitHub] spark pull request #22085: [SPARK-25095][PySpark] Python support for Barrier...

2018-08-27 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22085#discussion_r213050049
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -180,7 +188,73 @@ private[spark] abstract class BasePythonRunner[IN, 
OUT](
 dataOut.writeInt(partitionIndex)
 // Python version of driver
 PythonRDD.writeUTF(pythonVer, dataOut)
+// Init a ServerSocket to accept method calls from Python side.
+val isBarrier = context.isInstanceOf[BarrierTaskContext]
+if (isBarrier) {
+  serverSocket = Some(new ServerSocket(/* port */ 0,
+/* backlog */ 1,
+InetAddress.getByName("localhost")))
+  // A call to accept() for ServerSocket shall block infinitely.
+  serverSocket.map(_.setSoTimeout(0))
+  new Thread("accept-connections") {
+setDaemon(true)
+
+override def run(): Unit = {
+  while (!serverSocket.get.isClosed()) {
+var sock: Socket = null
+try {
+  sock = serverSocket.get.accept()
+  // Wait for function call from python side.
+  sock.setSoTimeout(1)
+  val input = new DataInputStream(sock.getInputStream())
--- End diff --

Thanks for catching this, yea I agree it would be better to move the 
authentication before recognising functions.


---




[GitHub] spark pull request #22238: [SPARK-25245][DOCS][SS] Explain regarding limitin...

2018-08-27 Thread arunmahadevan
Github user arunmahadevan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22238#discussion_r213049895
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -2812,6 +2812,12 @@ See [Input Sources](#input-sources) and [Output 
Sinks](#output-sinks) sections f
 
 # Additional Information
 
+**Gotchas**
--- End diff --

IMO, it would be better to keep it here as well as in the code; we may not be able to surface it in the right API docs, and there is a chance users would ignore it.

@HeartSaVioR, maybe add an example here to illustrate how to use coalesce? Something like the sketch below.
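
A minimal sketch, assuming the built-in `rate` test source and illustrative paths (not text from the guide itself):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("coalesce-demo")
  .getOrCreate()

// Cap the number of output files per micro-batch by shrinking the number of
// partitions just before the sink; paths below are illustrative only.
val query = spark.readStream.format("rate").load()
  .coalesce(2)
  .writeStream
  .format("parquet")
  .option("path", "/tmp/out")
  .option("checkpointLocation", "/tmp/ckpt")
  .start()
```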



---




[GitHub] spark pull request #22213: [SPARK-25221][DEPLOY] Consistent trailing whitesp...

2018-08-27 Thread gerashegalov
Github user gerashegalov commented on a diff in the pull request:

https://github.com/apache/spark/pull/22213#discussion_r213049701
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2062,8 +2062,10 @@ private[spark] object Utils extends Logging {
 try {
   val properties = new Properties()
   properties.load(inReader)
-  properties.stringPropertyNames().asScala.map(
-k => (k, properties.getProperty(k).trim)).toMap
+  properties.stringPropertyNames().asScala
+.map(k => (k, properties.getProperty(k)))
--- End diff --

@jerryshao `trim` also removes leading spaces, which are totally legitimate.

I also need more info regarding what you mean by ASCII in this context.
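
For illustration, trimming only trailing whitespace, so that legitimate leading spaces survive, could look like this (a sketch, not the patch itself):

```scala
// Strip trailing whitespace only; leading spaces in the value are preserved.
def trimTrailing(s: String): String = s.replaceAll("\\s+$", "")

trimTrailing("  some value \t ")  // returns "  some value"
```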


---




[GitHub] spark issue #22173: [SPARK-24335] Spark external shuffle server improvement ...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22173
  
**[Test build #95299 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95299/testReport)**
 for PR 22173 at commit 
[`50258f7`](https://github.com/apache/spark/commit/50258f7595a49373d64d8831ff3ce410eef6e0cf).


---




[GitHub] spark pull request #22146: [SPARK-24434][K8S] pod template files

2018-08-27 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/22146#discussion_r213047538
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -185,6 +185,21 @@ To use a secret through an environment variable use 
the following options to the
 --conf spark.kubernetes.executor.secretKeyRef.ENV_NAME=name:key
 ```
 
+## Pod Template
+Kubernetes allows defining pods from [template 
files](https://kubernetes.io/docs/concepts/workloads/pods/pod-overview/#pod-templates).
+Spark users can similarly use template files to define the driver or 
executor pod configurations that Spark configurations do not support.
+To do so, specify the spark properties 
`spark.kubernetes.driver.podTemplateFile` and 
`spark.kubernetes.executor.podTemplateFile`
+to point to local files accessible to the `spark-submit` process. To allow 
the driver pod access the executor pod template
+file, the file will be automatically mounted onto a volume in the driver 
pod when it's created.
+
+It is important to note that Spark is opinionated about certain pod 
configurations so there are values in the
+pod template that will always be overwritten by Spark. Therefore, users of 
this feature should note that specifying
+the pod template file only lets Spark start with a template pod instead of 
an empty pod during the pod-building process.
+For details, see the [full list](#pod-template-properties) of pod template 
values that will be overwritten by spark.
+
+Pod template files can also define multiple containers. In such cases, 
Spark will always assume that the first container in
+the list will be the driver or executor container.
--- End diff --

Is it possible to use only extra containers, not Spark-specific ones? Could we have a naming convention, or some less error-prone convention?


---




[GitHub] spark issue #18447: [SPARK-21232][SQL][SparkR][PYSPARK] New built-in SQL fun...

2018-08-27 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18447
  
Yea I'd probably reject this for now, until we see bigger needs for it.



---




[GitHub] spark pull request #22241: [SPARK-25249][CORE][TEST]add a unit test for Open...

2018-08-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22241


---




[GitHub] spark issue #22245: [SPARK-24882][FOLLOWUP] Fix flaky synchronization in Kaf...

2018-08-27 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/22245
  
LGTM pending tests


---




[GitHub] spark pull request #22192: [SPARK-24918][Core] Executor Plugin API

2018-08-27 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/22192#discussion_r213045752
  
--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -130,6 +130,16 @@ private[spark] class Executor(
   private val urlClassLoader = createClassLoader()
   private val replClassLoader = addReplClassLoaderIfNeeded(urlClassLoader)
 
+  // One thread will handle loading all of the plugins on this executor
--- End diff --

I guess it does depend on what the intended use is here. If we have it in the same thread, it has the issue that it could block the executor or take too long, and things start timing out. It can have a more direct impact on the executor code itself, whereas a separate thread isolates it more. But like you say, if it's not here and we don't wait for it, then we could have ordering issues if certain plugins have to be initialized before other things happen. I can see both arguments as well. So perhaps the API needs an init-type function that can be called more inline with a timeout, to prevent it from taking too long, with the main part of the plugin called in a separate thread?
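
A rough sketch of that split, with purely hypothetical names (the shape of the idea, not a proposed API):

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical plugin shape: init() runs inline under a timeout, while the
// long-running part runs on its own daemon thread.
trait ExecutorPluginSketch {
  def init(): Unit
  def mainLoop(): Unit
}

def startPlugin(plugin: ExecutorPluginSketch, initTimeoutSec: Long): Unit = {
  val pool = Executors.newSingleThreadExecutor()
  val initTask = pool.submit(new Runnable { def run(): Unit = plugin.init() })
  initTask.get(initTimeoutSec, TimeUnit.SECONDS) // fail fast if init hangs
  pool.shutdown()
  val worker = new Thread(new Runnable { def run(): Unit = plugin.mainLoop() })
  worker.setDaemon(true)
  worker.start()
}
```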




---




[GitHub] spark issue #22245: [SPARK-24882][FOLLOWUP] Fix flaky synchronization in Kaf...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22245
  
**[Test build #95298 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95298/testReport)**
 for PR 22245 at commit 
[`93c7bd9`](https://github.com/apache/spark/commit/93c7bd93f5dbec41a0fd4d6b5ef0bfe0bfdc235c).


---




[GitHub] spark issue #22245: [SPARK-24882][FOLLOWUP] Fix flaky synchronization in Kaf...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22245
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #22223: [SPARK-25233][Streaming] Give the user the option of spe...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22223
  
**[Test build #4296 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4296/testReport)**
 for PR 22223 at commit 
[`85ece1c`](https://github.com/apache/spark/commit/85ece1c0866164a3f5a260b6e226b01c1fd1dd81).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22245: [SPARK-24882][FOLLOWUP] Fix flaky synchronization in Kaf...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22245
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #22245: [SPARK-24882][FOLLOWUP] Fix flaky synchronization...

2018-08-27 Thread jose-torres
GitHub user jose-torres opened a pull request:

https://github.com/apache/spark/pull/22245

[SPARK-24882][FOLLOWUP] Fix flaky synchronization in Kafka tests.

## What changes were proposed in this pull request?

Fix flaky synchronization in Kafka tests - we need to use the scan config 
that was persisted rather than reconstructing it to identify the stream's 
current configuration.

We caught most instances of this in the original PR, but this one slipped 
through.

## How was this patch tested?

n/a

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jose-torres/spark fixflake

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22245.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22245


commit 93c7bd93f5dbec41a0fd4d6b5ef0bfe0bfdc235c
Author: Jose Torres 
Date:   2018-08-27T17:03:17Z

fix flake




---




[GitHub] spark issue #22241: [SPARK-25249][CORE][TEST]add a unit test for OpenHashMap

2018-08-27 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22241
  
Merged to master


---




[GitHub] spark pull request #22085: [SPARK-25095][PySpark] Python support for Barrier...

2018-08-27 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/22085#discussion_r213043068
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -180,7 +188,73 @@ private[spark] abstract class BasePythonRunner[IN, 
OUT](
 dataOut.writeInt(partitionIndex)
 // Python version of driver
 PythonRDD.writeUTF(pythonVer, dataOut)
+// Init a ServerSocket to accept method calls from Python side.
+val isBarrier = context.isInstanceOf[BarrierTaskContext]
+if (isBarrier) {
+  serverSocket = Some(new ServerSocket(/* port */ 0,
+/* backlog */ 1,
+InetAddress.getByName("localhost")))
+  // A call to accept() for ServerSocket shall block infinitely.
+  serverSocket.map(_.setSoTimeout(0))
+  new Thread("accept-connections") {
+setDaemon(true)
+
+override def run(): Unit = {
+  while (!serverSocket.get.isClosed()) {
+var sock: Socket = null
+try {
+  sock = serverSocket.get.accept()
+  // Wait for function call from python side.
+  sock.setSoTimeout(1)
+  val input = new DataInputStream(sock.getInputStream())
--- End diff --

(I'd also like to do some refactoring of the socket setup code in Python, and that can go further if we do authentication first here)


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22042
  
**[Test build #95297 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95297/testReport)**
 for PR 22042 at commit 
[`7a02921`](https://github.com/apache/spark/commit/7a02921950cda865e3cd45f1d1635212c2f707c0).


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22042
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22042
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2579/
Test PASSed.


---




[GitHub] spark issue #21330: [SPARK-22234] Support distinct window functions

2018-08-27 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21330
  
cc @jiangxb1987 


---




[GitHub] spark pull request #21976: [SPARK-24909][core] Always unregister pending par...

2018-08-27 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21976#discussion_r213042176
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2474,19 +2478,21 @@ class DAGSchedulerSuite extends SparkFunSuite with 
LocalSparkContext with TimeLi
 runEvent(makeCompletionEvent(
   taskSets(3).tasks(0), Success, makeMapStatus("hostB", 2)))
 
-// There should be no new attempt of stage submitted,
-// because task(stageId=1, stageAttempt=1, partitionId=1) is still 
running in
-// the current attempt (and hasn't completed successfully in any 
earlier attempts).
-assert(taskSets.size === 4)
+// At this point there should be no active task set for stageId=1 and 
we need
+// to resubmit because the output from (stageId=1, stageAttemptId=0, 
partitionId=1)
+// was ignored due to executor failure
+assert(taskSets.size === 5)
+assert(taskSets(4).stageId === 1 && taskSets(4).stageAttemptId === 2
+  && taskSets(4).tasks.size === 1)
 
-// Complete task(stageId=1, stageAttempt=1, partitionId=1) 
successfully.
+// Complete task(stageId=1, stageAttempt=2, partitionId=1) 
successfully.
 runEvent(makeCompletionEvent(
-  taskSets(3).tasks(1), Success, makeMapStatus("hostB", 2)))
+  taskSets(4).tasks(0), Success, makeMapStatus("hostB", 2)))
--- End diff --

IIUC the test case should still pass without changing this line, right?


---




[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21546
  
**[Test build #95296 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95296/testReport)**
 for PR 21546 at commit 
[`2fe46f8`](https://github.com/apache/spark/commit/2fe46f82dc38af972bc0974aca1fd846bcb483e5).


---




[GitHub] spark pull request #22233: [SPARK-25240][SQL] Fix for a deadlock in RECOVER ...

2018-08-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22233#discussion_r213041684
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -671,7 +674,7 @@ case class AlterTableRecoverPartitionsCommand(
 val value = ExternalCatalogUtils.unescapePathName(ps(1))
 if (resolver(columnName, partitionNames.head)) {
   scanPartitions(spark, fs, filter, st.getPath, spec ++ 
Map(partitionNames.head -> value),
-partitionNames.drop(1), threshold, resolver)
+partitionNames.drop(1), threshold, resolver, 
listFilesInParallel = false)
--- End diff --

@kiszk Thanks for the investigation! Could you take a look at the root cause? If we are unable to figure it out, we need to revert back to `.par`. Thanks!


---




[GitHub] spark issue #22042: [SPARK-25005][SS]Support non-consecutive offsets for Kaf...

2018-08-27 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/22042
  
retest this please


---




[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21546
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2578/
Test PASSed.


---




[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21546
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22223: [SPARK-25233][Streaming] Give the user the option of spe...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22223
  
**[Test build #4296 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4296/testReport)**
 for PR 22223 at commit 
[`85ece1c`](https://github.com/apache/spark/commit/85ece1c0866164a3f5a260b6e226b01c1fd1dd81).


---




[GitHub] spark issue #22197: [SPARK-25207][SQL] Case-insensitve field resolution for ...

2018-08-27 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22197
  
Thanks, I got it. Definitely, it's irrelevant to this PR and is an intentional regression due to that revert.


---




[GitHub] spark pull request #22024: [SPARK-25034][CORE] Remove allocations in onBlock...

2018-08-27 Thread vincent-grosbois
Github user vincent-grosbois commented on a diff in the pull request:

https://github.com/apache/spark/pull/22024#discussion_r213037747
  
--- Diff: 
core/src/main/scala/org/apache/spark/network/BlockTransferService.scala ---
@@ -101,15 +101,7 @@ abstract class BlockTransferService extends 
ShuffleClient with Closeable with Lo
   result.failure(exception)
 }
 override def onBlockFetchSuccess(blockId: String, data: 
ManagedBuffer): Unit = {
-  data match {
-case f: FileSegmentManagedBuffer =>
-  result.success(f)
-case _ =>
-  val ret = ByteBuffer.allocate(data.size.toInt)
--- End diff --

I don't really understand the point of this initial commit, tbh; was there ever a rationale for it? (I can't find any comments.)

I made sure it works by testing it on our dataset (it will indeed crash if the ref count is not incremented).

All 69f5d0a does is transform a ManagedBuffer (abstract) into the concrete subtype NioManagedBuffer. There is no real reason to copy the data, as long as you keep track of the reference count.



---




[GitHub] spark issue #22236: [SPARK-10697][ML] Add lift to Association rules

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22236
  
**[Test build #95294 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95294/testReport)**
 for PR 22236 at commit 
[`957a6a2`](https://github.com/apache/spark/commit/957a6a2cf0e05f01c2c2d602944b8da8cfb1b426).


---




[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21638
  
**[Test build #95295 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95295/testReport)**
 for PR 21638 at commit 
[`5e46efb`](https://github.com/apache/spark/commit/5e46efb5f5ce86297c4aeb23bf934fd9942de3de).


---




[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21638
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2577/
Test PASSed.


---




[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21638
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22236: [SPARK-10697][ML] Add lift to Association rules

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22236
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2576/
Test PASSed.


---




[GitHub] spark issue #22236: [SPARK-10697][ML] Add lift to Association rules

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22236
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22173: [SPARK-24335] Spark external shuffle server improvement ...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22173
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95293/
Test FAILed.


---




[GitHub] spark issue #22173: [SPARK-24335] Spark external shuffle server improvement ...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22173
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22173: [SPARK-24335] Spark external shuffle server improvement ...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22173
  
**[Test build #95293 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95293/testReport)**
 for PR 22173 at commit 
[`6580ff1`](https://github.com/apache/spark/commit/6580ff1abec42f640c3090edfa32466f8f5b5212).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21977: [SPARK-25004][CORE] Add spark.executor.pyspark.me...

2018-08-27 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21977#discussion_r213035238
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala 
---
@@ -91,6 +91,13 @@ private[spark] class Client(
   private val executorMemoryOverhead = 
sparkConf.get(EXECUTOR_MEMORY_OVERHEAD).getOrElse(
 math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toLong, 
MEMORY_OVERHEAD_MIN)).toInt
 
+  private val isPython = sparkConf.get(IS_PYTHON_APP)
--- End diff --

@holdenk, can you point me to that repo? I'd love to have a look at how you 
do mixed pipelines.


---




[GitHub] spark issue #22173: [SPARK-24335] Spark external shuffle server improvement ...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22173
  
**[Test build #95293 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95293/testReport)**
 for PR 22173 at commit 
[`6580ff1`](https://github.com/apache/spark/commit/6580ff1abec42f640c3090edfa32466f8f5b5212).


---




[GitHub] spark issue #22236: [SPARK-10697][ML] Add lift to Association rules

2018-08-27 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22236
  
Yeah, I like that idea. Just compute it on initializing the model. 
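
For reference, the quantity in question for a rule X => Y, as a sketch of the arithmetic (not the PR's code):

```scala
// lift(X => Y) = confidence(X => Y) / P(Y); a value above 1 means X and Y
// co-occur more often than they would if independent.
def lift(confidence: Double, consequentCount: Long, numTransactions: Long): Double =
  confidence / (consequentCount.toDouble / numTransactions)
```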


---




[GitHub] spark pull request #22085: [SPARK-25095][PySpark] Python support for Barrier...

2018-08-27 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/22085#discussion_r213032992
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -180,7 +188,73 @@ private[spark] abstract class BasePythonRunner[IN, 
OUT](
 dataOut.writeInt(partitionIndex)
 // Python version of driver
 PythonRDD.writeUTF(pythonVer, dataOut)
+// Init a ServerSocket to accept method calls from Python side.
+val isBarrier = context.isInstanceOf[BarrierTaskContext]
+if (isBarrier) {
+  serverSocket = Some(new ServerSocket(/* port */ 0,
+/* backlog */ 1,
+InetAddress.getByName("localhost")))
+  // A call to accept() for ServerSocket shall block infinitely.
+  serverSocket.map(_.setSoTimeout(0))
+  new Thread("accept-connections") {
+setDaemon(true)
+
+override def run(): Unit = {
+  while (!serverSocket.get.isClosed()) {
+var sock: Socket = null
+try {
+  sock = serverSocket.get.accept()
+  // Wait for function call from python side.
+  sock.setSoTimeout(1)
+  val input = new DataInputStream(sock.getInputStream())
--- End diff --

Why isn't authentication the first thing that happens on this connection? I don't think anything bad can happen in this case, but it just makes it more likely we leave a security hole here later on.
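
A minimal sketch of the ordering being suggested, assuming some `authClient(sock)` check along the lines of Spark's SocketAuthHelper (names are illustrative, not the PR's code):

```scala
import java.io.DataInputStream
import java.net.{ServerSocket, Socket}

// Illustrative accept loop: authenticate the peer before reading any
// function-call opcode, so unauthenticated input is never interpreted.
def serve(server: ServerSocket, authClient: Socket => Unit): Unit = {
  while (!server.isClosed) {
    val sock = server.accept()
    try {
      authClient(sock)                      // 1. authenticate first
      val in = new DataInputStream(sock.getInputStream)
      val op = in.readInt()                 // 2. only then read the request
      // ... dispatch on `op` ...
    } finally {
      sock.close()
    }
  }
}
```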


---




[GitHub] spark pull request #22243: [MINOR] Avoid code duplication for nullable in Hi...

2018-08-27 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22243#discussion_r213029487
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
 ---
@@ -155,6 +155,8 @@ trait HigherOrderFunction extends Expression with 
ExpectsInputTypes {
  */
 trait SimpleHigherOrderFunction extends HigherOrderFunction  {
 
+  override def nullable: Boolean = argument.nullable
--- End diff --

This works too IMO; if others agree, I'll update with this suggestion, thanks.


---




[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...

2018-08-27 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22162#discussion_r213026874
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -815,6 +815,24 @@ class Dataset[T] private[sql](
 println(showString(numRows, truncate, vertical))
   // scalastyle:on println
 
+  /**
+   * Returns the default number of rows to show when the show function is 
called without
+   * a user specified max number of rows.
+   * @since 2.3.0
+   */
+  private def numberOfRowsToShow(): Int = {
--- End diff --

we shouldn't be adding methods here


---




[GitHub] spark issue #22244: [WIP][SPARK-24721][SPARK-25213][SQL] extract python UDF ...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22244
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22244: [WIP][SPARK-24721][SPARK-25213][SQL] extract python UDF ...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22244
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95292/
Test FAILed.


---




[GitHub] spark issue #22244: [WIP][SPARK-24721][SPARK-25213][SQL] extract python UDF ...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22244
  
**[Test build #95292 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95292/testReport)**
 for PR 22244 at commit 
[`f0e547c`](https://github.com/apache/spark/commit/f0e547c971f854b8a238baaebff8103036567223).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class ArrowEvalPython(udfs: Seq[PythonUDF], output: 
Seq[Attribute], child: LogicalPlan)`
  * `case class BatchEvalPython(udfs: Seq[PythonUDF], output: 
Seq[Attribute], child: LogicalPlan)`


---




[GitHub] spark pull request #22243: [MINOR] Avoid code duplication for nullable in Hi...

2018-08-27 Thread mn-mikke
Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/22243#discussion_r213022884
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
 ---
@@ -155,6 +155,8 @@ trait HigherOrderFunction extends Expression with 
ExpectsInputTypes {
  */
 trait SimpleHigherOrderFunction extends HigherOrderFunction  {
 
+  override def nullable: Boolean = argument.nullable
--- End diff --

If we moved the definition of ```nullable``` straight to 
```HigherOrderFunction``` as ```arguments.exists(_.nullable)```, we could also 
avoid the duplicities in ```ZipWith``` and ```MapZipWith```. WDYT?
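
A sketch of that suggestion, with the trait shapes simplified (not the actual Catalyst definitions):

```scala
// Simplified shapes only: defining nullable once on the base trait means
// SimpleHigherOrderFunction, ZipWith and MapZipWith need no overrides of their own.
trait Expression { def nullable: Boolean }

trait HigherOrderFunction extends Expression {
  def arguments: Seq[Expression]
  override def nullable: Boolean = arguments.exists(_.nullable)
}
```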


---




[GitHub] spark issue #22244: [WIP][SPARK-24721][SPARK-25213][SQL] extract python UDF ...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22244
  
**[Test build #95292 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95292/testReport)**
 for PR 22244 at commit 
[`f0e547c`](https://github.com/apache/spark/commit/f0e547c971f854b8a238baaebff8103036567223).


---




[GitHub] spark issue #22244: [WIP][SPARK-24721][SPARK-25213][SQL] extract python UDF ...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22244
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22244: [WIP][SPARK-24721][SPARK-25213][SQL] extract python UDF ...

2018-08-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22244
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2575/
Test PASSed.


---




[GitHub] spark issue #22244: [WIP][SPARK-24721][SPARK-25213][SQL] extract python UDF ...

2018-08-27 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22244
  
cc @icexelloss @HyukjinKwon @rdblue 

@icexelloss feel free to take this over and verify whether it passes the tests you added in #22104, thanks!


---




[GitHub] spark pull request #22244: [WIP][SPARK-24721][SPARK-25213][SQL] extract pyth...

2018-08-27 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/22244

[WIP][SPARK-24721][SPARK-25213][SQL] extract python UDF at the end of 
optimizer

## What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/12127 , we moved the 
`ExtractPythonUDFs` rule to the physical phase, while there was another option: 
do `ExtractPythonUDFs` at the end of optimizer.

Currently we hit 2 issues when extracting Python UDFs in the physical phase:
1. It happens after the data source v2 strategy, so the data source v2 strategy needs to deal with Python UDFs carefully and add a project to produce unsafe rows for the Python UDF. See https://github.com/apache/spark/pull/22206
2. It happens after the file source strategy, so we may keep a Python UDF as a data filter in `FileSourceScanExec` and fail the planner when trying to extract it later. See https://github.com/apache/spark/pull/22104

This PR proposes to move `ExtractPythonUDFs` to the end of optimizer.

## How was this patch tested?

TODO

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark python

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22244.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22244


commit f0e547c971f854b8a238baaebff8103036567223
Author: Wenchen Fan 
Date:   2018-08-27T15:40:18Z

extract python UDF at the end of optimizer




---




[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

2018-08-27 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/21976
  
Just an FYI, the other JIRA is https://issues.apache.org/jira/browse/SPARK-25250; it's related to a race with SPARK-23433.


---




[GitHub] spark issue #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in Pyspark

2018-08-27 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/21990
  
What I wanted was to just call the Scala methods, instead of having half the code in Scala and half in Python, but we create the JVM in the SparkContext creation code, so this ends up not being a good approach I think. We could just translate the rest of GetOrCreate into Python, but then every time the Scala code is patched it will need a Python mod as well.


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-27 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r213020789
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

As a follow-up to cloud-fan's point, I did a deep dive into the read path of Parquet Hive SerDe tables. The following is a rough invocation chain:

```
org.apache.spark.sql.hive.execution.HiveTableScanExec
org.apache.spark.sql.hive.HadoopTableReader (extends 
org.apache.spark.sql.hive.TableReader)
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat (extends 
org.apache.hadoop.mapred.FileInputFormat)
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper 
(extends org.apache.hadoop.mapred.RecordReader)
parquet.hadoop.ParquetRecordReader
parquet.hadoop.InternalParquetRecordReader
org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport (extends 
parquet.hadoop.api.ReadSupport)
```

Finally, `DataWritableReadSupport#getFieldTypeIgnoreCase` is invoked. 


https://github.com/JoshRosen/hive/blob/release-1.2.1-spark2/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L79-L95

This is why Parquet Hive SerDe tables always do case-insensitive field 
resolution. However, this is a class inside 
`org.spark-project.hive:hive-exec:1.2.1.spark2`.
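
For illustration, a rough sketch (hypothetical helper, not the actual Hive/Spark code) of case-insensitive field resolution with ambiguity detection, in the spirit of the migration note above:

```scala
// Resolve a metastore column name against Parquet field names, ignoring case.
def resolveField(metastoreName: String, parquetFields: Seq[String]): Option[String] =
  parquetFields.filter(_.equalsIgnoreCase(metastoreName)) match {
    case Seq()       => None          // unmatched: the column is read back as null
    case Seq(unique) => Some(unique)  // unique case-insensitive match wins
    case multiple    => throw new RuntimeException(
      s"Ambiguous reference '$metastoreName': ${multiple.mkString(", ")}")
  }
```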

I also found the related Hive JIRA ticket:
[HIVE-7554: Parquet Hive should resolve column names in case insensitive 
manner](https://issues.apache.org/jira/browse/HIVE-7554)

BTW:
* org.apache.hadoop.hive.ql = org.spark-project.hive:hive-exec:1.2.1.spark2
* parquet.hadoop = com.twitter:parquet-hadoop-bundle:1.6.0


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22241: [SPARK-25249][CORE][TEST]add a unit test for OpenHashMap

2018-08-27 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22241
  
@kiszk I guess it's because in this case the underlying value type is a 
primitive like int or long, so null can't be returned?
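
For what it's worth, a quick illustration of that point in plain Scala:

```scala
// Casting null to a primitive value type in Scala yields that type's zero
// value, so a map specialized on Int/Long values has no null to hand back.
val i = null.asInstanceOf[Int]   // 0, not null
val l = null.asInstanceOf[Long]  // 0L
```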


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22162: [spark-24442][SQL] Added parameters to control the defau...

2018-08-27 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/22162
  
sure, no worries @kiszk, I can take it if needed. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-27 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/22112#discussion_r213009399
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -33,6 +33,9 @@ import org.apache.spark.util.random.SamplingUtils
 /**
  * An object that defines how the elements in a key-value pair RDD are 
partitioned by key.
  * Maps each key to a partition ID, from 0 to `numPartitions - 1`.
+ *
+ * Note that, partitioner must be idempotent, i.e. it must return the same 
partition id given the
--- End diff --

I think you mean deterministic, not idempotent (which would mean that 
`partition(key) == partition(partition(key))`)
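
For illustration, a minimal sketch of the distinction (hypothetical partitioner, not code from this PR):

```scala
// Deterministic: the same key always maps to the same partition id on rerun.
def partition(key: Any, numPartitions: Int): Int =
  math.abs(key.hashCode % numPartitions)

// Idempotence is the different property partition(partition(k)) == partition(k).
// It happens to hold here because ids 0..numPartitions-1 map to themselves,
// but it says nothing about stability across reruns, which is the real concern.
val id = partition("someKey", 4)
assert(partition(id, 4) == id)
```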


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-27 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/22112#discussion_r213017779
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1865,6 +1871,62 @@ abstract class RDD[T: ClassTag](
   // RDD chain.
   @transient protected lazy val isBarrier_ : Boolean =
 dependencies.filter(!_.isInstanceOf[ShuffleDependency[_, _, 
_]]).exists(_.rdd.isBarrier())
+
+  /**
+   * Returns the random level of this RDD's output. Please refer to 
[[RandomLevel]] for the
+   * definition.
+   *
+   * By default, a reliably checkpointed RDD, or an RDD without parents (a root 
RDD), is IDEMPOTENT. For
+   * RDDs with parents, we will generate a random level candidate per 
parent according to the
+   * dependency. The random level of the current RDD is the random level 
candidate that is the most
+   * random. Please override [[getOutputRandomLevel]] to provide custom 
logic for calculating the output
+   * random level.
+   */
+  // TODO: make it public so users can set random level to their custom 
RDDs.
+  // TODO: this can be per-partition. e.g. UnionRDD can have different 
random level for different
+  // partitions.
+  private[spark] final lazy val outputRandomLevel: RandomLevel.Value = {
+if 
(checkpointData.exists(_.isInstanceOf[ReliableRDDCheckpointData[_]])) {
--- End diff --

hmm, so I took another look at the checkpoint code, and it doesn't seem to me like 
checkpointing will actually help.  IIUC, checkpointing doesn't 
actually take place until the *job* finishes, not just the stage:


https://github.com/apache/spark/blob/6193a202aab0271b4532ee4b740318290f2c44a1/core/src/main/scala/org/apache/spark/SparkContext.scala#L2061-L2063

So when you have a failure in the middle of a job with a long pipeline and 
you go back to an earlier stage, you're not actually going back to 
checkpointed data.

But maybe I'm reading this wrong?  This doesn't seem like what checkpointing 
_should_ be doing, actually ...
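
A small local-mode sketch of the timing in question (hypothetical checkpoint directory; the behavior noted in the comments is exactly what's under discussion here):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("ckpt-demo"))
sc.setCheckpointDir("/tmp/ckpt-demo")   // hypothetical directory

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint()   // only marks the RDD; nothing is written yet
rdd.count()        // first job: computes the RDD, checkpoint files written at job end
rdd.count()        // later jobs can read from the checkpointed data
sc.stop()
```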


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...

2018-08-27 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/22112#discussion_r213010846
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1918,3 +1980,19 @@ object RDD {
 new DoubleRDDFunctions(rdd.map(x => num.toDouble(x)))
   }
 }
+
+/**
+ * The random level of RDD's output (i.e. what `RDD#compute` returns), 
which indicates how the
+ * output will diff when Spark reruns the tasks for the RDD. There are 3 
random levels, ordered
+ * by the randomness from low to high:
+ * 1. IDEMPOTENT: The RDD output is always the same (including order) when 
rerun.
--- End diff --

here too, idempotent is the wrong word for this ... deterministic?  
partition-ordered? (I guess "ordered" could make it seem like the entire data 
is ordered ...)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22024: [SPARK-25034][CORE] Remove allocations in onBlockFetchSu...

2018-08-27 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/22024
  
retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22024: [SPARK-25034][CORE] Remove allocations in onBlock...

2018-08-27 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/22024#discussion_r213015113
  
--- Diff: 
core/src/main/scala/org/apache/spark/network/BlockTransferService.scala ---
@@ -101,15 +101,7 @@ abstract class BlockTransferService extends 
ShuffleClient with Closeable with Lo
   result.failure(exception)
 }
 override def onBlockFetchSuccess(blockId: String, data: 
ManagedBuffer): Unit = {
-  data match {
-case f: FileSegmentManagedBuffer =>
-  result.success(f)
-case _ =>
-  val ret = ByteBuffer.allocate(data.size.toInt)
--- End diff --

The copy behavior was introduced by: 
https://github.com/apache/spark/pull/2330/commits/69f5d0a2434396abbbd98886e047bc08a9e65565.
 How can you make sure this can be replaced by increasing the reference count?
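
For context, a sketch of the reference-counting idea using Netty's `ByteBuf` (the kind of buffer the Spark network layer builds on); the retain/release pairing below is illustrative, not this PR's actual code path:

```scala
import io.netty.buffer.Unpooled

val buf = Unpooled.wrappedBuffer(Array[Byte](1, 2, 3))  // refCnt == 1
buf.retain()    // consumer keeps the bytes alive without allocating a copy
buf.release()   // producer side is done; refCnt drops back to 1
// ... consumer reads the data ...
buf.release()   // refCnt == 0: the underlying memory can be reclaimed
```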



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22024: [SPARK-25034][CORE] Remove allocations in onBlock...

2018-08-27 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/22024#discussion_r213015245
  
--- Diff: 
core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala ---
@@ -160,7 +160,13 @@ private[spark] class TorrentBroadcast[T: 
ClassTag](obj: T, id: Long)
   releaseLock(pieceId)
 case None =>
   bm.getRemoteBytes(pieceId) match {
-case Some(b) =>
+case Some(splitB) =>
+
+  // Checksum computation and further computations require the 
data
+  // from the ChunkedByteBuffer to be merged, so we we merge 
it now.
--- End diff --

nit: there is a duplicated "we" in this comment.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22162: [spark-24442][SQL] Added parameters to control the defau...

2018-08-27 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22162
  
Would someone please take it over?
I have less bandwidth for the next two days since I will be in a training session 
at my office.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20637: [SPARK-23466][SQL] Remove redundant null checks i...

2018-08-27 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20637#discussion_r213013507
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala
 ---
@@ -223,8 +223,9 @@ trait ExpressionEvalHelper extends 
GeneratorDrivenPropertyChecks with PlanTestBa
   }
 } else {
   val lit = InternalRow(expected, expected)
+  val dtAsNullable = expression.dataType.asNullable
--- End diff --

@ueshin @cloud-fan Thank you for the good summary.

I think that this does not reduce test coverage.
This `dtAsNullable = expression.dataType.asNullable` is used only for 
generating `expected`. This `asNullable` does not change the `dataType` of 
`expression`. Thus, this does not change our optimization assumptions.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...

2018-08-27 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22162#discussion_r213010807
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -815,6 +815,24 @@ class Dataset[T] private[sql](
 println(showString(numRows, truncate, vertical))
   // scalastyle:on println
 
+  /**
+   * Returns the default number of rows to show when the show function is 
called without
+   * a user specified max number of rows.
+   * @since 2.3.0
+   */
+  private def numberOfRowsToShow(): Int = {
+this.sparkSession.conf.get("spark.sql.show.defaultNumRows", "20").toInt
+  }
+
+  /**
+   * Returns the default max characters per column to show before 
truncation when
+   * the show function is called with truncate.
+   * @since 2.3.0
--- End diff --

ditto


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...

2018-08-27 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22162#discussion_r213010706
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -815,6 +815,24 @@ class Dataset[T] private[sql](
 println(showString(numRows, truncate, vertical))
   // scalastyle:on println
 
+  /**
+   * Returns the default number of rows to show when the show function is 
called without
+   * a user specified max number of rows.
+   * @since 2.3.0
+   */
+  private def numberOfRowsToShow(): Int = {
+this.sparkSession.conf.get("spark.sql.show.defaultNumRows", "20").toInt
+  }
+
+  /**
+   * Returns the default max characters per column to show before 
truncation when
+   * the show function is called with truncate.
+   * @since 2.3.0
+   */
+  private def maxCharactersPerColumnToShow(): Int = {
--- End diff --

ditto


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...

2018-08-27 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22162#discussion_r213010879
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -815,6 +815,24 @@ class Dataset[T] private[sql](
 println(showString(numRows, truncate, vertical))
   // scalastyle:on println
 
+  /**
+   * Returns the default number of rows to show when the show function is 
called without
+   * a user specified max number of rows.
+   * @since 2.3.0
--- End diff --

not needed as this is private; moreover, it'd be since 2.4.0


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...

2018-08-27 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22162#discussion_r213010672
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -815,6 +815,24 @@ class Dataset[T] private[sql](
 println(showString(numRows, truncate, vertical))
   // scalastyle:on println
 
+  /**
+   * Returns the default number of rows to show when the show function is 
called without
+   * a user specified max number of rows.
+   * @since 2.3.0
+   */
+  private def numberOfRowsToShow(): Int = {
--- End diff --

I'd remove the `()`
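
i.e., a sketch of the suggested form (a toy config map stands in for `sparkSession.conf` here):

```scala
// A pure, parameterless accessor conventionally drops the parentheses in Scala.
class Demo(conf: Map[String, String]) {
  private def numberOfRowsToShow: Int =
    conf.getOrElse("spark.sql.show.defaultNumRows", "20").toInt
}
```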


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21990: [SPARK-25003][PYSPARK] Use SessionExtensions in P...

2018-08-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21990#discussion_r213010408
  
--- Diff: python/pyspark/sql/session.py ---
@@ -218,7 +218,9 @@ def __init__(self, sparkContext, jsparkSession=None):
 .sparkContext().isStopped():
 jsparkSession = 
self._jvm.SparkSession.getDefaultSession().get()
 else:
-jsparkSession = self._jvm.SparkSession(self._jsc.sc())
+jsparkSession = self._jvm.SparkSession.builder() \
--- End diff --

@RussellSpitzer, have you maybe had a chance to take a look and see if we 
can deduplicate some logic compared to Scala's `getOrCreate`? I am suggesting 
this since it now looks like the code path duplicates some logic there.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22162: [spark-24442][SQL] Added parameters to control the defau...

2018-08-27 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/22162
  
sure @HyukjinKwon, thanks for pinging me anyway.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22223: [SPARK-25233][Streaming] Give the user the option of spe...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/3
  
**[Test build #4295 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4295/testReport)**
 for PR 3 at commit 
[`2b0b1ce`](https://github.com/apache/spark/commit/2b0b1ce3876e2f55807156a98f75068280e03054).
 * This patch **fails MiMa tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22233: [SPARK-25240][SQL] Fix for a deadlock in RECOVER ...

2018-08-27 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22233#discussion_r213009058
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -671,7 +674,7 @@ case class AlterTableRecoverPartitionsCommand(
 val value = ExternalCatalogUtils.unescapePathName(ps(1))
 if (resolver(columnName, partitionNames.head)) {
   scanPartitions(spark, fs, filter, st.getPath, spec ++ 
Map(partitionNames.head -> value),
-partitionNames.drop(1), threshold, resolver)
+partitionNames.drop(1), threshold, resolver, 
listFilesInParallel = false)
--- End diff --

Thank you for attaching the stack trace. I have just looked at it. It looks 
strange to me. Every thread is `waiting for`; there is no blocker, and only one 
`locked` exists.
In a typical case, a deadlock occurs due to the existence of a blocker, as in the 
stack trace attached in #1

I will investigate it further tomorrow to decide whether we need to use this 
implementation or revert to the original implementation that uses 
Scala parallel collections.

```
...
- parking to wait for  <0x000793c0d610> (a 
scala.concurrent.impl.Promise$CompletionLatch)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:206)
at 
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:222)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at 
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at org.apache.spark.util.ThreadUtils$.parmap(ThreadUtils.scala:317)
at 
org.apache.spark.sql.execution.command.AlterTableRecoverPartitionsCommand.scanPartitions(ddl.scala:690)
at 
org.apache.spark.sql.execution.command.AlterTableRecoverPartitionsCommand.run(ddl.scala:626)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
- locked <0x000793b04e88> (a 
org.apache.spark.sql.execution.command.ExecutedCommandExec)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
...
```
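
For reference, a minimal, self-contained sketch (hypothetical pool size and fan-out, not the Spark code) of how blocking on child tasks inside a bounded pool can hang with every thread merely parked in `Await.result`:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2))

def scan(depth: Int): Int =
  if (depth == 0) 1
  else {
    val children = (1 to 2).map(_ => Future(scan(depth - 1)))
    children.map(Await.result(_, Duration.Inf)).sum  // parks a pool thread
  }

// scan(2) hangs: both pool threads end up parked in Await.result, waiting on
// depth-0 children that no free thread is left to run.
```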


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22226: [SPARK-24391][SQL] Support arrays of any types by to_jso...

2018-08-27 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/6
  
**[Test build #95291 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95291/testReport)**
 for PR 6 at commit 
[`906a301`](https://github.com/apache/spark/commit/906a3013e97f8e1d1f8f7e0e335f7404de47b582).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22162: [spark-24442][SQL] Added parameters to control the defau...

2018-08-27 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22162
  
@viirya, @kiszk, @mgaido91 and @maropu, would you be interested in taking 
this over if it stays inactive for a few more days?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22162: [spark-24442][SQL] Added parameters to control the defau...

2018-08-27 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22162
  
ping @AndrewKL 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


