[GitHub] spark issue #21366: [SPARK-24248][K8S] Use the Kubernetes API to populate an...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21366
  
**[Test build #91273 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91273/testReport)**
 for PR 21366 at commit 
[`b30ed39`](https://github.com/apache/spark/commit/b30ed39ebecc72cadfc9ec20b135d60f618762a4).


---




[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21456
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #21450: [SPARK-24319][SPARK SUBMIT] Fix spark-submit execution w...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21450
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91262/
Test PASSed.


---




[GitHub] spark issue #21450: [SPARK-24319][SPARK SUBMIT] Fix spark-submit execution w...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21450
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21450: [SPARK-24319][SPARK SUBMIT] Fix spark-submit execution w...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21450
  
**[Test build #91262 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91262/testReport)**
 for PR 21450 at commit 
[`a69850b`](https://github.com/apache/spark/commit/a69850b6fdcbe2e234e70a597d9ad6beae6a6937).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-29 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/21456
  
Ok to test


---




[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-29 Thread countmdm
Github user countmdm commented on the issue:

https://github.com/apache/spark/pull/21456
  
Yes.

On Tue, May 29, 2018 at 1:18 PM, UCB AMPLab wrote:

> Can one of the admins verify this patch?



---




[GitHub] spark pull request #21366: [SPARK-24248][K8S] Use the Kubernetes API to popu...

2018-05-29 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/21366#discussion_r191559897
  
--- Diff: pom.xml ---
@@ -760,6 +760,12 @@
 1.10.19
 test
   
+  
--- End diff --

Is it necessary to include the license file if the dependent project 
(RxJava) is also licensed under Apache 2.0?


---




[GitHub] spark issue #21456: [SPARK-24356] [CORE] Duplicate strings in File.path mana...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21456
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #21456: [SPARK-24356] [CORE] Duplicate strings in File.pa...

2018-05-29 Thread countmdm
GitHub user countmdm opened a pull request:

https://github.com/apache/spark/pull/21456

[SPARK-24356] [CORE] Duplicate strings in File.path managed by 
FileSegmentManagedBuffer

This patch eliminates duplicate strings that come from the 'path' field of
java.io.File objects created by FileSegmentManagedBuffer. That is, we want
to avoid the situation when multiple File instances for the same pathname
"foo/bar" are created, each with a separate copy of the "foo/bar" String
instance. In some scenarios such duplicate strings may waste a lot of memory
(~ 10% of the heap). To avoid that, we intern the pathname with
String.intern(), and before that we make sure that it's in a normalized
form (contains no "//", "///" etc.) Otherwise, the code in java.io.File
would normalize it later, creating a new "foo/bar" String copy.
Unfortunately, the normalization code that java.io.File uses internally
is in the package-private class java.io.FileSystem, so we cannot call it
here directly.

## What changes were proposed in this pull request?

Added code to ExternalShuffleBlockResolver.getFile() that normalizes and 
then interns the pathname string before passing it to the File() constructor.
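
A minimal Scala sketch of the normalize-then-intern idea (the helper name is
illustrative, not the actual Java code in ExternalShuffleBlockResolver):

```scala
import java.io.File

// Sketch: collapse repeated separators before interning, so the String we
// intern is the same one java.io.File would otherwise rebuild internally.
// Assumes '/' as the separator for simplicity.
def dedupedFile(parent: String, child: String): File = {
  val normalized = s"$parent/$child".replaceAll("/{2,}", "/")
  // intern() returns the canonical copy, so all Files for the same pathname
  // share one String instance instead of holding duplicates.
  new File(normalized.intern())
}
```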

## How was this patch tested?

Added unit test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/countmdm/spark misha/spark-24356

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21456.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21456


commit 0df2607dfe79917f89652ec395b5e66f6c40f17b
Author: Misha Dmitriev 
Date:   2018-05-29T18:56:41Z

[SPARK-24356] [CORE] Duplicate strings in File.path managed by 
FileSegmentManagedBuffer

This patch eliminates duplicate strings that come from the 'path' field of
java.io.File objects created by FileSegmentManagedBuffer. That is, we want
to avoid the situation when multiple File instances for the same pathname
"foo/bar" are created, each with a separate copy of the "foo/bar" String
instance. In some scenarios such duplicate strings may waste a lot of memory
(~ 10% of the heap). To avoid that, we intern the pathname with
String.intern(), and before that we make sure that it's in a normalized
form (contains no "//", "///" etc.) Otherwise, the code in java.io.File
would normalize it later, creating a new "foo/bar" String copy.
Unfortunately, the normalization code that java.io.File uses internally
is in the package-private class java.io.FileSystem, so we cannot call it
here directly.

Added unit test




---




[GitHub] spark issue #21403: [SPARK-24341][WIP][SQL] Support IN subqueries with struc...

2018-05-29 Thread juliuszsompolski
Github user juliuszsompolski commented on the issue:

https://github.com/apache/spark/pull/21403
  
@mgaido91 BTW: In SPARK-24395 I would consider the cases to still be valid, 
because I believe there is no other syntactic way to do a multi-column IN/NOT 
IN with list of literals.
The question is whether it should be treated as structs, or unpacked?
If like structs, then the current behavior is correct, I think.


---




[GitHub] spark issue #21440: [SPARK-24307][CORE] Support reading remote cached partit...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21440
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91259/
Test PASSed.


---




[GitHub] spark issue #21440: [SPARK-24307][CORE] Support reading remote cached partit...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21440
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21440: [SPARK-24307][CORE] Support reading remote cached partit...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21440
  
**[Test build #91259 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91259/testReport)**
 for PR 21440 at commit 
[`a9cfe29`](https://github.com/apache/spark/commit/a9cfe294b15b2c9675d074645865fa403285d1d2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21455: [SPARK-24093][DStream][Minor]Make some fields of KafkaSt...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21455
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #21455: [SPARK-24093][DStream][Minor]Make some fields of KafkaSt...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21455
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #21455: [SPARK-24093][DStream][Minor]Make some fields of ...

2018-05-29 Thread merlintang
GitHub user merlintang opened a pull request:

https://github.com/apache/spark/pull/21455

[SPARK-24093][DStream][Minor] Make some fields of KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes

## What changes were proposed in this pull request?

This PR is created to make the relevant fields of KafkaStreamWriter and 
InternalRowMicroBatchWriter visible outside of the classes.

## How was this patch tested?
manual tests



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/merlintang/spark "Spark-24093"

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21455.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21455


commit 6233528063996dabe780d5b04f874f22846e40d4
Author: Mingjie Tang 
Date:   2018-05-29T19:49:17Z

[SPARK-24093][DStream][Minor]Make some fields of 
KafkaStreamWriter/InternalRowMicroBatchWriter visible to outside of the classes




---




[GitHub] spark issue #21437: [SPARK-24397][PYSPARK] Added TaskContext.getLocalPropert...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21437
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21437: [SPARK-24397][PYSPARK] Added TaskContext.getLocalPropert...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21437
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3681/
Test PASSed.


---




[GitHub] spark issue #21346: [SPARK-6237][NETWORK] Network-layer changes to allow str...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21346
  
**[Test build #4190 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4190/testReport)**
 for PR 21346 at commit 
[`7bd1b43`](https://github.com/apache/spark/commit/7bd1b43c81a3cdd7b88cf64994cfe8f2b3c5fdf8).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---




[GitHub] spark issue #21437: [SPARK-24397][PYSPARK] Added TaskContext.getLocalPropert...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21437
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21437: [SPARK-24397][PYSPARK] Added TaskContext.getLocalPropert...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21437
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91272/
Test FAILed.


---




[GitHub] spark issue #21437: [SPARK-24397][PYSPARK] Added TaskContext.getLocalPropert...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21437
  
**[Test build #91272 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91272/testReport)**
 for PR 21437 at commit 
[`e197460`](https://github.com/apache/spark/commit/e197460cf56c42d164da5955eaa62d4f80cf033e).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21451
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3680/
Test PASSed.


---




[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21451
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21437: [SPARK-24397][PYSPARK] Added TaskContext.getLocalPropert...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21437
  
**[Test build #91272 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91272/testReport)**
 for PR 21437 at commit 
[`e197460`](https://github.com/apache/spark/commit/e197460cf56c42d164da5955eaa62d4f80cf033e).


---




[GitHub] spark pull request #21437: [SPARK-24397][PYSPARK] Added TaskContext.getLocal...

2018-05-29 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/21437#discussion_r191548734
  
--- Diff: python/pyspark/tests.py ---
@@ -543,6 +543,15 @@ def test_tc_on_driver(self):
 tc = TaskContext.get()
 self.assertTrue(tc is None)
 
+def test_get_local_property(self):
+"""Verify that local properties set on the driver are available in 
TaskContext."""
+self.sc.setLocalProperty("testkey", "testvalue")
--- End diff --

there isn't a clear way to clear a local property in the spark context? what 
do you propose is the right approach here? set the local property to "None"?
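
For reference, a sketch of one way to restore state in such a test: on the
Scala side, SparkContext.setLocalProperty removes the key when passed null
(the Python API presumably forwards None the same way):

```scala
sc.setLocalProperty("testkey", "testvalue")
try {
  // ... assertions that rely on the property ...
} finally {
  sc.setLocalProperty("testkey", null)  // passing null removes the property
}
```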


---




[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21451
  
**[Test build #91271 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91271/testReport)**
 for PR 21451 at commit 
[`68c5d5f`](https://github.com/apache/spark/commit/68c5d5f5f60da7cbc0ce356acd8e5ab31db70ea5).


---




[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...

2018-05-29 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21454
  
cc @zsxwing @jiangxb1987 


---




[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21454
  
**[Test build #91270 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91270/testReport)**
 for PR 21454 at commit 
[`f198f28`](https://github.com/apache/spark/commit/f198f28b1a3d7380a09e5687438a264101cc6965).


---




[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...

2018-05-29 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21454
  
ok to test


---




[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...

2018-05-29 Thread PenguinToast
Github user PenguinToast commented on the issue:

https://github.com/apache/spark/pull/21454
  
@gatorsmile Can you take a look at this?


---




[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21454
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #21454: [SPARK-24337][Core] Improve error messages for Spark con...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21454
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #21454: [SPARK-24337][Core] Improve error messages for Sp...

2018-05-29 Thread PenguinToast
GitHub user PenguinToast opened a pull request:

https://github.com/apache/spark/pull/21454

[SPARK-24337][Core] Improve error messages for Spark conf values

## What changes were proposed in this pull request?

Improve the exception messages when retrieving Spark conf values to include 
the key name when the value is invalid.

## How was this patch tested?

Unit tests for all get* operations in SparkConf that require a specific 
value format
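
A sketch of the idea (hypothetical helper, not necessarily the PR's exact
change): wrap value parsing so the exception names the offending key.

```scala
import org.apache.spark.SparkConf

// Sketch: re-throw parse failures with the config key in the message.
def getTimeAsSecondsSafe(conf: SparkConf, key: String): Long = {
  try {
    conf.getTimeAsSeconds(key)
  } catch {
    case e: NumberFormatException =>
      throw new IllegalArgumentException(
        s"Spark config key '$key' has an invalid value: '${conf.get(key)}'", e)
  }
}
```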


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/PenguinToast/spark 
SPARK-24337-spark-config-errors

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21454.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21454


commit f198f28b1a3d7380a09e5687438a264101cc6965
Author: William Sheu 
Date:   2018-05-29T18:35:25Z

[SPARK-24337][Core] Improve error messages for Spark conf values




---




[GitHub] spark pull request #21428: [SPARK-24235][SS] Implement continuous shuffle wr...

2018-05-29 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/21428#discussion_r191529007
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/shuffle/ContinuousShuffleSuite.scala
 ---
@@ -40,22 +60,129 @@ class ContinuousShuffleReadSuite extends StreamTest {
 messages.foreach(endpoint.askSync[Unit](_))
   }
 
-  // In this unit test, we emulate that we're in the task thread where
-  // ContinuousShuffleReadRDD.compute() will be evaluated. This requires a 
task context
-  // thread local to be set.
-  var ctx: TaskContextImpl = _
+  private def readRDDEndpoint(rdd: ContinuousShuffleReadRDD) = {
+rdd.partitions(0).asInstanceOf[ContinuousShuffleReadPartition].endpoint
+  }
 
-  override def beforeEach(): Unit = {
-super.beforeEach()
-ctx = TaskContext.empty()
-TaskContext.setTaskContext(ctx)
+  private def readEpoch(rdd: ContinuousShuffleReadRDD) = {
+rdd.compute(rdd.partitions(0), ctx).toSeq.map(_.getInt(0))
   }
 
-  override def afterEach(): Unit = {
-ctx.markTaskCompleted(None)
-TaskContext.unset()
-ctx = null
-super.afterEach()
+  test("one epoch") {
--- End diff --

nit: I generally put the simplest tests first (likely the reader tests, since 
they don't depend on the writer) and the more complex, e2e-ish tests later 
(the writer tests, since they need readers).


---




[GitHub] spark pull request #21428: [SPARK-24235][SS] Implement continuous shuffle wr...

2018-05-29 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/21428#discussion_r191528142
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/ContinuousShuffleWriter.scala
 ---
@@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming.continuous.shuffle
+
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow
+
+/**
+ * Trait for writing to a continuous processing shuffle.
+ */
+trait ContinuousShuffleWriter {
+  def write(epoch: Iterator[UnsafeRow]): Unit
--- End diff --

I see. That's fair.


---




[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21453
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #21453: Test branch to see how Scala 2.11.12 performs

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21453
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #20701: [SPARK-23528][ML] Add numIter to ClusteringSummary

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20701
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20701: [SPARK-23528][ML] Add numIter to ClusteringSummary

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20701
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91256/
Test PASSed.


---




[GitHub] spark issue #20701: [SPARK-23528][ML] Add numIter to ClusteringSummary

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20701
  
**[Test build #91256 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91256/testReport)**
 for PR 20701 at commit 
[`e2f68ac`](https://github.com/apache/spark/commit/e2f68ac612227aaafa809ad5f5074d1984aa907e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21453: Test branch to see how Scala 2.11.12 performs

2018-05-29 Thread RussellSpitzer
GitHub user RussellSpitzer opened a pull request:

https://github.com/apache/spark/pull/21453

Test branch to see how Scala 2.11.12 performs

This may be useful when Java 8 is no longer supported since
Scala 2.11.12 supports later versions of Java

## What changes were proposed in this pull request?

Change Scala Build Version to 2.11.12.

## How was this patch tested?

This PR is made to run Scala 2.11.12 through Jenkins to see whether or not 
it passes cleanly.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/RussellSpitzer/spark Scala2_11_12_Test

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21453.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21453


commit 90d3842616aec94d603a68d44463eb043c5a66f9
Author: Russell Spitzer 
Date:   2018-05-29T18:17:23Z

Test branch to see how Scala 2.11.12 performs

This may be useful when Java 8 is no longer supported since
Scala 2.11.12 supports later versions of Java




---




[GitHub] spark issue #21068: [SPARK-16630][YARN] Blacklist a node if executors won't ...

2018-05-29 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/21068
  
so specifically on the limit, I'm ok with removing it as long as we have 
the basic check to fail. I guess perhaps you are saying the limit and that 
check are essentially the same thing? I was thinking that they were different 
in that if you remove the limit from yarn, then the driver and UI side wouldn't 
get out of sync, since the only thing the yarn side would do is fail if it hit 
the condition that all nodes are blacklisted. If you leave the limit as is, 
like you mention, it could be a bit confusing to the user, as it could acquire 
an executor on a node that was blacklisted but that the yarn side doesn't 
enforce due to the limit.




---




[GitHub] spark pull request #21068: [SPARK-16630][YARN] Blacklist a node if executors...

2018-05-29 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/21068#discussion_r191520704
  
--- Diff: 
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala 
---
@@ -328,4 +328,19 @@ package object config {
 CACHED_FILES_TYPES,
 CACHED_CONF_ARCHIVE)
 
+  /* YARN allocator-level blacklisting related config entries. */
+  private[spark] val YARN_EXECUTOR_LAUNCH_BLACKLIST_ENABLED =
+
ConfigBuilder("spark.yarn.blacklist.executor.launch.blacklisting.enabled")
+  .booleanConf
+  .createOptional
--- End diff --

why createOptional vs createWithDefault? Depending on what we decide for the 
limit, worst case is we just have a default of false.
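
For context, a sketch of the two ConfigBuilder styles being compared (the
entry names here are illustrative):

```scala
// createOptional: reading the entry yields Option[Boolean] (None if unset).
private[spark] val BLACKLIST_ENABLED_OPT =
  ConfigBuilder("spark.yarn.blacklist.executor.launch.blacklisting.enabled")
    .booleanConf
    .createOptional

// createWithDefault: reading the entry always yields a Boolean.
private[spark] val BLACKLIST_ENABLED_DEFAULTED =
  ConfigBuilder("spark.yarn.blacklist.executor.launch.blacklisting.enabled")
    .booleanConf
    .createWithDefault(false)
```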


---




[GitHub] spark issue #21442: [SPARK-24402] [SQL] Optimize `In` expression when only o...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21442
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21442: [SPARK-24402] [SQL] Optimize `In` expression when only o...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21442
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91258/
Test FAILed.


---




[GitHub] spark issue #21442: [SPARK-24402] [SQL] Optimize `In` expression when only o...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21442
  
**[Test build #91258 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91258/testReport)**
 for PR 21442 at commit 
[`7a354fc`](https://github.com/apache/spark/commit/7a354fcd154ec2d8f88a5c1fbf1bd75fdb15ec49).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Data Source write benchmark

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21409
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Data Source write benchmark

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21409
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3679/
Test PASSed.


---




[GitHub] spark issue #21403: [SPARK-24341][WIP][SQL] Support IN subqueries with struc...

2018-05-29 Thread juliuszsompolski
Github user juliuszsompolski commented on the issue:

https://github.com/apache/spark/pull/21403
  
@mgaido91 This also works, +1.
What about `a in (select (b, c) from ...)` when `a` is a struct? I guess we 
should allow it, but it's a potential gotcha during implementation.


---




[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Data Source write benchmark

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21409
  
**[Test build #91269 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91269/testReport)**
 for PR 21409 at commit 
[`e90fa00`](https://github.com/apache/spark/commit/e90fa00e8963eb985bdd30d9a262c61f6ca1ce61).


---




[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21451
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3678/
Test PASSed.


---




[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21451
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21451
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21451
  
**[Test build #91268 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91268/testReport)**
 for PR 21451 at commit 
[`6ca6f8d`](https://github.com/apache/spark/commit/6ca6f8dc0dd7962194fc53e6cc9945a2f38e20dc).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class UploadBlockStream extends BlockTransferMessage `


---




[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21451
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91268/
Test FAILed.


---




[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21451
  
**[Test build #91268 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91268/testReport)**
 for PR 21451 at commit 
[`6ca6f8d`](https://github.com/apache/spark/commit/6ca6f8dc0dd7962194fc53e6cc9945a2f38e20dc).


---




[GitHub] spark pull request #21437: [SPARK-24397][PYSPARK] Added TaskContext.getLocal...

2018-05-29 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21437#discussion_r191509834
  
--- Diff: python/pyspark/tests.py ---
@@ -543,6 +543,15 @@ def test_tc_on_driver(self):
 tc = TaskContext.get()
 self.assertTrue(tc is None)
 
+def test_get_local_property(self):
+"""Verify that local properties set on the driver are available in 
TaskContext."""
+self.sc.setLocalProperty("testkey", "testvalue")
--- End diff --

We should clean up the property after the test to avoid affecting other 
tests?


---




[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

2018-05-29 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/21427#discussion_r191511343
  
--- Diff: python/pyspark/worker.py ---
@@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
 "Number of columns of the returned pandas.DataFrame "
 "doesn't match specified schema. "
 "Expected: {} Actual: {}".format(len(return_type), 
len(result.columns)))
-arrow_return_types = (to_arrow_type(field.dataType) for field in 
return_type)
-return [(result[result.columns[i]], arrow_type)
-for i, arrow_type in enumerate(arrow_return_types)]
+try:
+# Assign result columns by schema name
+return [(result[field.name], to_arrow_type(field.dataType)) 
for field in return_type]
+except KeyError:
+if all(not isinstance(name, basestring) for name in 
result.columns):
+# Assign result columns by position if they are not named 
with strings
+return [(result[result.columns[i]], 
to_arrow_type(field.dataType))
+for i, field in enumerate(return_type)]
+else:
+raise
--- End diff --

@viirya I think that it's just that it is very common for users to create a 
DataFrame with a dict using names as keys, without knowing that this can change 
the order of columns. So even if the field types all match (in the case of this 
JIRA they were all StringTypes), there could be a mix-up between the data and 
column names. This is really confusing and hard to figure out from the user's 
perspective.

When defining the pandas_udf, the return type requires the field names, so 
if the returned DataFrame has columns indexed by strings, I think it's fair to 
assume that if they do not match it was a mistake.  If the user wants to use 
positional columns, they can index by integers - and I'll add this to the docs.

That being said, I do suppose that this slightly changes the behavior if by 
chance the user had gone out of the way to make a pandas_udf by specifying 
columns with different names than the return type schema, but still with the 
same field type order.  That seems pretty unlikely to me though.


---




[GitHub] spark issue #21450: [SPARK-24319][SPARK SUBMIT] Fix spark-submit execution w...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21450
  
**[Test build #91267 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91267/testReport)**
 for PR 21450 at commit 
[`a69850b`](https://github.com/apache/spark/commit/a69850b6fdcbe2e234e70a597d9ad6beae6a6937).


---




[GitHub] spark issue #21450: [SPARK-24319][SPARK SUBMIT] Fix spark-submit execution w...

2018-05-29 Thread gaborgsomogyi
Github user gaborgsomogyi commented on the issue:

https://github.com/apache/spark/pull/21450
  
test this please


---




[GitHub] spark pull request #21416: [SPARK-24371] [SQL] Added isInCollection in DataF...

2018-05-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21416


---




[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isInCollection in DataFrame AP...

2018-05-29 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/21416
  
Merged into master. Thank you everyone for reviewing.

A follow-up PR will be created for:

1. Adding tests in Java.
2. Adding docs about automagical type casting.
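
A short usage sketch of the API as merged (the Java-collection variant is
what the follow-up tests would cover natively in Java):

```scala
import spark.implicits._  // assumes an active SparkSession named `spark`

val df = Seq((1, "x"), (2, "y"), (3, "z")).toDF("a", "b")
// isInCollection accepts a Scala Iterable of values.
df.filter($"a".isInCollection(Seq(1, 3))).show()
```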


---




[GitHub] spark issue #21452: [MINOR][CORE] Log committer class used by HadoopMapRedCo...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21452
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #21416: [SPARK-24371] [SQL] Added isInCollection in DataF...

2018-05-29 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/21416#discussion_r191506417
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
@@ -786,6 +787,24 @@ class Column(val expr: Expression) extends Logging {
   @scala.annotation.varargs
   def isin(list: Any*): Column = withExpr { In(expr, 
list.map(lit(_).expr)) }
--- End diff --

+1 Let's do it in the followup PR. Thanks.


---




[GitHub] spark issue #21452: [MINOR][CORE] Log committer class used by HadoopMapRedCo...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21452
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21451
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91266/
Test FAILed.


---




[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21451
  
**[Test build #91266 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91266/testReport)**
 for PR 21451 at commit 
[`7e517e4`](https://github.com/apache/spark/commit/7e517e4ea0ff66dc57121b54fdd71f8391edd8f2).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class UploadBlockStream extends BlockTransferMessage `


---




[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21451
  
Merged build finished. Test FAILed.


---




[GitHub] spark pull request #21416: [SPARK-24371] [SQL] Added isInCollection in DataF...

2018-05-29 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/21416#discussion_r191505830
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala ---
@@ -390,11 +394,67 @@ class ColumnExpressionSuite extends QueryTest with 
SharedSQLContext {
 checkAnswer(df.filter($"b".isin("z", "y")),
   df.collect().toSeq.filter(r => r.getString(1) == "z" || 
r.getString(1) == "y"))
 
+// Auto casting should work with mixture of different types in 
collections
+checkAnswer(df.filter($"a".isin(1.toShort, "2")),
+  df.collect().toSeq.filter(r => r.getInt(0) == 1 || r.getInt(0) == 2))
+checkAnswer(df.filter($"a".isin("3", 2.toLong)),
+  df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 2))
+checkAnswer(df.filter($"a".isin(3, "1")),
+  df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 1))
+
 val df2 = Seq((1, Seq(1)), (2, Seq(2)), (3, Seq(3))).toDF("a", "b")
 
-intercept[AnalysisException] {
+val e = intercept[AnalysisException] {
   df2.filter($"a".isin($"b"))
 }
+Seq("cannot resolve", "due to data type mismatch: Arguments must be 
same type but were")
+  .foreach { s =>
+
assert(e.getMessage.toLowerCase(Locale.ROOT).contains(s.toLowerCase(Locale.ROOT)))
+  }
+  }
+
+  test("isInCollection: Scala Collection") {
+val df = Seq((1, "x"), (2, "y"), (3, "z")).toDF("a", "b")
+// Test with different types of collections
+checkAnswer(df.filter($"a".isInCollection(Seq(3, 1))),
+  df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 1))
+checkAnswer(df.filter($"a".isInCollection(Seq(1, 2).toSet)),
+  df.collect().toSeq.filter(r => r.getInt(0) == 1 || r.getInt(0) == 2))
+checkAnswer(df.filter($"a".isInCollection(Seq(3, 2).toArray)),
+  df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 2))
+checkAnswer(df.filter($"a".isInCollection(Seq(3, 1).toList)),
+  df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 1))
+
+val df2 = Seq((1, Seq(1)), (2, Seq(2)), (3, Seq(3))).toDF("a", "b")
+
+val e = intercept[AnalysisException] {
+  df2.filter($"a".isInCollection(Seq($"b")))
+}
+Seq("cannot resolve", "due to data type mismatch: Arguments must be 
same type but were")
+  .foreach { s =>
+
assert(e.getMessage.toLowerCase(Locale.ROOT).contains(s.toLowerCase(Locale.ROOT)))
+  }
+  }
+
+  test("isInCollection: Java Collection") {
--- End diff --

I totally agree with you that we should have tests natively in Java, instead 
of converting the types to Java in Scala and hoping for the best that it will 
work in Java. Let's do it in the follow-up PR.


---




[GitHub] spark pull request #21452: [MINOR][CORE] Log committer class used by HadoopM...

2018-05-29 Thread ejono
GitHub user ejono opened a pull request:

https://github.com/apache/spark/pull/21452

[MINOR][CORE] Log committer class used by HadoopMapRedCommitProtocol

## What changes were proposed in this pull request?

When HadoopMapRedCommitProtocol is used (e.g., when using saveAsTextFile() 
or
saveAsHadoopFile() with RDDs), it's not easy to determine which output 
committer
class was used, so this PR simply logs the class that was used, similarly 
to what
is done in SQLHadoopMapReduceCommitProtocol.
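
The essence of the change is a single log line mirroring
SQLHadoopMapReduceCommitProtocol's message format; a sketch (not the exact
diff):

```scala
import org.apache.spark.internal.Logging

object CommitterLogSketch extends Logging {
  // Log which OutputCommitter implementation ended up being used.
  def logCommitterClass(committer: AnyRef): Unit =
    logInfo(s"Using output committer class ${committer.getClass.getCanonicalName}")
}
```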

## How was this patch tested?

Built Spark then manually inspected logging when calling saveAsTextFile():

```scala
scala> sc.setLogLevel("INFO")
scala> sc.textFile("README.md").saveAsTextFile("/tmp/out")
...
18/05/29 10:06:20 INFO HadoopMapRedCommitProtocol: Using output committer 
class org.apache.hadoop.mapred.FileOutputCommitter
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ejono/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21452.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21452


commit 9881d9c6a2b1d56e69bb06ee27fd8706f6e0fe43
Author: Jonathan Kelly 
Date:   2018-05-29T16:36:02Z

[MINOR][CORE] Log committer class used by HadoopMapRedCommitProtocol

When HadoopMapRedCommitProtocol is used (e.g., when using saveAsTextFile() 
or
saveAsHadoopFile() with RDDs), it's not easy to determine which output 
committer
class was used, so this PR simply logs the class that was used, similarly 
to what
is done in SQLHadoopMapReduceCommitProtocol.

Built Spark then manually inspected logging when calling saveAsTextFile():

```scala
scala> sc.setLogLevel("INFO")
scala> sc.textFile("README.md").saveAsTextFile("/tmp/out")
...
18/05/29 10:06:20 INFO HadoopMapRedCommitProtocol: Using output committer 
class org.apache.hadoop.mapred.FileOutputCommitter
```




---




[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21451
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21451
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3677/
Test PASSed.


---




[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

2018-05-29 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/21427#discussion_r191503646
  
--- Diff: python/pyspark/worker.py ---
@@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
 "Number of columns of the returned pandas.DataFrame "
 "doesn't match specified schema. "
 "Expected: {} Actual: {}".format(len(return_type), 
len(result.columns)))
-arrow_return_types = (to_arrow_type(field.dataType) for field in 
return_type)
-return [(result[result.columns[i]], arrow_type)
-for i, arrow_type in enumerate(arrow_return_types)]
+try:
+# Assign result columns by schema name
+return [(result[field.name], to_arrow_type(field.dataType)) 
for field in return_type]
+except KeyError:
+if all(not isinstance(name, basestring) for name in 
result.columns):
+# Assign result columns by position if they are not named 
with strings
+return [(result[result.columns[i]], 
to_arrow_type(field.dataType))
+for i, field in enumerate(return_type)]
+else:
+raise
--- End diff --

I think when the user specifies column names explicitly on the returned 
pd.DataFrame but they don't match the schema, it's most likely a bug / typo, 
so throwing an exception makes sense to me.


---




[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

2018-05-29 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/21427#discussion_r191502476
  
--- Diff: python/pyspark/worker.py ---
@@ -111,9 +114,16 @@ def wrapped(key_series, value_series):
 "Number of columns of the returned pandas.DataFrame "
 "doesn't match specified schema. "
 "Expected: {} Actual: {}".format(len(return_type), 
len(result.columns)))
-arrow_return_types = (to_arrow_type(field.dataType) for field in 
return_type)
-return [(result[result.columns[i]], arrow_type)
-for i, arrow_type in enumerate(arrow_return_types)]
+try:
+# Assign result columns by schema name
+return [(result[field.name], to_arrow_type(field.dataType)) 
for field in return_type]
+except KeyError:
+if all(not isinstance(name, basestring) for name in 
result.columns):
--- End diff --

Yeah, we still need to check for the possibility that python 2 uses unicode.


---




[GitHub] spark pull request #21366: [SPARK-24248][K8S] Use the Kubernetes API to popu...

2018-05-29 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/21366#discussion_r191502285
  
--- Diff: pom.xml ---
@@ -760,6 +760,12 @@
 1.10.19
 test
   
+  
--- End diff --

> I think I'm a bit concerned adding rxjava to the top level pom and to dev/deps/spark-deps-hadoop-*
> can it be just a `0.8.0` thing and not a dependency?

Unsure what you mean here - we're using rxjava itself specifically to do 
the event handling in this PR. See 
https://github.com/apache/spark/pull/21366/files#diff-ae4cd884779fb4c3db58958ab984db59R40.
If we wanted an alternative approach, we could build something from first 
principles (an executor service / manual linked blocking queues), but I like 
the elegance that rx-java buys us here. The code we'd save by not building it 
ourselves seems worthwhile.
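
A rough sketch of what rx-java buys here (assumed RxJava 1.x usage, not the
PR's actual code): a PublishSubject plus observeOn stands in for a hand-rolled
blocking queue and polling thread.

```scala
import java.util.concurrent.Executors
import rx.functions.Action1
import rx.schedulers.Schedulers
import rx.subjects.PublishSubject

val executor = Executors.newSingleThreadExecutor()
val events = PublishSubject.create[String]()

events
  .observeOn(Schedulers.from(executor))  // deliver events off the caller thread
  .subscribe(new Action1[String] {
    override def call(event: String): Unit = println(s"handling $event")
  })

events.onNext("pod-added")
events.onNext("pod-deleted")
```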


---




[GitHub] spark pull request #21427: [SPARK-24324][PYTHON] Pandas Grouped Map UDF shou...

2018-05-29 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/21427#discussion_r191502180
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -4931,6 +4931,63 @@ def foo3(key, pdf):
 expected4 = udf3.func((), pdf)
 self.assertPandasEqual(expected4, result4)
 
+def test_column_order(self):
+import pandas as pd
+from pyspark.sql.functions import pandas_udf, PandasUDFType
+df = self.data
+
+# Function returns a pdf with required column names, but order 
could be arbitrary using dict
+def change_col_order(pdf):
+# Constructing a DataFrame from a dict should result in the 
same order,
+# but use from_items to ensure the pdf column order is 
different than schema
+return pd.DataFrame.from_items([
+('id', pdf.id),
+('u', pdf.v * 2),
+('v', pdf.v)])
+
+ordered_udf = pandas_udf(
+change_col_order,
+'id long, v int, u int',
+PandasUDFType.GROUPED_MAP
+)
+
+def positional_col_order(pdf):
--- End diff --

yeah, I'll add a test for an integer index. I don't think we need to 
explicitly support only string or int. Only if the index is not string-based 
will position be used.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/21451
  
This is the change for SPARK-24296, on top of 
https://github.com/apache/spark/pull/21346 and 
https://github.com/apache/spark/pull/21440

Posting here for testing. Reviews are welcome on this commit, which has just 
the relevant changes: 
https://github.com/apache/spark/pull/21451/commits/7e517e4ea0ff66dc57121b54fdd71f8391edd8f2


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21451
  
**[Test build #91266 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91266/testReport)**
 for PR 21451 at commit 
[`7e517e4`](https://github.com/apache/spark/commit/7e517e4ea0ff66dc57121b54fdd71f8391edd8f2).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21451: [SPARK-24296][CORE][WIP] Replicate large blocks a...

2018-05-29 Thread squito
GitHub user squito opened a pull request:

https://github.com/apache/spark/pull/21451

[SPARK-24296][CORE][WIP] Replicate large blocks as a stream.

When replicating large cached RDD blocks, it can be helpful to replicate
them as a stream, to avoid using large amounts of memory during the
transfer.  This also allows blocks larger than 2GB to be replicated.

Added unit tests in DistributedSuite.  Also ran tests on a cluster for
blocks > 2gb.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/squito/spark clean_replication

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21451.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21451


commit 05967808f5440835919f02d6c5d0d3563482d304
Author: Imran Rashid 
Date:   2018-05-02T14:55:15Z

[SPARK-6237][NETWORK] Network-layer changes to allow stream upload.

These changes allow an RPCHandler to receive an upload as a stream of
data, without having to buffer the entire message in the FrameDecoder.
The primary use case is for replicating large blocks.

Added unit tests.

commit 43658df6d6b7dffacd528a2573e8846ab6469e81
Author: Imran Rashid 
Date:   2018-05-23T03:59:40Z

[SPARK-24307][CORE] Support reading remote cached partitions > 2gb

(1) Netty's ByteBuf cannot support data > 2gb.  So to transfer data from a
ChunkedByteBuffer over the network, we use a custom version of
FileRegion which is backed by the ChunkedByteBuffer.

(2) On the receiving end, we need to expose all the data in a
FileSegmentManagedBuffer as a ChunkedByteBuffer.  We do that by memory
mapping the entire file in chunks.

Added unit tests.  Also tested on a cluster with remote cache reads >
2gb (in memory and on disk).

commit 7e517e4ea0ff66dc57121b54fdd71f8391edd8f2
Author: Imran Rashid 
Date:   2018-05-15T16:48:51Z

[SPARK-24296][CORE] Replicate large blocks as a stream.

When replicating large cached RDD blocks, it can be helpful to replicate
them as a stream, to avoid using large amounts of memory during the
transfer.  This also allows blocks larger than 2GB to be replicated.

Added unit tests in DistributedSuite.  Also ran tests on a cluster for
blocks > 2gb.
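
As a side note on point (2) of commit 43658df above, here is a minimal sketch 
of mapping a large file in chunks so that each mapped buffer stays under the 
2GB limit of a single MappedByteBuffer (and of Netty's ByteBuf). The chunk 
size and method name are illustrative, not the PR's actual implementation:

    import java.io.{File, RandomAccessFile}
    import java.nio.MappedByteBuffer
    import java.nio.channels.FileChannel

    def mapFileInChunks(file: File, chunkSize: Long = 64L * 1024 * 1024): Seq[MappedByteBuffer] = {
      val channel = new RandomAccessFile(file, "r").getChannel
      try {
        val length = channel.size()
        // Map [offset, offset + size) windows; each window is at most chunkSize bytes.
        (0L until length by chunkSize).map { offset =>
          val size = math.min(chunkSize, length - offset)
          channel.map(FileChannel.MapMode.READ_ONLY, offset, size)
        }
      } finally {
        // Mapped buffers remain valid after the channel is closed.
        channel.close()
      }
    }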




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isInCollection in DataFrame AP...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21416
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isInCollection in DataFrame AP...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21416
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91254/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...

2018-05-29 Thread edwinalu
Github user edwinalu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21221#discussion_r191494940
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/PeakExecutorMetrics.scala ---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import org.apache.spark.executor.ExecutorMetrics
+import org.apache.spark.status.api.v1.PeakMemoryMetrics
+
+/**
+ * Records the peak values for executor level metrics. If jvmUsedHeapMemory is -1, then no
+ * values have been recorded yet.
+ */
+private[spark] class PeakExecutorMetrics {
+  private var _jvmUsedHeapMemory = -1L;
+  private var _jvmUsedNonHeapMemory = 0L;
+  private var _onHeapExecutionMemory = 0L
+  private var _offHeapExecutionMemory = 0L
+  private var _onHeapStorageMemory = 0L
+  private var _offHeapStorageMemory = 0L
+  private var _onHeapUnifiedMemory = 0L
+  private var _offHeapUnifiedMemory = 0L
+  private var _directMemory = 0L
+  private var _mappedMemory = 0L
+
+  def jvmUsedHeapMemory: Long = _jvmUsedHeapMemory
+
+  def jvmUsedNonHeapMemory: Long = _jvmUsedNonHeapMemory
+
+  def onHeapExecutionMemory: Long = _onHeapExecutionMemory
+
+  def offHeapExecutionMemory: Long = _offHeapExecutionMemory
+
+  def onHeapStorageMemory: Long = _onHeapStorageMemory
+
+  def offHeapStorageMemory: Long = _offHeapStorageMemory
+
+  def onHeapUnifiedMemory: Long = _onHeapUnifiedMemory
+
+  def offHeapUnifiedMemory: Long = _offHeapUnifiedMemory
+
+  def directMemory: Long = _directMemory
+
+  def mappedMemory: Long = _mappedMemory
+
+  /**
+   * Compare the specified memory values with the saved peak executor memory
+   * values, and update if there is a new peak value.
+   *
+   * @param executorMetrics the executor metrics to compare
+   * @return if there is a new peak value for any metric
+   */
+  def compareAndUpdate(executorMetrics: ExecutorMetrics): Boolean = {
+    var updated: Boolean = false
+
+    if (executorMetrics.jvmUsedHeapMemory > _jvmUsedHeapMemory) {
+      _jvmUsedHeapMemory = executorMetrics.jvmUsedHeapMemory
+      updated = true
+    }
+    if (executorMetrics.jvmUsedNonHeapMemory > _jvmUsedNonHeapMemory) {
+      _jvmUsedNonHeapMemory = executorMetrics.jvmUsedNonHeapMemory
+      updated = true
+    }
+    if (executorMetrics.onHeapExecutionMemory > _onHeapExecutionMemory) {
+      _onHeapExecutionMemory = executorMetrics.onHeapExecutionMemory
+      updated = true
+    }
+    if (executorMetrics.offHeapExecutionMemory > _offHeapExecutionMemory) {
+      _offHeapExecutionMemory = executorMetrics.offHeapExecutionMemory
+      updated = true
+    }
+    if (executorMetrics.onHeapStorageMemory > _onHeapStorageMemory) {
+      _onHeapStorageMemory = executorMetrics.onHeapStorageMemory
+      updated = true
+    }
+    if (executorMetrics.offHeapStorageMemory > _offHeapStorageMemory) {
+      _offHeapStorageMemory = executorMetrics.offHeapStorageMemory
--- End diff --

Will do. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isInCollection in DataFrame AP...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21416
  
**[Test build #91254 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91254/testReport)**
 for PR 21416 at commit 
[`fed2846`](https://github.com/apache/spark/commit/fed2846fe7c9ca2cb4534b23803cd29d5a18d4f9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...

2018-05-29 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20697
  
Kubernetes integration test status success
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3543/



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21416: [SPARK-24371] [SQL] Added isInCollection in DataF...

2018-05-29 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21416#discussion_r191491943
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
@@ -786,6 +787,24 @@ class Column(val expr: Expression) extends Logging {
   @scala.annotation.varargs
  def isin(list: Any*): Column = withExpr { In(expr, list.map(lit(_).expr)) }
--- End diff --

I know this is not your change, but I think (both here and below) something 
about the automagical type casting that's going on should be in the 
docstring/scaladoc/javadoc, because it's a little surprising that this will 
compare integers to strings and silently convert the types, even when none of 
the strings can be converted to integers. I'd also include that in the 
isInCollection docstring/scaladoc/javadoc below.

I'd also point out that the result of the conversion needs to be of the same 
type, and not a sequence of that type (although the error message we get is 
pretty clear, so your call).

Just a suggestion for improvement.
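
To make the surprise concrete, a small sketch (assuming a SparkSession with 
spark.implicits._ in scope; it mirrors the tests in this PR):

    // Mixed numeric and string values are implicitly cast to a common type,
    // so this silently matches rows where a == 1 or a == 2.
    val df = Seq((1, "x"), (2, "y"), (3, "z")).toDF("a", "b")
    df.filter($"a".isin(1.toShort, "2")).show()

    // Passing a Seq as a single argument compares the column against one
    // array value and fails analysis ("Arguments must be same type"),
    // rather than expanding the sequence; hence the separate isInCollection API.
    // df.filter($"a".isin(Seq(1, 2)))  // throws AnalysisException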


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21416: [SPARK-24371] [SQL] Added isInCollection in DataF...

2018-05-29 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21416#discussion_r191488193
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala ---
@@ -390,11 +394,67 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext {
     checkAnswer(df.filter($"b".isin("z", "y")),
       df.collect().toSeq.filter(r => r.getString(1) == "z" || r.getString(1) == "y"))
 
+    // Auto casting should work with mixture of different types in collections
+    checkAnswer(df.filter($"a".isin(1.toShort, "2")),
+      df.collect().toSeq.filter(r => r.getInt(0) == 1 || r.getInt(0) == 2))
+    checkAnswer(df.filter($"a".isin("3", 2.toLong)),
+      df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 2))
+    checkAnswer(df.filter($"a".isin(3, "1")),
+      df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 1))
+
     val df2 = Seq((1, Seq(1)), (2, Seq(2)), (3, Seq(3))).toDF("a", "b")
 
-    intercept[AnalysisException] {
+    val e = intercept[AnalysisException] {
       df2.filter($"a".isin($"b"))
     }
+    Seq("cannot resolve", "due to data type mismatch: Arguments must be same type but were")
+      .foreach { s =>
+        assert(e.getMessage.toLowerCase(Locale.ROOT).contains(s.toLowerCase(Locale.ROOT)))
+      }
+  }
+
+  test("isInCollection: Scala Collection") {
+    val df = Seq((1, "x"), (2, "y"), (3, "z")).toDF("a", "b")
+    // Test with different types of collections
+    checkAnswer(df.filter($"a".isInCollection(Seq(3, 1))),
+      df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 1))
+    checkAnswer(df.filter($"a".isInCollection(Seq(1, 2).toSet)),
+      df.collect().toSeq.filter(r => r.getInt(0) == 1 || r.getInt(0) == 2))
+    checkAnswer(df.filter($"a".isInCollection(Seq(3, 2).toArray)),
+      df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 2))
+    checkAnswer(df.filter($"a".isInCollection(Seq(3, 1).toList)),
+      df.collect().toSeq.filter(r => r.getInt(0) == 3 || r.getInt(0) == 1))
+
+    val df2 = Seq((1, Seq(1)), (2, Seq(2)), (3, Seq(3))).toDF("a", "b")
+
+    val e = intercept[AnalysisException] {
+      df2.filter($"a".isInCollection(Seq($"b")))
+    }
+    Seq("cannot resolve", "due to data type mismatch: Arguments must be same type but were")
+      .foreach { s =>
+        assert(e.getMessage.toLowerCase(Locale.ROOT).contains(s.toLowerCase(Locale.ROOT)))
+      }
+  }
+
+  test("isInCollection: Java Collection") {
--- End diff --

As stated up above, maybe this would make sense to do in Java, but your 
call.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21416: [SPARK-24371] [SQL] Added isInCollection in DataF...

2018-05-29 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21416#discussion_r191486882
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Column.scala ---
@@ -786,6 +787,24 @@ class Column(val expr: Expression) extends Logging {
   @scala.annotation.varargs
  def isin(list: Any*): Column = withExpr { In(expr, list.map(lit(_).expr)) }
 
+  /**
+   * A boolean expression that is evaluated to true if the value of this expression is contained
+   * by the provided collection.
+   *
+   * @group expr_ops
+   * @since 2.4.0
+   */
+  def isInCollection(values: scala.collection.Iterable[_]): Column = isin(values.toSeq: _*)
+
+  /**
+   * A boolean expression that is evaluated to true if the value of this expression is contained
+   * by the provided collection.
+   *
+   * @group java_expr_ops
+   * @since 2.4.0
+   */
+  def isInCollection(values: java.lang.Iterable[_]): Column = isInCollection(values.asScala)
--- End diff --

Not that we need it for sure, but in the past some of our Java APIs have been 
difficult to call from Java, and if we're making an API designed to be called 
from Java it might make sense to have a test case for it in Java. Here I 
understand it's not super important, so just a suggestion.
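
For what it's worth, even from Scala one can exercise the java.lang.Iterable 
overload, since java.util.List implements java.lang.Iterable but not 
scala.collection.Iterable (a sketch assuming a SparkSession with 
spark.implicits._ in scope; a real test would live in Java, as suggested 
above):

    import java.util.Arrays

    val df = Seq((1, "x"), (2, "y"), (3, "z")).toDF("a", "b")

    // Resolves to the Java-facing overload of isInCollection.
    val javaValues: java.util.List[Integer] = Arrays.asList(1, 3)
    df.filter($"a".isInCollection(javaValues)).show()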


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21435: [SPARK-24392][PYTHON] Label pandas_udf as Experimental

2018-05-29 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/21435
  
Thanks @HyukjinKwon and @viirya !


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20697
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20697
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3676/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19691: [SPARK-14922][SPARK-17732][SQL]ALTER TABLE DROP P...

2018-05-29 Thread DazhuangSu
Github user DazhuangSu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19691#discussion_r191492537
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
 ---
@@ -282,6 +282,26 @@ class AstBuilder(conf: SQLConf) extends 
SqlBaseBaseVisitor[AnyRef] with Logging
 parts.toMap
   }
 
+  /**
+   * Create a partition filter specification.
+   */
+  def visitPartitionFilterSpec(ctx: PartitionSpecContext): Seq[Expression] = withOrigin(ctx) {
+    val parts = ctx.expression.asScala.map { pVal =>
+      expression(pVal) match {
+        case EqualNullSafe(_, _) =>
+          throw new ParseException("'<=>' operator is not allowed in partition specification.", ctx)
+        case cmp @ BinaryComparison(UnresolvedAttribute(name :: Nil), constant: Literal) =>
+          cmp
+        case bc @ BinaryComparison(constant: Literal, _) =>
+          throw new ParseException("Literal " + constant
--- End diff --

Sorry,  I was careless. Will fix this.
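
Purely for illustration, one way a literal-first comparison could be 
normalized by flipping the operator rather than being rejected; this is a 
sketch, not necessarily the fix being discussed here:

    import org.apache.spark.sql.catalyst.expressions._

    // Rewrites e.g. `1 < p` as `p > 1` so later code only sees
    // attribute-first comparisons.
    def flip(cmp: BinaryComparison): BinaryComparison = cmp match {
      case GreaterThan(l, r)        => LessThan(r, l)
      case GreaterThanOrEqual(l, r) => LessThanOrEqual(r, l)
      case LessThan(l, r)           => GreaterThan(r, l)
      case LessThanOrEqual(l, r)    => GreaterThanOrEqual(r, l)
      case EqualTo(l, r)            => EqualTo(r, l)
      case other                    => other
    }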


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21426: [SPARK-24384][PYTHON][SPARK SUBMIT] Add .py files...

2018-05-29 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21426#discussion_r191491127
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala ---
@@ -153,4 +154,30 @@ object PythonRunner {
       .map { p => formatPath(p, testWindows) }
   }
 
+  /**
+   * Resolves the ".py" files. ".py" file should not be added as is because PYTHONPATH does
+   * not expect a file. This method creates a temporary directory and puts the ".py" files
+   * if exist in the given paths.
+   */
+  private def resolvePyFiles(pyFiles: Array[String]): Array[String] = {
+    val dest = Utils.createTempDir(namePrefix = "localPyFiles")
+    pyFiles.flatMap { pyFile =>
+      // In case of client with submit, the python paths should be set before context
+      // initialization because the context initialization can be done later.
+      // We will copy the local ".py" files because ".py" file shouldn't be added
+      // alone but its parent directory in PYTHONPATH. See SPARK-24384.
+      if (pyFile.endsWith(".py")) {
+        val source = new File(pyFile)
+        if (source.exists() && source.canRead) {
--- End diff --

I think providing a non-existent file to spark-submit should result in an 
error. Whether the error happens here or somewhere else doesn't really 
matter.
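
A hedged sketch of that suggestion: fail fast on a missing or unreadable 
".py" file instead of silently passing it through. The names mirror the diff 
above, but the error type and message are illustrative:

    import java.io.File
    import java.nio.file.{Files, StandardCopyOption}

    private def resolvePyFile(dest: File, pyFile: String): String = {
      if (pyFile.endsWith(".py")) {
        val source = new File(pyFile)
        if (!source.exists() || !source.canRead) {
          throw new IllegalArgumentException(
            s"Python file does not exist or is unreadable: $pyFile")
        }
        // Copy into the temp dir so PYTHONPATH can point at the directory.
        val target = new File(dest, source.getName)
        Files.copy(source.toPath, target.toPath, StandardCopyOption.REPLACE_EXISTING)
        target.getAbsolutePath
      } else {
        pyFile
      }
    }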


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21061: [SPARK-23914][SQL] Add array_union function

2018-05-29 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21061
  
ping @ueshin 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...

2018-05-29 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/21221#discussion_r191489062
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/PeakExecutorMetrics.scala ---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import org.apache.spark.executor.ExecutorMetrics
+import org.apache.spark.status.api.v1.PeakMemoryMetrics
+
+/**
+ * Records the peak values for executor level metrics. If jvmUsedHeapMemory is -1, then no
+ * values have been recorded yet.
+ */
+private[spark] class PeakExecutorMetrics {
+  private var _jvmUsedHeapMemory = -1L;
+  private var _jvmUsedNonHeapMemory = 0L;
+  private var _onHeapExecutionMemory = 0L
+  private var _offHeapExecutionMemory = 0L
+  private var _onHeapStorageMemory = 0L
+  private var _offHeapStorageMemory = 0L
+  private var _onHeapUnifiedMemory = 0L
+  private var _offHeapUnifiedMemory = 0L
+  private var _directMemory = 0L
+  private var _mappedMemory = 0L
+
+  def jvmUsedHeapMemory: Long = _jvmUsedHeapMemory
+
+  def jvmUsedNonHeapMemory: Long = _jvmUsedNonHeapMemory
+
+  def onHeapExecutionMemory: Long = _onHeapExecutionMemory
+
+  def offHeapExecutionMemory: Long = _offHeapExecutionMemory
+
+  def onHeapStorageMemory: Long = _onHeapStorageMemory
+
+  def offHeapStorageMemory: Long = _offHeapStorageMemory
+
+  def onHeapUnifiedMemory: Long = _onHeapUnifiedMemory
+
+  def offHeapUnifiedMemory: Long = _offHeapUnifiedMemory
+
+  def directMemory: Long = _directMemory
+
+  def mappedMemory: Long = _mappedMemory
+
+  /**
+   * Compare the specified memory values with the saved peak executor memory
+   * values, and update if there is a new peak value.
+   *
+   * @param executorMetrics the executor metrics to compare
+   * @return if there is a new peak value for any metric
+   */
+  def compareAndUpdate(executorMetrics: ExecutorMetrics): Boolean = {
+    var updated: Boolean = false
+
+    if (executorMetrics.jvmUsedHeapMemory > _jvmUsedHeapMemory) {
+      _jvmUsedHeapMemory = executorMetrics.jvmUsedHeapMemory
+      updated = true
+    }
+    if (executorMetrics.jvmUsedNonHeapMemory > _jvmUsedNonHeapMemory) {
+      _jvmUsedNonHeapMemory = executorMetrics.jvmUsedNonHeapMemory
+      updated = true
+    }
+    if (executorMetrics.onHeapExecutionMemory > _onHeapExecutionMemory) {
+      _onHeapExecutionMemory = executorMetrics.onHeapExecutionMemory
+      updated = true
+    }
+    if (executorMetrics.offHeapExecutionMemory > _offHeapExecutionMemory) {
+      _offHeapExecutionMemory = executorMetrics.offHeapExecutionMemory
+      updated = true
+    }
+    if (executorMetrics.onHeapStorageMemory > _onHeapStorageMemory) {
+      _onHeapStorageMemory = executorMetrics.onHeapStorageMemory
+      updated = true
+    }
+    if (executorMetrics.offHeapStorageMemory > _offHeapStorageMemory) {
+      _offHeapStorageMemory = executorMetrics.offHeapStorageMemory
--- End diff --

The more you can take it over from here, the better :) But let me know if 
there is anything that's confusing, or if the TODOs I've left don't actually 
seem feasible, and I can take a closer look.
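
As an aside on the repetition in compareAndUpdate above, a sketch of one way 
to collapse the ten hand-written compare-and-set blocks into a loop over an 
array of peaks; names and structure here are illustrative, not the patch's 
actual design:

    // Tracks peak values for a fixed number of metrics, identified by index.
    private[spark] class PeakMetrics(numMetrics: Int) {
      private val peaks = Array.fill(numMetrics)(-1L)

      /** Returns true if any metric reached a new peak. */
      def compareAndUpdate(current: Seq[Long]): Boolean = {
        var updated = false
        var i = 0
        while (i < numMetrics) {
          if (current(i) > peaks(i)) {
            peaks(i) = current(i)
            updated = true
          }
          i += 1
        }
        updated
      }

      def apply(i: Int): Long = peaks(i)
    }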


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...

2018-05-29 Thread ssuchter
Github user ssuchter commented on the issue:

https://github.com/apache/spark/pull/20697
  
Ok, I think all open issues have been resolved. The PRB failures are 
because of github request failures, so they are spurious. @vanzin @mccheah I 
think it's ready for another look.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org






[GitHub] spark pull request #20697: [SPARK-23010][k8s] Initial checkin of k8s integra...

2018-05-29 Thread ssuchter
Github user ssuchter commented on a diff in the pull request:

https://github.com/apache/spark/pull/20697#discussion_r191487129
  
--- Diff: 
resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesTestComponents.scala
 ---
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.k8s.integrationtest
+
+import java.nio.file.{Path, Paths}
+import java.util.UUID
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import io.fabric8.kubernetes.client.DefaultKubernetesClient
+import org.scalatest.concurrent.Eventually
+
+private[spark] class KubernetesTestComponents(defaultClient: DefaultKubernetesClient) {
+
+  val namespaceOption = Option(System.getProperty("spark.kubernetes.test.namespace"))
+  val hasUserSpecifiedNamespace = namespaceOption.isDefined
+  val namespace = namespaceOption.getOrElse(UUID.randomUUID().toString.replaceAll("-", ""))
+  private val serviceAccountName =
+    Option(System.getProperty("spark.kubernetes.test.serviceAccountName"))
+      .getOrElse("default")
+  val kubernetesClient = defaultClient.inNamespace(namespace)
+  val clientConfig = kubernetesClient.getConfiguration
+
+  def createNamespace(): Unit = {
+    defaultClient.namespaces.createNew()
+      .withNewMetadata()
+      .withName(namespace)
+      .endMetadata()
+      .done()
+  }
+
+  def deleteNamespace(): Unit = {
+    defaultClient.namespaces.withName(namespace).delete()
+    Eventually.eventually(KubernetesSuite.TIMEOUT, KubernetesSuite.INTERVAL) {
+      val namespaceList = defaultClient
+        .namespaces()
+        .list()
+        .getItems
+        .asScala
+      require(!namespaceList.exists(_.getMetadata.getName == namespace))
+    }
+  }
+
+  def newSparkAppConf(): SparkAppConf = {
+    new SparkAppConf()
+      .set("spark.master", s"k8s://${kubernetesClient.getMasterUrl}")
+      .set("spark.kubernetes.namespace", namespace)
+      .set("spark.executor.memory", "500m")
+      .set("spark.executor.cores", "1")
+      .set("spark.executors.instances", "1")
+      .set("spark.app.name", "spark-test-app")
+      .set("spark.ui.enabled", "true")
+      .set("spark.testing", "false")
+      .set("spark.kubernetes.submission.waitAppCompletion", "false")
+      .set("spark.kubernetes.authenticate.driver.serviceAccountName", serviceAccountName)
+  }
+}
+
+private[spark] class SparkAppConf {
+
+  private val map = mutable.Map[String, String]()
+
+  def set(key: String, value: String): SparkAppConf = {
+    map.put(key, value)
+    this
+  }
+
+  def get(key: String): String = map.getOrElse(key, "")
+
+  def setJars(jars: Seq[String]): Unit = set("spark.jars", jars.mkString(","))
+
+  override def toString: String = map.toString
+
+  def toStringArray: Iterable[String] = map.toList.flatMap(t => List("--conf", s"${t._1}=${t._2}"))
+}
+
+private[spark] case class SparkAppArguments(
+    mainAppResource: String,
+    mainClass: String,
+    appArgs: Array[String])
+
+private[spark] object SparkAppLauncher extends Logging {
--- End diff --

This one is so much smaller (< 10 lines of executable code) than 
SparkLauncher that I think we should not try to switch in this CL.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20697: [SPARK-23010][k8s] Initial checkin of k8s integration te...

2018-05-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20697
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


