[GitHub] spark issue #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks when spi...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21369
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3549/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19602: [SPARK-22384][SQL] Refine partition pruning when attribu...

2018-05-24 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/19602
  
@cloud-fan
Thanks a lot for looking into this.
I updated the change and generalized `ExtractAttribute`.


---




[GitHub] spark issue #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks when spi...

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21369
  
**[Test build #91101 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91101/testReport)**
 for PR 21369 at commit 
[`bc7dc11`](https://github.com/apache/spark/commit/bc7dc11383db8370f755a058f4b908588f93edc8).


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21246
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3548/
Test PASSed.


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21246
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21246
  
**[Test build #91100 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91100/testReport)**
 for PR 21246 at commit 
[`6fd8f2f`](https://github.com/apache/spark/commit/6fd8f2fbd37e5193f0ffb1a25a8f4a8c71ab55bd).


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/21246
  
retest this please


---




[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21420
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91090/
Test FAILed.


---




[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21420
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21420
  
**[Test build #91090 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91090/testReport)**
 for PR 21420 at commit 
[`a41c99b`](https://github.com/apache/spark/commit/a41c99bf311aa8f4e0c2e07c1288f5a11e057ea4).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks w...

2018-05-24 Thread eyalfa
Github user eyalfa commented on a diff in the pull request:

https://github.com/apache/spark/pull/21369#discussion_r190542635
  
--- Diff: 
core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala
 ---
@@ -414,6 +415,99 @@ class ExternalAppendOnlyMapSuite extends SparkFunSuite 
with LocalSparkContext {
 sc.stop()
   }
 
+  test("spill during iteration") {
+val size = 1000
+val conf = createSparkConf(loadDefaults = true)
+sc = new SparkContext("local-cluster[1,1,1024]", "test", conf)
+val map = createExternalMap[Int]
+
+map.insertAll((0 until size).iterator.map(i => (i / 10, i)))
+assert(map.numSpills == 0, "map was not supposed to spill")
+
+val it = map.iterator
+assert( it.isInstanceOf[CompletionIterator[_, _]])
+val underlyingIt = map.readingIterator
+assert( underlyingIt != null )
+val underlyingMapIterator = underlyingIt.upstream
+assert(underlyingMapIterator != null)
+val underlyingMapIteratorClass = underlyingMapIterator.getClass
+assert(underlyingMapIteratorClass.getEnclosingClass == 
classOf[AppendOnlyMap[_, _]])
+
+val underlyingMap = map.currentMap
+assert(underlyingMap != null)
+
+val first50Keys = for ( _ <- 0 until 50) yield {
+  val (k, vs) = it.next
+  val sortedVs = vs.sorted
+  assert(sortedVs.seq == (0 until 10).map(10 * k + _))
+  k
+}
+assert( map.numSpills == 0 )
+map.spill(Long.MaxValue, null)
+// these asserts try to show that we're no longer holding references 
to the underlying map.
+// it'd be nice to use something like
+// 
https://github.com/scala/scala/blob/2.13.x/test/junit/scala/tools/testing/AssertUtil.scala
+// (lines 69-89)
+assert(map.currentMap == null)
+assert(underlyingIt.upstream ne underlyingMapIterator)
+assert(underlyingIt.upstream.getClass != underlyingMapIteratorClass)
+assert(underlyingIt.upstream.getClass.getEnclosingClass != 
classOf[AppendOnlyMap[_, _]])
--- End diff --

Hmm, we can at line 508, but not in this test. In this test we look at the iterator immediately after a spill; at that point `upstream` is supposed to have been replaced by a `DiskMapIterator`, so I guess we can check for this directly (after relaxing its visibility to package private).

At line 508, we can simply compare with `Iterator.empty`.
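The reference-release pattern this test is probing can be sketched in a self-contained way. The names below (`ReadingIterator`, `spill`) are hypothetical simplifications, not Spark's actual `ExternalAppendOnlyMap` API: a wrapper iterator swaps its `upstream` on spill, dropping the reference to the in-memory iterator so the backing map can be collected.

```scala
// Minimal sketch (hypothetical names, not Spark's actual classes) of the
// pattern under discussion: a wrapper iterator whose `upstream` is swapped
// out on spill, releasing the reference to the in-memory map's iterator.
class ReadingIterator[A](var upstream: Iterator[A]) extends Iterator[A] {
  def hasNext: Boolean = upstream.hasNext
  def next(): A = upstream.next()
  // On spill, replace the in-memory upstream with a disk-backed iterator,
  // dropping the reference to the old one (and its backing map).
  def spill(diskBacked: Iterator[A]): Unit = { upstream = diskBacked }
}

val inMemory = Iterator(1, 2, 3)
val it = new ReadingIterator(inMemory)
assert(it.next() == 1)
it.spill(Iterator(2, 3))        // simulate a spill to disk
assert(it.upstream ne inMemory) // old upstream reference released
assert(it.next() == 2)
```

The test in the diff asserts the same thing against the real classes: after `map.spill(...)`, `underlyingIt.upstream` must no longer be the `AppendOnlyMap` iterator.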


---




[GitHub] spark pull request #21422: [Spark-24376][doc]Summary:compiling spark with sc...

2018-05-24 Thread gentlewangyu
Github user gentlewangyu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21422#discussion_r190541918
  
--- Diff: docs/building-spark.md ---
@@ -92,10 +92,10 @@ like ZooKeeper and Hadoop itself.
 ./build/mvn -Pmesos -DskipTests clean package
 
 ## Building for Scala 2.10
-To produce a Spark package compiled with Scala 2.10, use the 
`-Dscala-2.10` property:
+To produce a Spark package compiled with Scala 2.10, use the 
`-Pscala-2.10` property:
 
 ./dev/change-scala-version.sh 2.10
-./build/mvn -Pyarn -Dscala-2.10 -DskipTests clean package
+./build/mvn -Pyarn -scala-2.10 -DskipTestsP clean package
--- End diff --

Sorry, it's `-Pscala-2.10`.
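For reference, the corrected snippet in `docs/building-spark.md` should read as follows (as agreed above, `scala-2.10` is a Maven profile, so it takes `-P`, not `-D`; `-DskipTests` is a system property and keeps its `-D`):

```shell
./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Pscala-2.10 -DskipTests clean package
```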


---




[GitHub] spark issue #21391: [SPARK-24343][SQL] Avoid shuffle for the bucketed table ...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21391
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21391: [SPARK-24343][SQL] Avoid shuffle for the bucketed table ...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21391
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91093/
Test FAILed.


---




[GitHub] spark issue #21391: [SPARK-24343][SQL] Avoid shuffle for the bucketed table ...

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21391
  
**[Test build #91093 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91093/testReport)**
 for PR 21391 at commit 
[`d9a440a`](https://github.com/apache/spark/commit/d9a440a9814913827fcfcff644c741a43332b02d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21246
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91095/
Test FAILed.


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21246
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21246
  
**[Test build #91095 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91095/testReport)**
 for PR 21246 at commit 
[`6fd8f2f`](https://github.com/apache/spark/commit/6fd8f2fbd37e5193f0ffb1a25a8f4a8c71ab55bd).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21422: [Spark-24376][doc]Summary:compiling spark with scala-2.1...

2018-05-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21422
  
A minor fix usually doesn't need a JIRA. Let's avoid it next time.


---




[GitHub] spark pull request #21422: [Spark-24376][doc]Summary:compiling spark with sc...

2018-05-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21422#discussion_r190525438
  
--- Diff: docs/building-spark.md ---
@@ -92,10 +92,10 @@ like ZooKeeper and Hadoop itself.
 ./build/mvn -Pmesos -DskipTests clean package
 
 ## Building for Scala 2.10
-To produce a Spark package compiled with Scala 2.10, use the 
`-Dscala-2.10` property:
+To produce a Spark package compiled with Scala 2.10, use the 
`-Pscala-2.10` property:
 
 ./dev/change-scala-version.sh 2.10
-./build/mvn -Pyarn -Dscala-2.10 -DskipTests clean package
+./build/mvn -Pyarn -scala-2.10 -DskipTestsP clean package
--- End diff --

DskipTestsP?


---




[GitHub] spark pull request #21398: [SPARK-24338][SQL] Fixed Hive CREATETABLE error i...

2018-05-24 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/21398#discussion_r190521452
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -230,11 +232,29 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
 // specify location for managed table. And in 
[[CreateDataSourceTableAsSelectCommand]] we have
 // to create the table directory and write out data before we create 
this table, to avoid
 // exposing a partial written table.
-val needDefaultTableLocation = tableDefinition.tableType == MANAGED &&
-  tableDefinition.storage.locationUri.isEmpty
-
-val tableLocation = if (needDefaultTableLocation) {
-  
Some(CatalogUtils.stringToURI(defaultTablePath(tableDefinition.identifier)))
+//
+// When using a remote metastore, and if a managed table is being 
created with its
+// location explicitly set to the location where it would be created 
anyway, then do
+// not set its location explicitly. This avoids an issue with Sentry 
in secure clusters.
+// Otherwise, the above comment applies.
--- End diff --

> The question is whether we want this workaround in Spark while Sentry is not fixed.

We could open a bug against Sentry. Since Spark uses Hive, which often comes with Sentry in enterprise environments, it would be good to point to that bug in the code, so that when things get fixed we can change the code here.


---




[GitHub] spark issue #21331: [SPARK-24276][SQL] Order of literals in IN should not af...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21331
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3547/
Test PASSed.


---




[GitHub] spark issue #21331: [SPARK-24276][SQL] Order of literals in IN should not af...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21331
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21317
  
Kubernetes integration test status success
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3425/



---




[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21416
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21317
  
Kubernetes integration test starting
URL: 
https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3425/



---




[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21317
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21317
  
**[Test build #91098 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91098/testReport)**
 for PR 21317 at commit 
[`ac544f8`](https://github.com/apache/spark/commit/ac544f836a867da2c709fbfe6e0a7336e1d026c1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21416
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91091/
Test FAILed.


---




[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21317
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91098/
Test PASSed.


---




[GitHub] spark issue #21331: [SPARK-24276][SQL] Order of literals in IN should not af...

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21331
  
**[Test build #91099 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91099/testReport)**
 for PR 21331 at commit 
[`a0af525`](https://github.com/apache/spark/commit/a0af52524e30a9ace9d9a6239de79a7251a2499c).


---




[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21317
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21416
  
**[Test build #91091 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91091/testReport)**
 for PR 21416 at commit 
[`de3296c`](https://github.com/apache/spark/commit/de3296c116b07d6560b1f89ca97c2014d36d10a8).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21317
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3546/
Test PASSed.


---




[GitHub] spark pull request #21331: [SPARK-24276][SQL] Order of literals in IN should...

2018-05-24 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21331#discussion_r190518402
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala
 ---
@@ -85,6 +87,10 @@ object Canonicalize {
 case Not(GreaterThanOrEqual(l, r)) => LessThan(l, r)
 case Not(LessThanOrEqual(l, r)) => GreaterThan(l, r)
 
+// order the list in the In operator
+case In(value, list) =>
+  In(value, list.sortBy(_.semanticHash()))
--- End diff --

Thanks for your great comment. This is very helpful and I'll keep it in mind for the future. I think I have addressed all the points now. Thanks.
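The canonicalization idea in the diff can be illustrated with a self-contained sketch. The types below are hypothetical simplifications (Spark's real rule works on Catalyst `Expression`s and sorts by `semanticHash()`; plain `hashCode` stands in for it here): sorting `In`'s list by a stable key makes literal order irrelevant when comparing canonicalized plans.

```scala
// Simplified sketch (hypothetical types, not Spark's Catalyst classes) of
// canonicalizing In by sorting its literal list: after canonicalization,
// `a IN (1, 3, 2)` and `a IN (2, 1, 3)` compare equal.
case class In(value: String, list: Seq[Int])

// Spark sorts by Expression.semanticHash(); hashCode stands in for it here.
def canonicalize(in: In): In = In(in.value, in.list.sortBy(_.hashCode))

val a = canonicalize(In("a", Seq(1, 3, 2)))
val b = canonicalize(In("a", Seq(2, 1, 3)))
assert(a == b)
```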


---




[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars

2018-05-24 Thread skonto
Github user skonto commented on the issue:

https://github.com/apache/spark/pull/21317
  
@liyinan926 ready for another round. I think it's ready to merge.


---




[GitHub] spark issue #21317: [SPARK-24232][k8s] Add support for secret env vars

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21317
  
**[Test build #91098 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91098/testReport)**
 for PR 21317 at commit 
[`ac544f8`](https://github.com/apache/spark/commit/ac544f836a867da2c709fbfe6e0a7336e1d026c1).


---




[GitHub] spark pull request #21317: [SPARK-24232][k8s] Add support for secret env var...

2018-05-24 Thread skonto
Github user skonto commented on a diff in the pull request:

https://github.com/apache/spark/pull/21317#discussion_r190514132
  
--- Diff: 
resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/features/EnvSecretsFeatureStepSuite.scala
 ---
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.k8s.features
+
+import io.fabric8.kubernetes.api.model.PodBuilder
+
+import org.apache.spark.{SparkConf, SparkFunSuite}
+import org.apache.spark.deploy.k8s._
+
+class EnvSecretsFeatureStepSuite extends SparkFunSuite{
+  private val KEY_REF_NAME_FOO = "foo"
+  private val KEY_REF_NAME_BAR = "bar"
+  private val KEY_REF_KEY_FOO = "key_foo"
+  private val KEY_REF_KEY_BAR = "key_bar"
+  private val ENV_NAME_FOO = "MY_FOO"
+  private val ENV_NAME_BAR = "MY_bar"
+
+  test("sets up all keyRefs") {
+val baseDriverPod = SparkPod.initialPod()
+val envVarsToKeys = Map(
+  ENV_NAME_BAR -> s"${KEY_REF_NAME_BAR}:${KEY_REF_KEY_BAR}",
+  ENV_NAME_FOO -> s"${KEY_REF_NAME_FOO}:${KEY_REF_KEY_FOO}")
+val sparkConf = new SparkConf(false)
+val kubernetesConf = KubernetesConf(
+  sparkConf,
+  KubernetesExecutorSpecificConf("1", new PodBuilder().build()),
+  "resource-name-prefix",
+  "app-id",
+  Map.empty,
+  Map.empty,
+  Map.empty,
+  envVarsToKeys,
+  Map.empty)
+
+val step = new EnvSecretsFeatureStep(kubernetesConf)
+val driverContainerWithEnvSecrets = 
step.configurePod(baseDriverPod).container
+
+val expectedVars =
+  Seq(s"${ENV_NAME_BAR}", s"${ENV_NAME_FOO}")
+
+expectedVars.foreach { envName =>
+  
assert(SecretEnvUtils.containerHasEnvVar(driverContainerWithEnvSecrets, 
envName))
--- End diff --

As we discussed, this is K8s's responsibility; we shouldn't verify the values.


---




[GitHub] spark issue #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase w...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21295
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase w...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21295
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3545/
Test PASSed.


---




[GitHub] spark issue #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase w...

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21295
  
**[Test build #91097 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91097/testReport)**
 for PR 21295 at commit 
[`497bdd8`](https://github.com/apache/spark/commit/497bdd8fc581f3c40ae97eb56d0a5f65e7d42405).


---




[GitHub] spark issue #21415: [SPARK-24244][SPARK-24368][SQL] Passing only required co...

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21415
  
**[Test build #91096 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91096/testReport)**
 for PR 21415 at commit 
[`0aef16b`](https://github.com/apache/spark/commit/0aef16b5e9017fb398e0df2f3694a1db1f4d7cb8).


---




[GitHub] spark issue #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase w...

2018-05-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21295
  
retest this please


---




[GitHub] spark issue #21415: [SPARK-24244][SPARK-24368][SQL] Passing only required co...

2018-05-24 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/21415
  
jenkins, retest this, please


---




[GitHub] spark issue #21260: [SPARK-23529][K8s] Support mounting volumes

2018-05-24 Thread andrusha
Github user andrusha commented on the issue:

https://github.com/apache/spark/pull/21260
  
@liyinan926 yep, ready for review. If you think it makes sense to make the refactors I mentioned, then I can add those too.


---




[GitHub] spark pull request #21405: [SPARK-24361][SQL] Polish code block manipulation...

2018-05-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21405#discussion_r190502747
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -149,6 +152,34 @@ trait Block extends JavaCode {
 
   // Concatenates this block with other block.
   def + (other: Block): Block
+
+  /**
+   * Apply a map function to each java expression codes present in this 
java code, and return a new
+   * java code based on the mapped java expression codes.
+   */
+  def transformExprValues(f: PartialFunction[ExprValue, ExprValue]): 
this.type = {
--- End diff --

OK. Will add them in the next commit. Thanks.
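The method under discussion can be sketched in simplified form. The types below are hypothetical stand-ins, not Spark's actual `Block`/`ExprValue` API: the idea is to walk the pieces of a code block and rewrite only those matched by a partial function, leaving everything else untouched.

```scala
// Self-contained sketch (simplified, hypothetical types) of the idea behind
// transformExprValues: apply a partial function to each expression value
// inside a code block and rebuild the block from the mapped pieces.
sealed trait Piece
case class Literal(s: String) extends Piece
case class ExprValue(name: String) extends Piece

case class Block(pieces: Seq[Piece]) {
  def transformExprValues(f: PartialFunction[ExprValue, ExprValue]): Block =
    Block(pieces.map {
      case ev: ExprValue if f.isDefinedAt(ev) => f(ev) // rewrite matched values
      case other => other                              // keep everything else
    })
}

val block = Block(Seq(Literal("int x = "), ExprValue("a")))
val renamed = block.transformExprValues { case ExprValue("a") => ExprValue("b") }
assert(renamed.pieces == Seq(Literal("int x = "), ExprValue("b")))
```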


---




[GitHub] spark issue #21422: [Spark-24376][doc]Summary:compiling spark with scala-2.1...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21422
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #21422: [Spark-24376][doc]Summary:compiling spark with scala-2.1...

2018-05-24 Thread gentlewangyu
Github user gentlewangyu commented on the issue:

https://github.com/apache/spark/pull/21422
  
I have verified this patch.


---




[GitHub] spark pull request #21405: [SPARK-24361][SQL] Polish code block manipulation...

2018-05-24 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21405#discussion_r190502024
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -149,6 +152,34 @@ trait Block extends JavaCode {
 
   // Concatenates this block with other block.
   def + (other: Block): Block
+
+  /**
+   * Apply a map function to each java expression codes present in this 
java code, and return a new
+   * java code based on the mapped java expression codes.
+   */
+  def transformExprValues(f: PartialFunction[ExprValue, ExprValue]): 
this.type = {
--- End diff --

I see, can we add some tests for this?


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21246
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21246
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3544/
Test PASSed.


---




[GitHub] spark pull request #21422: [Spark-24376][doc]Summary:compiling spark with sc...

2018-05-24 Thread gentlewangyu
GitHub user gentlewangyu opened a pull request:

https://github.com/apache/spark/pull/21422

[Spark-24376][doc]Summary: compiling spark with scala-2.10 should use the -P parameter instead of -D



## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)



## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gentlewangyu/spark SPARK-24376

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21422.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21422


commit d5167ed3dd3ce87a8904c2444eaf6c839ea541d3
Author: wangyu 
Date:   2018-05-24T08:06:47Z

git config --global user.email "wan...@cmss.chinamobile.com"

git config --global user.name "gentlewangyu"




---




[GitHub] spark issue #21422: [Spark-24376][doc]Summary:compiling spark with scala-2.1...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21422
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #21405: [SPARK-24361][SQL] Polish code block manipulation...

2018-05-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21405#discussion_r190501136
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -149,6 +152,34 @@ trait Block extends JavaCode {
 
   // Concatenates this block with other block.
   def + (other: Block): Block
+
+  /**
+   * Apply a map function to each java expression codes present in this java code, and return a new
+   * java code based on the mapped java expression codes.
+   */
+  def transformExprValues(f: PartialFunction[ExprValue, ExprValue]): this.type = {
--- End diff --

`transformExprValues` will create a new `Block` instance because a block is 
immutable. So once there are changes, the new block should take the new `exprValues`.
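
To make the immutability point concrete, here is a minimal, self-contained sketch (all names here are illustrative, not the actual Spark `Block`/`ExprValue` classes): transforming never mutates the receiver; it builds a new block, whose `exprValues` are then derived from the mapped parts.

```scala
// Hypothetical stand-ins for Spark's codegen ExprValue and Block.
case class ExprValue(name: String)

case class Block(parts: Seq[Any]) {
  // Values referenced by this block, recomputed from its parts.
  def exprValues: Set[ExprValue] =
    parts.collect { case v: ExprValue => v }.toSet

  // Returns a NEW Block with the partial function applied to each ExprValue;
  // the original block is left untouched.
  def transformExprValues(f: PartialFunction[ExprValue, ExprValue]): Block =
    Block(parts.map {
      case v: ExprValue if f.isDefinedAt(v) => f(v)
      case other => other
    })
}

val block = Block(Seq("int result = ", ExprValue("a"), " + ", ExprValue("b"), ";"))
val renamed = block.transformExprValues { case ExprValue("a") => ExprValue("tmp") }

// The original block keeps its exprValues; the new block carries the mapped ones.
assert(block.exprValues == Set(ExprValue("a"), ExprValue("b")))
assert(renamed.exprValues == Set(ExprValue("tmp"), ExprValue("b")))
```

Because `exprValues` is derived from `parts` on each call, the new block automatically "takes new `exprValues`" without any mutable state.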


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21246
  
**[Test build #91095 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91095/testReport)**
 for PR 21246 at commit 
[`6fd8f2f`](https://github.com/apache/spark/commit/6fd8f2fbd37e5193f0ffb1a25a8f4a8c71ab55bd).


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21246
  
**[Test build #91094 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91094/testReport)**
 for PR 21246 at commit 
[`fca3af8`](https://github.com/apache/spark/commit/fca3af8d7ed57b347bec046ab1ca7b2227286fec).
 * This patch **fails Java style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21246
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91094/
Test FAILed.


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21246
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21246
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3543/
Test PASSed.


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21246
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #21405: [SPARK-24361][SQL] Polish code block manipulation...

2018-05-24 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21405#discussion_r190497595
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/javaCode.scala
 ---
@@ -149,6 +152,34 @@ trait Block extends JavaCode {
 
   // Concatenates this block with other block.
   def + (other: Block): Block
+
+  /**
+   * Apply a map function to each java expression codes present in this java code, and return a new
+   * java code based on the mapped java expression codes.
+   */
+  def transformExprValues(f: PartialFunction[ExprValue, ExprValue]): this.type = {
--- End diff --

Am I wrong, or are we not updating the `exprValues` here?


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21246
  
**[Test build #91094 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91094/testReport)**
 for PR 21246 at commit 
[`fca3af8`](https://github.com/apache/spark/commit/fca3af8d7ed57b347bec046ab1ca7b2227286fec).


---




[GitHub] spark issue #21246: [SPARK-23901][SQL] Add masking functions

2018-05-24 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/21246
  
retest this please


---




[GitHub] spark pull request #21398: [SPARK-24338][SQL] Fixed Hive CREATETABLE error i...

2018-05-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21398#discussion_r190494840
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -230,11 +232,29 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
 // specify location for managed table. And in 
[[CreateDataSourceTableAsSelectCommand]] we have
 // to create the table directory and write out data before we create 
this table, to avoid
 // exposing a partial written table.
-val needDefaultTableLocation = tableDefinition.tableType == MANAGED &&
-  tableDefinition.storage.locationUri.isEmpty
-
-val tableLocation = if (needDefaultTableLocation) {
-  
Some(CatalogUtils.stringToURI(defaultTablePath(tableDefinition.identifier)))
+//
+// When using a remote metastore, and if a managed table is being 
created with its
+// location explicitly set to the location where it would be created 
anyway, then do
+// not set its location explicitly. This avoids an issue with Sentry 
in secure clusters.
+// Otherwise, the above comment applies.
--- End diff --

> IIRC you can't change the location of the default database in Hive

good to know that, maybe we should apply the same rule to Spark.


---




[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-24 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r190494243
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/planning/SelectedFieldSuite.scala
 ---
@@ -0,0 +1,432 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.exceptions.TestFailedException
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.expressions.NamedExpression
+import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.types._
+
+// scalastyle:off line.size.limit
+class SelectedFieldSuite extends SparkFunSuite with BeforeAndAfterAll {
+  // The test schema as a tree string, i.e. `schema.treeString`
+  // root
+  //  |-- col1: string (nullable = false)
+  //  |-- col2: struct (nullable = true)
+  //  |    |-- field1: integer (nullable = true)
+  //  |    |-- field2: array (nullable = true)
+  //  |    |    |-- element: integer (containsNull = false)
+  //  |    |-- field3: array (nullable = false)
+  //  |    |    |-- element: struct (containsNull = true)
+  //  |    |    |    |-- subfield1: integer (nullable = true)
+  //  |    |    |    |-- subfield2: integer (nullable = true)
+  //  |    |    |    |-- subfield3: array (nullable = true)
+  //  |    |    |    |    |-- element: integer (containsNull = true)
+  //  |    |-- field4: map (nullable = true)
+  //  |    |    |-- key: string
+  //  |    |    |-- value: struct (valueContainsNull = false)
+  //  |    |    |    |-- subfield1: integer (nullable = true)
+  //  |    |    |    |-- subfield2: array (nullable = true)
+  //  |    |    |    |    |-- element: integer (containsNull = false)
+  //  |    |-- field5: array (nullable = false)
+  //  |    |    |-- element: struct (containsNull = true)
+  //  |    |    |    |-- subfield1: struct (nullable = false)
+  //  |    |    |    |    |-- subsubfield1: integer (nullable = true)
+  //  |    |    |    |    |-- subsubfield2: integer (nullable = true)
+  //  |    |    |    |-- subfield2: struct (nullable = true)
+  //  |    |    |    |    |-- subsubfield1: struct (nullable = true)
+  //  |    |    |    |    |    |-- subsubsubfield1: string (nullable = true)
+  //  |    |    |    |    |-- subsubfield2: integer (nullable = true)
+  //  |    |-- field6: struct (nullable = true)
+  //  |    |    |-- subfield1: string (nullable = false)
+  //  |    |    |-- subfield2: string (nullable = true)
+  //  |    |-- field7: struct (nullable = true)
+  //  |    |    |-- subfield1: struct (nullable = true)
+  //  |    |    |    |-- subsubfield1: integer (nullable = true)
+  //  |    |    |    |-- subsubfield2: integer (nullable = true)
+  //  |    |-- field8: map (nullable = true)
+  //  |    |    |-- key: string
+  //  |    |    |-- value: array (valueContainsNull = false)
+  //  |    |    |    |-- element: struct (containsNull = true)
+  //  |    |    |    |    |-- subfield1: integer (nullable = true)
+  //  |    |    |    |    |-- subfield2: array (nullable = true)
+  //  |    |    |    |    |    |-- element: integer (containsNull = false)
+  //  |    |-- field9: map (nullable = true)
+  //  |    |    |-- key: string
+  //  |    |    |-- value: integer (valueContainsNull = false)
+  //  |-- col3: array (nullable = false)
+  //  |    |-- element: struct (containsNull = false)
+  //  |    |    |-- field1: struct (nullable = true)
+  //  |    |    |    |-- subfield1: integer (nullable = false)
+  //  |    |    |    |-- subfield2: integer (nullable = true)
+  //  |    |    |-- field2: map (nullable = true)
+  //  |    |    |    |-- key: string
+  //  |    |    |    |-- value: integer (valueContainsN

[GitHub] spark pull request #21369: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks w...

2018-05-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21369#discussion_r190494153
  
--- Diff: 
core/src/test/scala/org/apache/spark/util/collection/ExternalAppendOnlyMapSuite.scala
 ---
@@ -414,6 +415,99 @@ class ExternalAppendOnlyMapSuite extends SparkFunSuite 
with LocalSparkContext {
 sc.stop()
   }
 
+  test("spill during iteration") {
+val size = 1000
+val conf = createSparkConf(loadDefaults = true)
+sc = new SparkContext("local-cluster[1,1,1024]", "test", conf)
+val map = createExternalMap[Int]
+
+map.insertAll((0 until size).iterator.map(i => (i / 10, i)))
+assert(map.numSpills == 0, "map was not supposed to spill")
+
+val it = map.iterator
+assert( it.isInstanceOf[CompletionIterator[_, _]])
+val underlyingIt = map.readingIterator
+assert( underlyingIt != null )
+val underlyingMapIterator = underlyingIt.upstream
+assert(underlyingMapIterator != null)
+val underlyingMapIteratorClass = underlyingMapIterator.getClass
+assert(underlyingMapIteratorClass.getEnclosingClass == 
classOf[AppendOnlyMap[_, _]])
+
+val underlyingMap = map.currentMap
+assert(underlyingMap != null)
+
+val first50Keys = for ( _ <- 0 until 50) yield {
+  val (k, vs) = it.next
+  val sortedVs = vs.sorted
+  assert(sortedVs.seq == (0 until 10).map(10 * k + _))
+  k
+}
+assert( map.numSpills == 0 )
+map.spill(Long.MaxValue, null)
+// these asserts try to show that we're no longer holding references 
to the underlying map.
+// it'd be nice to use something like
+// 
https://github.com/scala/scala/blob/2.13.x/test/junit/scala/tools/testing/AssertUtil.scala
+// (lines 69-89)
+assert(map.currentMap == null)
+assert(underlyingIt.upstream ne underlyingMapIterator)
+assert(underlyingIt.upstream.getClass != underlyingMapIteratorClass)
+assert(underlyingIt.upstream.getClass.getEnclosingClass != 
classOf[AppendOnlyMap[_, _]])
--- End diff --

can we simply check `assert(underlyingIt.upstream eq Iterator.empty)`?


---




[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-24 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r190493689
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
Expression, NamedExpression}
+import org.apache.spark.sql.catalyst.planning.{PhysicalOperation, 
ProjectionOverSchema, SelectedField}
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, 
Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, 
StructField, StructType}
+
+/**
+ * Prunes unnecessary Parquet columns given a [[PhysicalOperation]] over a
+ * [[ParquetRelation]]. By "Parquet column", we mean a column as defined 
in the
+ * Parquet format. In Spark SQL, a root-level Parquet column corresponds 
to a
+ * SQL column, and a nested Parquet column corresponds to a 
[[StructField]].
+ */
+private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan =
+if (SQLConf.get.nestedSchemaPruningEnabled) {
+  apply0(plan)
+} else {
+  plan
+}
+
+  private def apply0(plan: LogicalPlan): LogicalPlan =
+plan transformDown {
+  case op @ PhysicalOperation(projects, filters,
+  l @ LogicalRelation(hadoopFsRelation @ HadoopFsRelation(_, 
partitionSchema,
+dataSchema, _, parquetFormat: ParquetFileFormat, _), _, _, _)) 
=>
+val projectionFields = projects.flatMap(getFields)
+val filterFields = filters.flatMap(getFields)
+val requestedFields = (projectionFields ++ filterFields).distinct
+
+// If [[requestedFields]] includes a nested field, continue. 
Otherwise,
+// return [[op]]
+if (requestedFields.exists { case (_, optAtt) => optAtt.isEmpty }) 
{
+  val prunedSchema = requestedFields
+.map { case (field, _) => StructType(Array(field)) }
+.reduceLeft(_ merge _)
+  val dataSchemaFieldNames = dataSchema.fieldNames.toSet
+  val prunedDataSchema =
+StructType(prunedSchema.filter(f => 
dataSchemaFieldNames.contains(f.name)))
+
+  // If the data schema is different from the pruned data schema, 
continue. Otherwise,
+  // return [[op]]. We effect this comparison by counting the 
number of "leaf" fields in
+  // each schemata, assuming the fields in [[prunedDataSchema]] 
are a subset of the fields
+  // in [[dataSchema]].
+  if (countLeaves(dataSchema) > countLeaves(prunedDataSchema)) {
+val prunedParquetRelation =
+  hadoopFsRelation.copy(dataSchema = 
prunedDataSchema)(hadoopFsRelation.sparkSession)
+
+// We need to replace the expression ids of the pruned 
relation output attributes
+// with the expression ids of the original relation output 
attributes so that
+// references to the original relation's output are not broken
+val outputIdMap = l.output.map(att => (att.name, 
att.exprId)).toMap
+val prunedRelationOutput =
+  prunedParquetRelation
+.schema
+.toAttributes
+.map {
+  case att if outputIdMap.contains(att.name) =>
+att.withExprId(outputIdMap(att.name))
+  case att => att
+}
+val prunedRelation =
+  l.copy(relation = prunedParquetRelation, output = 
prunedRelationOutput)
+
+  

[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-24 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r190492424
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, 
Expression, NamedExpression}
+import org.apache.spark.sql.catalyst.planning.{PhysicalOperation, 
ProjectionOverSchema, SelectedField}
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, 
Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, 
StructField, StructType}
+
+/**
+ * Prunes unnecessary Parquet columns given a [[PhysicalOperation]] over a
+ * [[ParquetRelation]]. By "Parquet column", we mean a column as defined 
in the
+ * Parquet format. In Spark SQL, a root-level Parquet column corresponds 
to a
+ * SQL column, and a nested Parquet column corresponds to a 
[[StructField]].
+ */
+private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan =
+if (SQLConf.get.nestedSchemaPruningEnabled) {
+  apply0(plan)
+} else {
+  plan
+}
+
+  private def apply0(plan: LogicalPlan): LogicalPlan =
+plan transformDown {
+  case op @ PhysicalOperation(projects, filters,
+  l @ LogicalRelation(hadoopFsRelation @ HadoopFsRelation(_, 
partitionSchema,
+dataSchema, _, parquetFormat: ParquetFileFormat, _), _, _, _)) 
=>
+val projectionFields = projects.flatMap(getFields)
+val filterFields = filters.flatMap(getFields)
+val requestedFields = (projectionFields ++ filterFields).distinct
+
+// If [[requestedFields]] includes a nested field, continue. 
Otherwise,
+// return [[op]]
+if (requestedFields.exists { case (_, optAtt) => optAtt.isEmpty }) 
{
+  val prunedSchema = requestedFields
+.map { case (field, _) => StructType(Array(field)) }
+.reduceLeft(_ merge _)
+  val dataSchemaFieldNames = dataSchema.fieldNames.toSet
+  val prunedDataSchema =
+StructType(prunedSchema.filter(f => 
dataSchemaFieldNames.contains(f.name)))
+
+  // If the data schema is different from the pruned data schema, 
continue. Otherwise,
+  // return [[op]]. We effect this comparison by counting the 
number of "leaf" fields in
+  // each schemata, assuming the fields in [[prunedDataSchema]] 
are a subset of the fields
+  // in [[dataSchema]].
+  if (countLeaves(dataSchema) > countLeaves(prunedDataSchema)) {
+val prunedParquetRelation =
+  hadoopFsRelation.copy(dataSchema = 
prunedDataSchema)(hadoopFsRelation.sparkSession)
+
+// We need to replace the expression ids of the pruned 
relation output attributes
+// with the expression ids of the original relation output 
attributes so that
+// references to the original relation's output are not broken
+val outputIdMap = l.output.map(att => (att.name, 
att.exprId)).toMap
+val prunedRelationOutput =
+  prunedParquetRelation
+.schema
+.toAttributes
+.map {
+  case att if outputIdMap.contains(att.name) =>
+att.withExprId(outputIdMap(att.name))
+  case att => att
+}
+val prunedRelation =
+  l.copy(relation = prunedParquetRelation, output = 
prunedRelationOutput)
+
+  

[GitHub] spark pull request #21421: Myfeature

2018-05-24 Thread gentlewangyu
Github user gentlewangyu closed the pull request at:

https://github.com/apache/spark/pull/21421


---




[GitHub] spark pull request #21421: Myfeature

2018-05-24 Thread gentlewangyu
GitHub user gentlewangyu opened a pull request:

https://github.com/apache/spark/pull/21421

Myfeature

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gentlewangyu/spark myfeature

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21421.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21421


commit 9949fed1c45865b6e5e8ebe610789c5fb9546052
Author: Corey Woodfield 
Date:   2017-07-19T22:21:38Z

[SPARK-21333][DOCS] Removed invalid joinTypes from javadoc of Dataset#joinWith

## What changes were proposed in this pull request?

Two invalid join types were mistakenly listed in the javadoc for joinWith, in the Dataset class. I presume these were copied from the javadoc of join, but since joinWith returns a Dataset<Tuple2<T, U>>, left_semi and left_anti are invalid, as they only return values from one of the datasets, instead of from both.

## How was this patch tested?

I ran the following code :
```
public static void main(String[] args) {
    SparkSession spark = new SparkSession(new SparkContext("local[*]", "Test"));
    Dataset<Row> one = spark.createDataFrame(Arrays.asList(new Bean(1), new Bean(2), new Bean(3), new Bean(4), new Bean(5)), Bean.class);
    Dataset<Row> two = spark.createDataFrame(Arrays.asList(new Bean(4), new Bean(5), new Bean(6), new Bean(7), new Bean(8), new Bean(9)), Bean.class);

    try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "inner").show();} catch (Exception e) {e.printStackTrace();}
    try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "cross").show();} catch (Exception e) {e.printStackTrace();}
    try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "outer").show();} catch (Exception e) {e.printStackTrace();}
    try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full").show();} catch (Exception e) {e.printStackTrace();}
    try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full_outer").show();} catch (Exception e) {e.printStackTrace();}
    try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left").show();} catch (Exception e) {e.printStackTrace();}
    try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_outer").show();} catch (Exception e) {e.printStackTrace();}
    try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right").show();} catch (Exception e) {e.printStackTrace();}
    try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right_outer").show();} catch (Exception e) {e.printStackTrace();}
    try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_semi").show();} catch (Exception e) {e.printStackTrace();}
    try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_anti").show();} catch (Exception e) {e.printStackTrace();}
}
```
which tests all the different join types, and the last two (left_semi and 
left_anti) threw exceptions. The same code using join instead of joinWith did 
fine. The Bean class was just a java bean with a single int field, x.

Author: Corey Woodfield 

Closes #18462 from coreywoodfield/master.

(cherry picked from commit 8cd9cdf17a7a4ad6f2eecd7c4b388ca363c20982)
Signed-off-by: gatorsmile 

commit 88dccda393bc79dc6032f71b6acf8eb2b4b152be
Author: Dhruve Ashar 
Date:   2017-07-21T19:03:46Z

[SPARK-21243][CORE] Limit no. of map outputs in a shuffle fetch

For configurations with external shuffle enabled, we have observed that if 
a very large no. of blocks are being fetched from a remote host, it puts the NM 
under extra pressure and can crash it. This change introduces a configuration 
`spark.reducer.maxBlocksInFlightPerAddress` , to limit the no. of map outputs 
being fetched from a given remote address. The changes applied here are 
applicable for both the scenarios - when external shuffle is enabled as well as 
disabled.

Ran the job with the default configuration which does not change the 
existing behavior and ran it with few configurations of lower values 
-10,20,50,100. The job ran fine and there is no change in the output. (I will 
update the metrics related to NM in some time.)

Author: Dhruve Ashar 

Closes #18487 from dhruve/impr/SPARK-21243.

Author: Dhruve Ashar 

Closes #18691 from dhruve
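
The capping idea described in the commit message can be sketched as follows. This is a simplified illustration with made-up names, not Spark's actual ShuffleBlockFetcherIterator logic: pending block requests are grouped by remote address, and each group is split into batches no larger than the configured limit, in the spirit of `spark.reducer.maxBlocksInFlightPerAddress`.

```scala
// Hypothetical stand-in for a shuffle block fetch request.
case class BlockRequest(address: String, blockId: String)

// Split the pending requests for each address into batches no larger than
// maxBlocksPerAddress, so no single remote host is asked for too many
// blocks at once.
def batchByAddress(
    requests: Seq[BlockRequest],
    maxBlocksPerAddress: Int): Seq[Seq[BlockRequest]] =
  requests
    .groupBy(_.address)
    .values
    .flatMap(_.grouped(maxBlocksPerAddress))
    .toSeq

val reqs = (1 to 5).map(i => BlockRequest("host-a", s"block-$i")) ++
  (1 to 2).map(i => BlockRequest("host-b", s"block-$i"))
val batches = batchByAddress(reqs, maxBlocksPerAddress = 2)

// host-a's five blocks split into 3 batches (2 + 2 + 1); host-b's into 1.
assert(batches.count(_.head.address == "host-a") == 3)
assert(batches.count(_.head.address == "host-b") == 1)
assert(batches.forall(_.size <= 2))
```

With the default (effectively unlimited) setting all of an address's blocks land in one batch, reproducing the old behavior; lowering the limit only changes how the requests are chunked, not which blocks are fetched.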

[GitHub] spark issue #21391: [SPARK-24343][SQL] Avoid shuffle for the bucketed table ...

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21391
  
**[Test build #91093 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91093/testReport)**
 for PR 21391 at commit 
[`d9a440a`](https://github.com/apache/spark/commit/d9a440a9814913827fcfcff644c741a43332b02d).


---




[GitHub] spark issue #21391: [SPARK-24343][SQL] Avoid shuffle for the bucketed table ...

2018-05-24 Thread yucai
Github user yucai commented on the issue:

https://github.com/apache/spark/pull/21391
  
retest this please


---




[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-24 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r190491050
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
 ---
@@ -879,6 +879,15 @@ class ParquetQuerySuite extends QueryTest with 
ParquetTest with SharedSQLContext
   }
 }
   }
+
+  test("select function over nested data") {
--- End diff --

I did some forensics to understand the origin and reason for this test. It was part of the first commit of my original PR almost two years ago. Even then, this test passed without the rest of the commit, so I can't say why I added it except perhaps that I felt it was useful code coverage.

In any case, if you don't think it's a valuable contribution to this PR, I'd rather just remove it entirely.


---




[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21416
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3542/
Test PASSed.


---




[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21416
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19560: [SPARK-22334][SQL] Check table size from filesystem in c...

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19560
  
**[Test build #91092 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91092/testReport)**
 for PR 19560 at commit 
[`78b34bd`](https://github.com/apache/spark/commit/78b34bd7b79550b23730e1c9cdf06620e52b66f2).


---




[GitHub] spark pull request #21419: Branch 2.2

2018-05-24 Thread gentlewangyu
Github user gentlewangyu closed the pull request at:

https://github.com/apache/spark/pull/21419


---




[GitHub] spark issue #21419: Branch 2.2

2018-05-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21419
  
@gentlewangyu, please close this and read 
https://spark.apache.org/contributing.html. Questions should go to mailing list 
and issues should be filed in JIRA.


---




[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21420
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3541/
Test PASSed.


---




[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21420
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21416
  
**[Test build #91091 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91091/testReport)**
 for PR 21416 at commit 
[`de3296c`](https://github.com/apache/spark/commit/de3296c116b07d6560b1f89ca97c2014d36d10a8).


---




[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-24 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r190486220
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1256,8 +1256,18 @@ object SQLConf {
 "issues. Turn on this config to insert a local sort before 
actually doing repartition " +
 "to generate consistent repartition results. The performance of 
repartition() may go " +
 "down since we insert extra local sort before it.")
+.booleanConf
+.createWithDefault(true)
+
+  val NESTED_SCHEMA_PRUNING_ENABLED =
+buildConf("spark.sql.nestedSchemaPruning.enabled")
+  .internal()
+  .doc("Prune nested fields from a logical relation's output which are 
unnecessary in " +
+"satisfying a query. This optimization allows columnar file format 
readers to avoid " +
+"reading unnecessary nested column data. Currently Parquet is the 
only data source that " +
--- End diff --

ORC should be able to support this capability as well, but this PR does not 
address that.


---




[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...

2018-05-24 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/21416
  
retest this please


---




[GitHub] spark pull request #21069: [SPARK-23920][SQL]add array_remove to remove all ...

2018-05-24 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21069#discussion_r190485891
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -1882,3 +1882,123 @@ case class ArrayRepeat(left: Expression, right: 
Expression)
   }
 
 }
+
+/**
+ * Remove all elements equal to the given element from the array
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(array, element) - Remove all elements equal to the given element from the array.",
+  examples = """
+Examples:
+  > SELECT _FUNC_(array(1, 2, 3, null, 3), 3);
+   [1,2,null]
+  """, since = "2.4.0")
+case class ArrayRemove(left: Expression, right: Expression)
+  extends BinaryExpression with ImplicitCastInputTypes {
+
+  override def dataType: DataType = left.dataType
+
+  override def inputTypes: Seq[AbstractDataType] =
+Seq(ArrayType, left.dataType.asInstanceOf[ArrayType].elementType)
+
+  lazy val elementType: DataType = 
left.dataType.asInstanceOf[ArrayType].elementType
+
+  @transient private lazy val ordering: Ordering[Any] =
+TypeUtils.getInterpretedOrdering(right.dataType)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+if (!left.dataType.isInstanceOf[ArrayType]
+  || left.dataType.asInstanceOf[ArrayType].elementType != 
right.dataType) {
+  TypeCheckResult.TypeCheckFailure(
"Arguments must be an array followed by a value of the same type as the array members")
+} else {
+  TypeUtils.checkForOrderingExpr(right.dataType, s"function 
$prettyName")
+}
+  }
+
+  override def nullSafeEval(arr: Any, value: Any): Any = {
+val newArray = new 
Array[Any](arr.asInstanceOf[ArrayData].numElements())
+var pos = 0
+arr.asInstanceOf[ArrayData].foreach(right.dataType, (i, v) =>
+  if (v == null) {
+if (value != null) {
--- End diff --

nit: Do we need this check since we are in `nullSafeEval`?
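
As an aside, the null semantics from the usage example above can be sketched in plain Python (a toy stand-in, not Spark code). It also illustrates the point of the nit: once the `nullSafeEval` contract guarantees both top-level arguments are non-null, an inner null check on `value` is redundant.

```python
# Toy Python stand-in for the SQL array_remove semantics shown above:
# drop every element equal to `value`, but keep nulls (None), since a
# NULL element never compares equal to a concrete value.
def array_remove(arr, value):
    # Mirrors the nullSafeEval contract: the caller guarantees that
    # `arr` and `value` themselves are non-null, so no extra check on
    # `value` is needed here.
    return [v for v in arr if v is None or v != value]

print(array_remove([1, 2, 3, None, 3], 3))  # [1, 2, None]
```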


---




[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-24 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r190485768
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -286,7 +286,19 @@ case class FileSourceScanExec(
   } getOrElse {
 metadata
   }
-withOptPartitionCount
+val withOptColumnCount = relation.fileFormat match {
+  case columnar: ColumnarFileFormat =>
+SparkSession
+  .getActiveSession
+  .map { sparkSession =>
+val columnCount = columnar.columnCountForSchema(sparkSession, 
requiredSchema)
+withOptPartitionCount + ("ColumnCount" -> columnCount.toString)
--- End diff --

Replied above.


---




[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isinSet in DataFrame API for S...

2018-05-24 Thread dbtsai
Github user dbtsai commented on the issue:

https://github.com/apache/spark/pull/21416
  
test this again


---




[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-24 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r190485713
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ColumnarFileFormat.scala
 ---
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.types.StructType
+
+/**
+ * An optional mix-in for columnar [[FileFormat]]s. This trait provides 
some helpful metadata when
+ * debugging a physical query plan.
+ */
+private[sql] trait ColumnarFileFormat {
--- End diff --

I'm okay with adding this as a separate PR as requested, but I wouldn't 
want to ship this PR in a release without being able to see the physical column 
count associated with a query in a query plan. That is invaluable functionality 
in validating that physical column pruning is occurring as expected.


---




[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21370#discussion_r190485347
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3040,6 +3040,50 @@ def test_csv_sampling_ratio(self):
 .csv(rdd, samplingRatio=0.5).schema
 self.assertEquals(schema, StructType([StructField("_c0", 
IntegerType(), True)]))
 
+def _get_content(self, content):
+"""
+Strips leading spaces from content up to the first '|' in each 
line.
+"""
+import re
+pattern = re.compile(r'^ *\|', re.MULTILINE)
--- End diff --

We don't have to compile the pattern each time here since it's not going to be reused. You could just pass it directly to `re.sub`, I believe.
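
A minimal standalone sketch of the suggestion (the sample text below is illustrative, not the test suite's actual values):

```python
import re

# Sample text in the same shape the helper handles: leading spaces up
# to the first '|' on each line.
text = "    |line one\n        |line two"

# re.sub accepts the pattern string directly and caches compiled
# patterns internally, so an explicit re.compile buys nothing for a
# one-off use.
stripped = re.sub(r'^ *\|', '', text, flags=re.MULTILINE)
print(stripped)  # line one
                 # line two
```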


---




[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...

2018-05-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21420
  
**[Test build #91090 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91090/testReport)**
 for PR 21420 at commit 
[`a41c99b`](https://github.com/apache/spark/commit/a41c99bf311aa8f4e0c2e07c1288f5a11e057ea4).


---




[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-05-24 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r190484386
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala
 ---
@@ -99,27 +100,28 @@ trait ConstraintHelper {
   }
 
   /**
-   * Infer the Attribute-specific IsNotNull constraints from the null 
intolerant child expressions
-   * of constraints.
+   * Infer the Attribute and ExtractValue-specific IsNotNull constraints 
from the null intolerant
+   * child expressions of constraints.
*/
   private def inferIsNotNullConstraints(constraint: Expression): 
Seq[Expression] =
 constraint match {
   // When the root is IsNotNull, we can push IsNotNull through the 
child null intolerant
   // expressions
-  case IsNotNull(expr) => 
scanNullIntolerantAttribute(expr).map(IsNotNull(_))
+  case IsNotNull(expr) => 
scanNullIntolerantField(expr).map(IsNotNull(_))
   // Constraints always return true for all the inputs. That means, 
null will never be returned.
   // Thus, we can infer `IsNotNull(constraint)`, and also push 
IsNotNull through the child
   // null intolerant expressions.
-  case _ => scanNullIntolerantAttribute(constraint).map(IsNotNull(_))
+  case _ => scanNullIntolerantField(constraint).map(IsNotNull(_))
 }
 
   /**
-   * Recursively explores the expressions which are null intolerant and 
returns all attributes
-   * in these expressions.
+   * Recursively explores the expressions which are null intolerant and 
returns all attributes and
+   * complex type extractors in these expressions.
*/
-  private def scanNullIntolerantAttribute(expr: Expression): 
Seq[Attribute] = expr match {
+  private def scanNullIntolerantField(expr: Expression): Seq[Expression] = 
expr match {
+case ev: ExtractValue => Seq(ev)
--- End diff --

I agree that adding more direct and independent test coverage for this change is a good idea. However, omitting this change would weaken the capabilities of this PR. It would also imply removing the failing test case in `ParquetSchemaPruningSuite`, which in turn would imply two follow-on PRs: the first to add this specific change plus the right test coverage, and the second to restore the test case removed from `ParquetSchemaPruningSuite`.

Let me suggest an alternative. As this change is a valuable enhancement for this PR, let me try adding an appropriate test case in `InferFiltersFromConstraintsSuite` as part of this PR. That will eliminate the need for the two follow-on PRs.
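
For readers following along, the recursive scan in the diff above can be modeled in a few lines of Python (the class names below are illustrative stand-ins, not Spark's actual Catalyst classes):

```python
# Toy model of scanNullIntolerantField: walk an expression tree through
# null-intolerant nodes, collecting attributes and field extractors.
class Attribute:
    def __init__(self, name):
        self.name = name

class GetField:
    # Stand-in for ExtractValue, e.g. person.name
    def __init__(self, child, field):
        self.child, self.field = child, field

class Add:
    # A null-intolerant binary expression: null in => null out
    def __init__(self, left, right):
        self.left, self.right = left, right

def scan_null_intolerant_field(expr):
    if isinstance(expr, GetField):
        return [expr]        # stop at the extractor, as the diff does
    if isinstance(expr, Attribute):
        return [expr]
    if isinstance(expr, Add):
        # Recurse only through null-intolerant children
        return (scan_null_intolerant_field(expr.left)
                + scan_null_intolerant_field(expr.right))
    return []

# a + person.name being non-null implies IsNotNull(a) and
# IsNotNull(person.name) can both be inferred.
found = scan_null_intolerant_field(Add(Attribute("a"),
                                       GetField(Attribute("person"), "name")))
print([type(e).__name__ for e in found])  # ['Attribute', 'GetField']
```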


---




[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...

2018-05-24 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/21420
  
CC @HyukjinKwon @vanzin please help to review, thanks!


---




[GitHub] spark pull request #21420: [SPARK-24377][Spark Submit] make --py-files work ...

2018-05-24 Thread jerryshao
GitHub user jerryshao opened a pull request:

https://github.com/apache/spark/pull/21420

[SPARK-24377][Spark Submit] make --py-files work in non pyspark application

## What changes were proposed in this pull request?

For some Spark applications, even though they are Java programs, they require not only jar dependencies but also Python dependencies. One example is the Livy remote SparkContext application: it is actually an embedded REPL for Scala/Python/R, so it loads not only jar dependencies but also Python and R deps, and therefore needs to specify not only "--jars" but also "--py-files".

Currently --py-files only works for a PySpark application, so it does not work in the above case. This PR proposes to remove that restriction.

We also found that "spark.submit.pyFiles" supports only a quite limited scenario (client mode with local deps), so this PR also expands "spark.submit.pyFiles" to serve as an alternative to --py-files.
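
A hypothetical invocation under the proposed behavior; the class, jar, and zip names below are placeholders, not part of this patch:

```shell
# A Java/Scala main class that also ships Python dependencies to the
# cluster. Today this is rejected because --py-files only applies to
# PySpark applications; with this change it would be accepted.
spark-submit \
  --class com.example.EmbeddedRepl \
  --master yarn \
  --deploy-mode cluster \
  --jars repl-deps.jar \
  --py-files python-deps.zip \
  repl-main.jar
```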

## How was this patch tested?

UT added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jerryshao/apache-spark SPARK-24377

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21420.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21420


commit a41c99bf311aa8f4e0c2e07c1288f5a11e057ea4
Author: jerryshao 
Date:   2018-05-24T06:53:23Z

make --py-files work in non pyspark application




---




[GitHub] spark issue #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase w...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21295
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase w...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21295
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91086/
Test FAILed.


---




[GitHub] spark issue #21069: [SPARK-23920][SQL]add array_remove to remove all element...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21069
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21069: [SPARK-23920][SQL]add array_remove to remove all element...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21069
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91081/
Test FAILed.


---




[GitHub] spark issue #21391: [SPARK-24343][SQL] Avoid shuffle for the bucketed table ...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21391
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21391: [SPARK-24343][SQL] Avoid shuffle for the bucketed table ...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21391
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91088/
Test FAILed.


---




[GitHub] spark issue #21399: [SPARK-22269][BUILD] Run Java linter via SBT for Jenkins

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21399
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21390: [SPARK-24340][Core] Clean up non-shuffle disk block mana...

2018-05-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21390
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91083/
Test FAILed.


---



