[GitHub] spark pull request #22133: [SPARK-25129][SQL]Make the mapping of com.databri...

2018-08-20 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22133#discussion_r211485248
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -626,6 +626,7 @@ object DataSource extends Logging {
   serviceLoader.asScala.filter(_.shortName().equalsIgnoreCase(provider1)).toList match {
     // the provider format did not match any given registered aliases
     case Nil =>
+      val latestDocsURL = "https://spark.apache.org/docs/latest"
--- End diff --

The doc will look like this:

https://github.com/apache/spark/pull/22121/files#diff-acdddc6cbd45ccd226bf151564b9cc40R11

It is about loading the module with `--packages`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22158: [SPARK-25161][Core] Fix several bugs in failure handling...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22158
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2357/
Test PASSed.


---




[GitHub] spark issue #22158: [SPARK-25161][Core] Fix several bugs in failure handling...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22158
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22148: [SPARK-25132][SQL] Case-insensitive field resolution whe...

2018-08-20 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22148
  
Thanks!


---




[GitHub] spark issue #22158: [SPARK-25161][Core] Fix several bugs in failure handling...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22158
  
**[Test build #94998 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94998/testReport)**
 for PR 22158 at commit 
[`32ea946`](https://github.com/apache/spark/commit/32ea946c68c5f3108fb18f7e936ba440f7537144).


---




[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22153
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22153
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2356/
Test PASSed.


---




[GitHub] spark issue #22158: [SPARK-25161][Core] Fix several bugs in failure handling...

2018-08-20 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/22158
  
retest this please


---




[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22153
  
**[Test build #94997 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94997/testReport)**
 for PR 22153 at commit 
[`e237e39`](https://github.com/apache/spark/commit/e237e3944fa5839e5fa17b07af7901ac56655a4b).


---




[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-20 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22112
  
> always return the same result with same order when rerun..

Maybe the word "idempotent" is not that accurate. Spark doesn't really care 
about the order, so the requirement is that, for the same input data set, it 
should return the same output set.

As an example, `iter1.zip(iter2)` will be treated as invalid unless we 
sort before zip.
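
To make the `zip` point concrete, here is a small plain-Python sketch (no Spark involved). The lists `run1` and `run2` are made-up stand-ins for the same records arriving in a different order on a rerun; they are assumptions for illustration only.

```python
# Hypothetical stand-ins: the same set of records, arriving in a different
# order on a rerun (for example after a retried fetch).
run1 = [3, 1, 2]
run2 = [2, 3, 1]
labels = ["a", "b", "c"]

# zip pairs elements by position, so the same input set yields different pairs.
pairs_run1 = set(zip(run1, labels))
pairs_run2 = set(zip(run2, labels))
order_sensitive = pairs_run1 != pairs_run2

# Sorting before zip removes the dependence on arrival order.
sorted_run1 = set(zip(sorted(run1), labels))
sorted_run2 = set(zip(sorted(run2), labels))
order_insensitive = sorted_run1 == sorted_run2
```

Both inputs contain the same elements, yet the unsorted zips disagree while the sorted ones match, which is the sense in which zip needs a sort to be safe on rerun.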


---




[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-08-20 Thread cclauss
Github user cclauss commented on the issue:

https://github.com/apache/spark/pull/20838
  
Thanks massively for this.  I doubt that I _ever_ would have gotten to that 
on my own.  This is a test so my proposal would be that _you create a separate 
PR_ so that we are all assured that it passes in the current codebase.  Once 
that PR has been merged, I can come back and finish this PR.  Thanks again.


---




[GitHub] spark pull request #22133: [SPARK-25129][SQL]Make the mapping of com.databri...

2018-08-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22133#discussion_r211482746
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -626,6 +626,7 @@ object DataSource extends Logging {
   serviceLoader.asScala.filter(_.shortName().equalsIgnoreCase(provider1)).toList match {
     // the provider format did not match any given registered aliases
     case Nil =>
+      val latestDocsURL = "https://spark.apache.org/docs/latest"
--- End diff --

I mean, if we happen to have Spark 3.0.0 then this link will be stale in 
2.4.0.. no?
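
One way to picture the concern, as a rough Python sketch rather than anything in this PR: the helper `docs_url` and its `spark_version` parameter are hypothetical. A version-pinned URL keeps pointing at the docs for the release that produced the message, while `latest` moves with each new release.

```python
def docs_url(spark_version=None):
    """Return a docs URL; `spark_version` is a hypothetical parameter."""
    base = "https://spark.apache.org/docs"
    # Pin to the running version when known, otherwise fall back to "latest".
    return f"{base}/{spark_version}" if spark_version else f"{base}/latest"

pinned = docs_url("2.4.0")   # stays valid for 2.4.0 even after 3.0.0 ships
floating = docs_url()        # follows whatever release is newest
```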


---




[GitHub] spark issue #22133: [SPARK-25129][SQL]Make the mapping of com.databricks.spa...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22133
  
**[Test build #94996 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94996/testReport)**
 for PR 22133 at commit 
[`e57b232`](https://github.com/apache/spark/commit/e57b232ec8e36ea107ce103e5cdb6efaa0756c40).


---




[GitHub] spark issue #22133: [SPARK-25129][SQL]Make the mapping of com.databricks.spa...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22133
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22133: [SPARK-25129][SQL]Make the mapping of com.databricks.spa...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22133
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2355/
Test PASSed.


---




[GitHub] spark issue #22165: [SPARK-25017][Core] Add test suite for BarrierCoordinato...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22165
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2354/
Test PASSed.


---




[GitHub] spark pull request #22133: [SPARK-25129][SQL]Make the mapping of com.databri...

2018-08-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22133#discussion_r211482547
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -593,7 +592,6 @@ object DataSource extends Logging {
   "org.apache.spark.ml.source.libsvm.DefaultSource" -> libsvm,
   "org.apache.spark.ml.source.libsvm" -> libsvm,
   "com.databricks.spark.csv" -> csv,
-  "com.databricks.spark.avro" -> avro,
--- End diff --

Ah okie makes sense if there's a reason.


---




[GitHub] spark issue #22165: [SPARK-25017][Core] Add test suite for BarrierCoordinato...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22165
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #22133: [SPARK-25129][SQL]Make the mapping of com.databri...

2018-08-20 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22133#discussion_r211482461
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -626,6 +626,7 @@ object DataSource extends Logging {
   serviceLoader.asScala.filter(_.shortName().equalsIgnoreCase(provider1)).toList match {
     // the provider format did not match any given registered aliases
     case Nil =>
+      val latestDocsURL = "https://spark.apache.org/docs/latest"
--- End diff --

This is the link for the latest doc. I think it should be ok.


---




[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22153
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94995/
Test FAILed.


---




[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22153
  
**[Test build #94995 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94995/testReport)**
 for PR 22153 at commit 
[`732bc5f`](https://github.com/apache/spark/commit/732bc5f5d049c93f40d01926ac1efe8495e27b58).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22153
  
Merged build finished. Test FAILed.


---




[GitHub] spark pull request #22133: [SPARK-25129][SQL]Make the mapping of com.databri...

2018-08-20 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22133#discussion_r211482149
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -593,7 +592,6 @@ object DataSource extends Logging {
   "org.apache.spark.ml.source.libsvm.DefaultSource" -> libsvm,
   "org.apache.spark.ml.source.libsvm" -> libsvm,
   "com.databricks.spark.csv" -> csv,
-  "com.databricks.spark.avro" -> avro,
--- End diff --

@HyukjinKwon I did add it to the `backwardCompatibilityMap` at first. But 
later on I found that the configuration wouldn't take effect at runtime, since 
the `backwardCompatibilityMap` is a `val`. (We could change 
`backwardCompatibilityMap` to a method to resolve that.) Also, the code looks ugly:
```
val ret = Map(...)
if (...) {
  // parentheses are needed: `ret + k -> v` parses as `(ret + k) -> v`
  ret + (k -> v)
} else {
  ret
}
// it would be worse if we have more configurations.
```
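
A rough Python analogue of the `val` problem, offered only as a sketch: the names `conf`, `BASE_MAP`, `build_map`, and the config key are hypothetical and do not correspond to Spark's actual code. A mapping built once does not see later configuration changes, while a function re-reads the setting on every call.

```python
# Hypothetical runtime configuration; the key name is made up for illustration.
conf = {"legacy.avro.enabled": False}

BASE_MAP = {"com.databricks.spark.csv": "csv"}

def build_map():
    """Build the compatibility map from the current configuration."""
    if conf["legacy.avro.enabled"]:
        # Dict unpacking leaves BASE_MAP untouched and adds the optional entry.
        return {**BASE_MAP, "com.databricks.spark.avro": "avro"}
    return dict(BASE_MAP)

# Evaluated once, like a Scala `val`: frozen at definition time.
STATIC_MAP = build_map()

# The setting changes at runtime.
conf["legacy.avro.enabled"] = True

# Re-evaluated on each call, like a Scala `def`: sees the new setting.
dynamic_map = build_map()
```

The trade-off is that the function rebuilds the map on every call, which is cheap for a handful of entries but is still extra branching per configuration flag.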


---




[GitHub] spark issue #22165: [SPARK-25017][Core] Add test suite for BarrierCoordinato...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22165
  
**[Test build #94994 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94994/testReport)**
 for PR 22165 at commit 
[`21bd1c3`](https://github.com/apache/spark/commit/21bd1c37f4af6480adfc07130a15f70acdeda378).


---




[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22153
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2353/
Test PASSed.


---




[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22153
  
**[Test build #94995 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94995/testReport)**
 for PR 22153 at commit 
[`732bc5f`](https://github.com/apache/spark/commit/732bc5f5d049c93f40d01926ac1efe8495e27b58).


---




[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22153
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #22165: [SPARK-25017][Core] Add test suite for BarrierCoo...

2018-08-20 Thread xuanyuanking
GitHub user xuanyuanking opened a pull request:

https://github.com/apache/spark/pull/22165

[SPARK-25017][Core] Add test suite for BarrierCoordinator and 
ContextBarrierState

## What changes were proposed in this pull request?

Currently `ContextBarrierState` and `BarrierCoordinator` are only covered 
by the end-to-end test in `BarrierTaskContextSuite`. This PR adds 
`BarrierCoordinatorSuite` to test both classes.

## How was this patch tested?

Unit tests in `BarrierCoordinatorSuite`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xuanyuanking/spark SPARK-25017

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22165.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22165


commit 21bd1c37f4af6480adfc07130a15f70acdeda378
Author: liyuanjian 
Date:   2018-08-21T05:24:07Z

[SPARK-25017][Core] Add test suite for BarrierCoordinator and 
ContextBarrierState




---




[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-08-20 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/20838
  
Hi @cclauss , sorry for the frustration.  I looked into the test, and it 
was kind of a pain to get it working right - which is probably why it wasn't 
done in the first place ;)

Here are my modifications for `test_slice` and it seems to pass py3 fine

```python
def test_slice(self):
    """Basic operation test for DStream.slice."""
    import datetime as dt
    self.ssc = StreamingContext(self.sc, 1.0)
    self.ssc.remember(4.0)
    input = [[1], [2], [3], [4]]
    stream = self.ssc.queueStream([self.sc.parallelize(d, 1) for d in input])

    time_vals = []

    def get_times(t, rdd):
        if rdd and len(time_vals) < len(input):
            time_vals.append(t)

    stream.foreachRDD(get_times)

    self.ssc.start()
    self.wait_for(time_vals, 4)
    begin_time = time_vals[0]

    def get_sliced(begin_delta, end_delta):
        begin = begin_time + dt.timedelta(seconds=begin_delta)
        end = begin_time + dt.timedelta(seconds=end_delta)
        rdds = stream.slice(begin, end)
        result_list = [rdd.collect() for rdd in rdds]
        return [r for result in result_list for r in result]

    self.assertEqual(set([1]), set(get_sliced(0, 0)))
    self.assertEqual(set([2, 3]), set(get_sliced(1, 2)))
    self.assertEqual(set([2, 3, 4]), set(get_sliced(1, 4)))
    self.assertEqual(set([1, 2, 3, 4]), set(get_sliced(0, 4)))
```

If you want to put that in, I have some time now and can help you get this 
merged or if you prefer I can finish it up and still assign to you.



---




[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22153
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94986/
Test FAILed.


---




[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22153
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22153
  
**[Test build #94986 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94986/testReport)**
 for PR 22153 at commit 
[`e0c048e`](https://github.com/apache/spark/commit/e0c048e34635d60e5d7eeb391ea2046727e2fd35).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22161: [SPARK-25167][SPARKR][TEST][MINOR] Minor fixes for R sql...

2018-08-20 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22161
  
@HyukjinKwon Done. 
[SPARK-25167](https://issues.apache.org/jira/browse/SPARK-25167)


---




[GitHub] spark issue #22164: [SPARK-23679][YARN] Fix AmIpFilter cannot work in RM HA ...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22164
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94991/
Test PASSed.


---




[GitHub] spark issue #22164: [SPARK-23679][YARN] Fix AmIpFilter cannot work in RM HA ...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22164
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22164: [SPARK-23679][YARN] Fix AmIpFilter cannot work in RM HA ...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22164
  
**[Test build #94991 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94991/testReport)**
 for PR 22164 at commit 
[`da33554`](https://github.com/apache/spark/commit/da33554cc38d4b41e86dcb6e2c833f5b29c35ad8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22163
  
**[Test build #94993 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94993/testReport)**
 for PR 22163 at commit 
[`bcef61e`](https://github.com/apache/spark/commit/bcef61e3c1e65e797c8044b674c5ae99c89ce222).


---




[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22163
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2352/
Test PASSed.


---




[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22163
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exceptions...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22154
  
**[Test build #94992 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94992/testReport)**
 for PR 22154 at commit 
[`5f0ff13`](https://github.com/apache/spark/commit/5f0ff13cc5c30d99fa77551fd617783c29e4864b).


---




[GitHub] spark issue #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exceptions...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22154
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-20 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/22112
  
If "always return the same result with same order when rerun." is the 
definition of "idempotent", then yes, MLlib RDD closures always return the 
same result if the input doesn't change. We use pseudo-randomness to achieve 
deterministic behavior.
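
A minimal sketch of the idea in plain Python, not MLlib's actual code: seeding a pseudo-random generator from something stable makes a "random" closure return the same output on every rerun. The function `sample_partition` and its seed-plus-partition-index scheme are assumptions for illustration.

```python
import random

def sample_partition(records, partition_index, seed=42, fraction=0.5):
    """Sample records with a generator seeded per partition.

    The seed + partition_index scheme is an assumption for illustration.
    """
    rng = random.Random(seed + partition_index)
    # One draw per record, in iteration order.
    return [r for r in records if rng.random() < fraction]

records = list(range(20))
first_run = sample_partition(records, partition_index=3)
rerun = sample_partition(records, partition_index=3)
```

Note that the result is reproducible only if `records` arrives in the same order, because each record consumes the next draw from the generator.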


---




[GitHub] spark issue #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exceptions...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22154
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2351/
Test PASSed.


---




[GitHub] spark issue #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exceptions...

2018-08-20 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22154
  
retest this please.


---




[GitHub] spark issue #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exceptions...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22154
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exceptions...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22154
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94985/
Test FAILed.


---




[GitHub] spark issue #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exceptions...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22154
  
**[Test build #94985 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94985/testReport)**
 for PR 22154 at commit 
[`5f0ff13`](https://github.com/apache/spark/commit/5f0ff13cc5c30d99fa77551fd617783c29e4864b).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22164: [SPARK-23679][YARN] Fix AmIpFilter cannot work in RM HA ...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22164
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2350/
Test PASSed.


---




[GitHub] spark issue #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exceptions...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22154
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exceptions...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22154
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94983/
Test PASSed.


---




[GitHub] spark issue #22164: [SPARK-23679][YARN] Fix AmIpFilter cannot work in RM HA ...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22164
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22154: [SPARK-23711][SPARK-25140][SQL] Catch correct exceptions...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22154
  
**[Test build #94983 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94983/testReport)**
 for PR 22154 at commit 
[`e3e86c6`](https://github.com/apache/spark/commit/e3e86c645d5c75c1c490881564ec7ea4f909d2ee).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22164: [SPARK-23679][YARN] Fix AmIpFilter cannot work in RM HA ...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22164
  
**[Test build #94991 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94991/testReport)**
 for PR 22164 at commit 
[`da33554`](https://github.com/apache/spark/commit/da33554cc38d4b41e86dcb6e2c833f5b29c35ad8).


---




[GitHub] spark pull request #22164: [SPARK-23679][YARN] Fix AmIpFilter cannot work in...

2018-08-20 Thread jerryshao
GitHub user jerryshao opened a pull request:

https://github.com/apache/spark/pull/22164

[SPARK-23679][YARN] Fix AmIpFilter cannot work in RM HA scenario

## What changes were proposed in this pull request?

YARN's `AmIpFilter` added a new parameter, "RM_HA_URLS", to support RM HA, but 
Spark on YARN does not provide this parameter, so redirection fails when 
running with RM HA. The detailed exception is in the JIRA. This PR fixes the 
issue by adding the "RM_HA_URLS" parameter.
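
As a rough illustration of the idea (not the code in this PR), the filter parameters could be assembled like this. Only "RM_HA_URLS" comes from the description above; the other two key names, the helper `build`, and its arguments are assumptions made for the sketch.

```scala
object AmIpFilterParamsSketch {
  // "RM_HA_URLS" is the key named in the PR description.
  val RmHaUrls = "RM_HA_URLS"
  // Assumed key names for the pre-existing filter parameters.
  val ProxyHosts = "PROXY_HOSTS"
  val ProxyUriBases = "PROXY_URI_BASES"

  // Hypothetical helper: builds the parameter map passed to the filter.
  def build(
      proxyHostAndPorts: Seq[String],
      proxyBase: String,
      rmWebAddresses: Seq[String]): Map[String, String] = {
    val hosts = proxyHostAndPorts.map(_.split(":")(0))
    val uriBases = proxyHostAndPorts.map(hp => s"http://$hp$proxyBase")
    val base = Map(
      ProxyHosts -> hosts.mkString(","),
      ProxyUriBases -> uriBases.mkString(","))
    // Only add the HA parameter when more than one RM address is known.
    if (rmWebAddresses.size > 1) base + (RmHaUrls -> rmWebAddresses.mkString(","))
    else base
  }
}
```

With two RM addresses the map gains a comma-separated "RM_HA_URLS" entry; with a single RM it is left out, which matches the non-HA behaviour.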

## How was this patch tested?

Local verification.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jerryshao/apache-spark SPARK-23679

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22164.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22164


commit da33554cc38d4b41e86dcb6e2c833f5b29c35ad8
Author: jerryshao 
Date:   2018-08-20T08:28:13Z

Fix AmIpFilter cannot work in RM HA scenario




---




[GitHub] spark issue #22156: [SPARK-25144][SQL][TEST][BRANCH-2.2] Free aggregate map ...

2018-08-20 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22156
  
Thank you, @HyukjinKwon .


---




[GitHub] spark pull request #22156: [SPARK-25144][SQL][TEST][BRANCH-2.2] Free aggrega...

2018-08-20 Thread dongjoon-hyun
Github user dongjoon-hyun closed the pull request at:

https://github.com/apache/spark/pull/22156


---




[GitHub] spark issue #22155: [SPARK-25144][SQL][TEST] Free aggregate map when task en...

2018-08-20 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22155
  
Thank you!


---




[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-20 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22163
  
The current buffer is `writeBuffer`. I mean copying `writeBuffer` to 
`diskWriteBuffer` or another buffer.


---




[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-20 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22163
  
The current buffer is `writeBuffer`. I mean copying `writeBuffer` to 
`diskWriteBuffer` or another buffer.


---




[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...

2018-08-20 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/21859
  
If this optimization is applied more generally, will the implicitly cached 
data cause memory pressure on the driver, since it seems we have no way to 
release it?


---




[GitHub] spark pull request #20637: [SPARK-23466][SQL] Remove redundant null checks i...

2018-08-20 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20637#discussion_r211468226
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
 ---
@@ -43,25 +45,30 @@ object GenerateUnsafeProjection extends 
CodeGenerator[Seq[Expression], UnsafePro
 case _ => false
   }
 
-  // TODO: if the nullability of field is correct, we can use it to save 
null check.
   private def writeStructToBuffer(
   ctx: CodegenContext,
   input: String,
   index: String,
-  fieldTypes: Seq[DataType],
+  fieldTypeAndNullables: Seq[Schema],
   rowWriter: String): String = {
 // Puts `input` in a local variable to avoid to re-evaluate it if it's 
a statement.
 val tmpInput = ctx.freshName("tmpInput")
-val fieldEvals = fieldTypes.zipWithIndex.map { case (dt, i) =>
-  ExprCode(
-JavaCode.isNullExpression(s"$tmpInput.isNullAt($i)"),
-JavaCode.expression(CodeGenerator.getValue(tmpInput, dt, 
i.toString), dt))
+val fieldEvals = fieldTypeAndNullables.zipWithIndex.map { case 
(dtNullable, i) =>
+  val isNull = if (dtNullable.nullable) {
+JavaCode.isNullExpression(s"$tmpInput.isNullAt($i)")
+  } else {
+FalseLiteral
+  }
+  ExprCode(isNull, JavaCode.expression(
+CodeGenerator.getValue(tmpInput, dtNullable.dataType, i.toString), 
dtNullable.dataType))
 }
 
 val rowWriterClass = classOf[UnsafeRowWriter].getName
 val structRowWriter = ctx.addMutableState(rowWriterClass, "rowWriter",
   v => s"$v = new $rowWriterClass($rowWriter, ${fieldEvals.length});")
 val previousCursor = ctx.freshName("previousCursor")
+val structExpressions = writeExpressionsToBuffer(
+  ctx, tmpInput, fieldEvals, fieldTypeAndNullables.map(_.dataType), 
structRowWriter)
--- End diff --

I see that here, but shouldn't the other call of `writeExpressionsToBuffer`, 
from `createCode`, also pass the nullability to `writeExpressionsToBuffer`? 
There, `exprEvals.isNull` is not always `FalseLiteral` even when an expression 
is non-nullable.
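
To make the concern concrete, here is a small stand-alone sketch (plain string templates, not Spark's codegen API). `writeFieldCode` and the `writer` calls in the generated text are hypothetical names; the point is only that the null branch is dropped based on the schema's nullability rather than on whether `isNull` happens to be a literal `false`.

```scala
object NullCheckSketch {
  // Hypothetical generator for the Java snippet that writes one field.
  // `isNull` and `value` are the names of already-evaluated Java expressions.
  def writeFieldCode(index: Int, isNull: String, value: String, nullable: Boolean): String =
    if (nullable) {
      s"if ($isNull) { writer.setNullAt($index); } else { writer.write($index, $value); }"
    } else {
      // Non-nullable per schema: skip the check even if `isNull` is a variable.
      s"writer.write($index, $value);"
    }
}
```

For a non-nullable field whose `isNull` is a variable such as `isNull_0`, the generated text is just the write call, so the check is saved without relying on `FalseLiteral`.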


---




[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-20 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22163
  
What do you mean by `only one record is written to a buffer each time`? Isn't 
the amount of data written each time controlled by `diskWriteBufferSize`?



---




[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20345
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20345
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94984/
Test FAILed.


---




[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20345
  
**[Test build #94984 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94984/testReport)**
 for PR 20345 at commit 
[`39462fb`](https://github.com/apache/spark/commit/39462fbee952ec574b4c04d7718fd73bb5f56d9d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row

2018-08-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22140
  
cc @BryanCutler as well since we discussed an issue about this code path 
before.


---




[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...

2018-08-20 Thread sddyljsx
Github user sddyljsx commented on the issue:

https://github.com/apache/spark/pull/21859
  
'The ShuffleWriter should treat RangePartitioner specially and consume the 
sampled data in RangePartitioner instead of the input iterator.' This is a good 
idea; maybe we can cache both K and V when sampling.
I will give this idea a try.


---




[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22163
  
**[Test build #94989 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94989/testReport)**
 for PR 22163 at commit 
[`671268b`](https://github.com/apache/spark/commit/671268b679f9221fd96e9ab2ea929df4a9908de8).
 * This patch **fails Java style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22163
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22163
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94989/
Test FAILed.


---




[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21859
  
**[Test build #94990 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94990/testReport)**
 for PR 21859 at commit 
[`6f52f1f`](https://github.com/apache/spark/commit/6f52f1fde3d4df9384e1c99d08b930953843bcde).


---




[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...

2018-08-20 Thread sddyljsx
Github user sddyljsx commented on the issue:

https://github.com/apache/spark/pull/21859
  
I read the source code again.
The RangePartitioner[K, V] in ShuffleExchangeExec is an instance of 
RangePartitioner[InternalRow, Null]. RangePartitioner only samples K to compute 
the rangeBounds, so we can get the InternalRows while sampling.
After creating the RangePartitioner, ShuffleExchangeExec maps each 
InternalRow to [partitionId, InternalRow] for the shuffle (the RangePartitioner 
generates the partitionId).
The shuffle itself does not use the RangePartitioner; it uses 
PartitionIdPassthrough instead.
In other words, the ShuffleWriter is not aware of the RangePartitioner.

```
val rddWithPartitionIds: RDD[Product2[Int, InternalRow]] =
  newRdd.mapPartitionsInternal { iter =>
    val getPartitionKey = getPartitionKeyExtractor()
    val mutablePair = new MutablePair[Int, InternalRow]()
    iter.map { row =>
      mutablePair.update(part.getPartition(getPartitionKey(row)), row)
    }
  }

val dependency =
  new ShuffleDependency[Int, InternalRow, InternalRow](
    rddWithPartitionIds,
    new PartitionIdPassthrough(part.numPartitions),
    serializer)

private class PartitionIdPassthrough(override val numPartitions: Int)
  extends Partitioner {
  override def getPartition(key: Any): Int = key.asInstanceOf[Int]
}
```

The optimization parallelizes the cached InternalRows into newRdd instead of 
computing them again.

But in other places, such as RDD's sortByKey:

```
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope {
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}

// getDependencies function in ShuffledRDD
override def getDependencies: Seq[Dependency[_]] = {
  val serializer = userSpecifiedSerializer.getOrElse {
    val serializerManager = SparkEnv.get.serializerManager
    if (mapSideCombine) {
      serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[C]])
    } else {
      serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[V]])
    }
  }
  List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))
}
```
Here the RDD is [K, V] and the shuffle uses the RangePartitioner directly. But 
sampling only yields K, so we cannot restore the RDD from the cache.

The two paths work in different ways.

So for now the optimization only works in Spark SQL's ShuffleExchangeExec.

'The ShuffleWriter should treat RangePartitioner specially and consume the 
sampled data in RangePartitioner instead of the input iterator.' This is a good 
idea; maybe we can cache both K and V when sampling. I will give it a try.
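
The first path described in this comment can be mimicked without Spark as follows. This is a simplified sketch: `rangePartition` is a stand-in for what a range partitioner does with its bounds, rows are plain `Int`s, and all names are hypothetical.

```scala
object PartitionIdSketch {
  // Simplified range partitioning: index of the first bound >= key,
  // or the last partition when the key exceeds every bound.
  def rangePartition(bounds: IndexedSeq[Int], key: Int): Int = {
    val i = bounds.indexWhere(key <= _)
    if (i < 0) bounds.length else i
  }

  // SQL-style path: the partition id is computed up front and becomes the key.
  def withPartitionIds(rows: Seq[Int], bounds: IndexedSeq[Int]): Seq[(Int, Int)] =
    rows.map(row => (rangePartition(bounds, row), row))

  // Pass-through partitioner: the shuffle key already is the partition id,
  // so the shuffle never sees the range bounds.
  def passthrough(key: Any): Int = key.asInstanceOf[Int]
}
```

In the `sortByKey` path there is no such pre-computed id: the shuffle calls the range partitioner on each key itself, which is why the two paths cannot share the cached sample as-is.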


---




[GitHub] spark issue #22065: [SPARK-23992][CORE] ShuffleDependency does not need to b...

2018-08-20 Thread 10110346
Github user 10110346 commented on the issue:

https://github.com/apache/spark/pull/22065
  
This is an end-to-end performance improvement, although our data set is very small.


---




[GitHub] spark issue #22157: [SPARK-25126] Avoid creating Reader for all orc files

2018-08-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22157
  
> Do we have a similar issue for Parquet?

It looks like we don't, since we explicitly pick one file before reading during 
schema inference: 


https://github.com/apache/spark/blob/f984ec75ed6162ee6f5881716a8311c883aca22a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L229-L239


---




[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22163
  
**[Test build #94989 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94989/testReport)**
 for PR 22163 at commit 
[`671268b`](https://github.com/apache/spark/commit/671268b679f9221fd96e9ab2ea929df4a9908de8).


---




[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22163
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22163
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2349/
Test PASSed.


---




[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-20 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22112
  
So there are 2 options:

1. Require the RDD closure to be idempotent. I'm not sure if that's OK for MLlib, 
cc @mengxr @WeichenXu123 @yanboliang 

2. Require the output committer to be able to overwrite a committed task. Note 
that the output committer here is the `FileCommitProtocol` interface in Spark, 
not the Hadoop output committer. We don't have to make it work for all Hadoop 
output committers.



---




[GitHub] spark pull request #22163: [SPARK-25166][CORE]Reduce the number of write ope...

2018-08-20 Thread 10110346
GitHub user 10110346 opened a pull request:

https://github.com/apache/spark/pull/22163

[SPARK-25166][CORE]Reduce the number of write operations for shuffle write.

## What changes were proposed in this pull request?

Currently, only one record is written to the buffer at a time, which 
increases the number of copies.
I think we should write as many records as possible each time.
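
A stand-alone sketch of the general idea (not the code in this patch) is shown below. `CountingStream`, `writeOneByOne`, and `writeBatched` are hypothetical names; the sketch packs several records into one buffer so that the underlying stream receives fewer write calls.

```scala
import java.io.{ByteArrayOutputStream, OutputStream}

object BatchedWriteSketch {
  // Output stream that counts how many write calls reach it.
  final class CountingStream extends OutputStream {
    val sink = new ByteArrayOutputStream()
    var writeCalls = 0
    override def write(b: Int): Unit = { writeCalls += 1; sink.write(b) }
    override def write(b: Array[Byte], off: Int, len: Int): Unit = {
      writeCalls += 1
      sink.write(b, off, len)
    }
  }

  // Baseline: one write call per record.
  def writeOneByOne(records: Seq[Array[Byte]], out: OutputStream): Unit =
    records.foreach(r => out.write(r, 0, r.length))

  // Batched: copy records into a fixed-size buffer and flush it when full.
  def writeBatched(records: Seq[Array[Byte]], out: OutputStream, bufferSize: Int): Unit = {
    val buffer = new Array[Byte](bufferSize)
    var used = 0
    def flush(): Unit = if (used > 0) { out.write(buffer, 0, used); used = 0 }
    records.foreach { r =>
      if (r.length > bufferSize) {
        // Oversized record: flush what we have, then write it directly.
        flush()
        out.write(r, 0, r.length)
      } else {
        if (used + r.length > bufferSize) flush()
        System.arraycopy(r, 0, buffer, used, r.length)
        used += r.length
      }
    }
    flush()
  }
}
```

Note that batching trades fewer write calls for an extra in-memory copy per record, so whether it helps depends on where the existing copies happen.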

## How was this patch tested?
Existing unit tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/10110346/spark reducewrite

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22163.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22163


commit 671268b679f9221fd96e9ab2ea929df4a9908de8
Author: liuxian 
Date:   2018-08-21T02:42:30Z

fix




---




[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22112
  
**[Test build #94988 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94988/testReport)**
 for PR 22112 at commit 
[`4f8e24d`](https://github.com/apache/spark/commit/4f8e24d33e6df2c60740a6c4d0ebec4db4123f5b).


---




[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2348/
Test PASSed.


---




[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...

2018-08-20 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/spark/pull/22138
  
@koeninger 
Yeah, I see what you're saying. In that case, IMHO, isolating consumers per 
query sounds better than the alternatives. Adding the next offset to the cache 
key would make a consumer move to a different bucket in the cache every time it 
is used, which is not the expected behavior for a general pool solution, so we 
would have to reinvent the wheel (and it is not an ideal situation for caching, 
either).

Apache Commons Pool runs an eviction thread in the background, so we could 
close consumers that have been idle for a long time (say 5 minutes or more). 
That's another benefit of adopting Apache Commons Pool (and probably most 
general pool solutions): we could also eventually evict cached consumers whose 
topic or partition is removed while the query is running. Consumers would be 
evicted not only because the cache capacity is exceeded, but also because of 
inactivity.
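
To illustrate idle-time eviction in isolation, here is a minimal sketch that does not use the Apache Commons Pool API; `IdleEvictingCache` and its methods are hypothetical, and the caller passes the current time explicitly instead of running a background thread.

```scala
object IdleEvictionSketch {
  // Hypothetical cache that closes entries unused for at least maxIdleMillis.
  final class IdleEvictingCache[K, V](maxIdleMillis: Long, close: V => Unit) {
    private case class Entry(value: V, var lastUsed: Long)
    private val entries = scala.collection.mutable.Map.empty[K, Entry]

    // Returns the cached value for `key`, creating it on first use.
    def borrow(key: K, now: Long)(create: => V): V = {
      val e = entries.getOrElseUpdate(key, Entry(create, now))
      e.lastUsed = now
      e.value
    }

    // What a periodic eviction pass would do: close and drop idle entries.
    def evictIdle(now: Long): Int = {
      val idle = entries.filter { case (_, e) => now - e.lastUsed >= maxIdleMillis }.keys.toList
      idle.foreach(k => entries.remove(k).foreach(e => close(e.value)))
      idle.size
    }

    def size: Int = entries.size
  }
}
```

A consumer for a topic or partition that has been removed is simply never borrowed again, so it ages out on a later eviction pass without any special handling.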


---




[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22161: [SPARKR][TEST][MINOR] Minor fixes for R sql tests

2018-08-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22161
  
Eh, @dilipbiswal, actually can we file a JIRA?


---




[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21306
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94982/
Test PASSed.


---




[GitHub] spark pull request #22148: [SPARK-25132][SQL] Case-insensitive field resolut...

2018-08-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22148


---




[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21306
  
**[Test build #94982 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94982/testReport)**
 for PR 21306 at commit 
[`6b45a11`](https://github.com/apache/spark/commit/6b45a119df8e6382fa2503f854b4a85aed3e3785).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  abstract static class SingleColumnTransform implements 
PartitionTransform `


---




[GitHub] spark issue #22148: [SPARK-25132][SQL] Case-insensitive field resolution whe...

2018-08-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22148
  
Merged to master.


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21320
  
**[Test build #94987 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94987/testReport)**
 for PR 21320 at commit 
[`97b3a51`](https://github.com/apache/spark/commit/97b3a51d478f19890ded73aa78d94c055a9f144c).


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21320
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2347/
Test PASSed.


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21320
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21320
  
retest this please


---




[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22123


---




[GitHub] spark issue #22123: [SPARK-25134][SQL] Csv column pruning with checking of h...

2018-08-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22123
  
Merged to master.


---




[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21669
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...

2018-08-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21669
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94979/
Test PASSed.


---




[GitHub] spark issue #22133: [SPARK-25129][SQL]Make the mapping of com.databricks.spa...

2018-08-20 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22133
  
Seems fine otherwise.


---




[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...

2018-08-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21669
  
**[Test build #94979 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94979/testReport)**
 for PR 21669 at commit 
[`4a000d2`](https://github.com/apache/spark/commit/4a000d2abda968a28f419d21418f61e2f53355fc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #22133: [SPARK-25129][SQL]Make the mapping of com.databri...

2018-08-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22133#discussion_r211460921
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -626,6 +626,7 @@ object DataSource extends Logging {
   
serviceLoader.asScala.filter(_.shortName().equalsIgnoreCase(provider1)).toList 
match {
 // the provider format did not match any given registered aliases
 case Nil =>
+  val latestDocsURL = "https://spark.apache.org/docs/latest;
--- End diff --

I would actually avoid leaving the explicit doc link because we will have 
to fix it for every release. Just prose should be good enough.


---



