[GitHub] spark issue #22530: [SPARK-24869][SQL] Fix SaveIntoDataSourceCommand's input...

2018-10-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22530
  
**[Test build #98183 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98183/testReport)** for PR 22530 at commit [`9b1cc1d`](https://github.com/apache/spark/commit/9b1cc1d826cb89f0ed6021ae6c8cddc978c0173e).


---




[GitHub] spark issue #22861: [SPARK-25663][SPARK-25661][SQL][TEST] Refactor BuiltInDa...

2018-10-28 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22861
  
Personally I am against accessing the main args in such a way; it looks a bit ugly. But if we have to move everything to `BenchmarkBase`, then maybe this is the way.


---




[GitHub] spark pull request #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFi...

2018-10-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22865


---




[GitHub] spark issue #22530: [SPARK-24869][SQL] Fix SaveIntoDataSourceCommand's input...

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22530
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22530: [SPARK-24869][SQL] Fix SaveIntoDataSourceCommand's input...

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22530
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4580/
Test PASSed.


---




[GitHub] spark issue #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFilter.en...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22865
  
thanks, merging to master/2.4/2.3!


---




[GitHub] spark issue #22530: [SPARK-24869][SQL] Fix SaveIntoDataSourceCommand's input...

2018-10-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22530
  
retest this please


---




[GitHub] spark issue #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFilter.en...

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22865
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFilter.en...

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22865
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98176/
Test PASSed.


---




[GitHub] spark issue #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFilter.en...

2018-10-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22865
  
**[Test build #98176 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98176/testReport)** for PR 22865 at commit [`de22015`](https://github.com/apache/spark/commit/de22015a9b610011ff5616d398b999d39d21eeba).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22326: [SPARK-25314][SQL] Fix Python UDF accessing attributes f...

2018-10-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22326
  
late LGTM


---




[GitHub] spark issue #22847: [SPARK-25850][SQL] Make the split threshold for the code...

2018-10-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22847
  
**[Test build #98182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98182/testReport)** for PR 22847 at commit [`b0ce2ca`](https://github.com/apache/spark/commit/b0ce2cae731620a0ce7417b2b46c24cffb023059).


---




[GitHub] spark issue #22721: [SPARK-25403][SQL] Refreshes the table after inserting t...

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22721
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4579/
Test PASSed.


---




[GitHub] spark issue #22721: [SPARK-25403][SQL] Refreshes the table after inserting t...

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22721
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22870
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22721: [SPARK-25403][SQL] Refreshes the table after inserting t...

2018-10-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22721
  
**[Test build #98181 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98181/testReport)** for PR 22721 at commit [`6c8a73f`](https://github.com/apache/spark/commit/6c8a73f0fe74f618b429dee23869a00e706b125d).


---




[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22870
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4578/
Test PASSed.


---




[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...

2018-10-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22870
  
**[Test build #98180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98180/testReport)** for PR 22870 at commit [`23d31bb`](https://github.com/apache/spark/commit/23d31bb7c55b228d748c8911e2c730db9b59552e).


---




[GitHub] spark issue #22721: [SPARK-25403][SQL] Refreshes the table after inserting t...

2018-10-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22721
  
Retest this please.


---




[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable

2018-10-28 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22830
  
Who introduced this? We should ask the person that introduced it whether it 
can be removed. 


---




[GitHub] spark issue #22813: [SPARK-25818][CORE] WorkDirCleanup should only remove th...

2018-10-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22813
  
@ouyangxiaochen. Sorry, but the use case sounds like a misconfiguration.


---




[GitHub] spark pull request #22830: [SPARK-25838][ML] Remove formatVersion from Savea...

2018-10-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22830#discussion_r228800501
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -163,8 +163,6 @@ class LogisticRegressionModel @Since("1.3.0") (
   numFeatures, numClasses, weights, intercept, threshold)
   }
 
-  override protected def formatVersion: String = "1.0"
-
--- End diff --

This seems visible to users. Although this occurs in Spark 3.0.0, I guess we had better go through a deprecation process first for this kind of removal.

cc @srowen , @mengxr , @rxin .


---




[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable

2018-10-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22830
  
**[Test build #98179 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98179/testReport)** for PR 22830 at commit [`e9d9e0e`](https://github.com/apache/spark/commit/e9d9e0eb6a7caf0b3fda27bfb3d03fa419e4487e).


---




[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22830
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4577/
Test PASSed.


---




[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22830
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22830: [SPARK-25838][ML] Remove formatVersion from Saveable

2018-10-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22830
  
Retest this please.


---




[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22784
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98178/
Test PASSed.


---




[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22784
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...

2018-10-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22784
  
**[Test build #98178 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98178/testReport)** for PR 22784 at commit [`0effc85`](https://github.com/apache/spark/commit/0effc85ccfc831bcc4c469b4a4c1d8db26fab72e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22851: [SPARK-25797][SQL][DOCS][BACKPORT-2.3] Add migration doc...

2018-10-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22851
  
Thank you, @seancxmao and @felixcheung.

@seancxmao, please close this PR since it's merged now.


---




[GitHub] spark issue #22851: [SPARK-25797][SQL][DOCS][BACKPORT-2.3] Add migration doc...

2018-10-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22851
  
Merged to `branch-2.3`.


---




[GitHub] spark issue #22861: [SPARK-25663][SPARK-25661][SQL][TEST] Refactor BuiltInDa...

2018-10-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22861
  
At least, the following is worth a separate PR, because it's orthogonal to `Refactor ... to use main method`:
```
1. Make mainArgs correctly set in BenchmarkBase.
```

One PR should have one theme; putting different themes into a single PR is not good practice.

Please start with the minimal change. If the committer asks for an example, add it later.
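
For illustration, a minimal sketch of the `mainArgs` idea discussed here; the names below are assumptions, not the actual Spark code:

```scala
// Hypothetical harness: main() captures the command-line arguments and
// hands them to the suite, so individual benchmarks no longer define
// their own main methods. Names are illustrative only.
abstract class BenchmarkBase {

  // Subclasses implement the benchmark logic against the forwarded args.
  def runBenchmarkSuite(mainArgs: Array[String]): Unit

  final def main(args: Array[String]): Unit = {
    // "Make mainArgs correctly set": forward the raw args unchanged.
    runBenchmarkSuite(args)
  }
}
```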


---




[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-10-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22514
  
I see. Thank you for confirmation, @gatorsmile and @cloud-fan .


---




[GitHub] spark pull request #22847: [SPARK-25850][SQL] Make the split threshold for t...

2018-10-28 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22847#discussion_r228789484
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,17 @@ object SQLConf {
 .intConf
 .createWithDefault(65535)
 
+  val CODEGEN_METHOD_SPLIT_THRESHOLD = 
buildConf("spark.sql.codegen.methodSplitThreshold")
+.internal()
+.doc("The threshold of source code length without comment of a single 
Java function by " +
+  "codegen to be split. When the generated Java function source code 
exceeds this threshold" +
+  ", it will be split into multiple small functions. We can't know how 
many bytecode will " +
--- End diff --

Not a big deal, but I would avoid abbreviations in documentation: `can't` -> `cannot`.
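
As a quick illustration, assuming this PR lands, the new key from the diff above could be exercised like any other SQL conf (the value here is just an example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Internal confs can still be set explicitly for experiments. Generated
// functions whose source length exceeds this threshold would be split
// into smaller ones, keeping each under HotSpot's 8KB bytecode JIT limit.
spark.conf.set("spark.sql.codegen.methodSplitThreshold", "2048")
println(spark.conf.get("spark.sql.codegen.methodSplitThreshold"))
```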


---




[GitHub] spark issue #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535 column...

2018-10-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22784
  
**[Test build #98178 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98178/testReport)** for PR 22784 at commit [`0effc85`](https://github.com/apache/spark/commit/0effc85ccfc831bcc4c469b4a4c1d8db26fab72e).


---




[GitHub] spark pull request #22288: [SPARK-22148][SPARK-15815][Scheduler] Acquire new...

2018-10-28 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/22288#discussion_r228788350
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -415,9 +420,54 @@ private[spark] class TaskSchedulerImpl(
 launchedAnyTask |= launchedTaskAtCurrentMaxLocality
   } while (launchedTaskAtCurrentMaxLocality)
 }
+
 if (!launchedAnyTask) {
-  taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
+  taskSet.getCompletelyBlacklistedTaskIfAny(hostToExecutors) match 
{
+case Some(taskIndex) => // Returns the taskIndex which was 
unschedulable
+
+  // If the taskSet is unschedulable we try to find an 
existing idle blacklisted
+  // executor. If we cannot find one, we abort immediately. 
Else we kill the idle
+  // executor and kick off an abortTimer which if it doesn't 
schedule a task within the
+  // the timeout will abort the taskSet if we were unable to 
schedule any task from the
+  // taskSet.
+  // Note 1: We keep track of schedulability on a per taskSet 
basis rather than on a per
+  // task basis.
+  // Note 2: The taskSet can still be aborted when there are 
more than one idle
+  // blacklisted executors and dynamic allocation is on. This 
can happen when a killed
+  // idle executor isn't replaced in time by 
ExecutorAllocationManager as it relies on
+  // pending tasks and doesn't kill executors on idle 
timeouts, resulting in the abort
+  // timer to expire and abort the taskSet.
+  executorIdToRunningTaskIds.find(x => !isExecutorBusy(x._1)) 
match {
+case Some ((executorId, _)) =>
+  if (!unschedulableTaskSetToExpiryTime.contains(taskSet)) 
{
+blacklistTrackerOpt.foreach(blt => 
blt.killBlacklistedIdleExecutor(executorId))
+
+val timeout = 
conf.get(config.UNSCHEDULABLE_TASKSET_TIMEOUT) * 1000
+unschedulableTaskSetToExpiryTime(taskSet) = 
clock.getTimeMillis() + timeout
+logInfo(s"Waiting for $timeout ms for completely "
+  + s"blacklisted task to be schedulable again before 
aborting $taskSet.")
+abortTimer.schedule(
+  createUnschedulableTaskSetAbortTimer(taskSet, 
taskIndex), timeout)
+  }
+case _ => // Abort Immediately
+  logInfo("Cannot schedule any task because of complete 
blacklisting. No idle" +
+s" executors can be found to kill. Aborting $taskSet." 
)
+  taskSet.abortSinceCompletelyBlacklisted(taskIndex)
+  }
+case _ => // Do nothing if no tasks completely blacklisted.
+  }
+} else {
+  // We want to defer killing any taskSets as long as we have a 
non blacklisted executor
+  // which can be used to schedule a task from any active 
taskSets. This ensures that the
+  // job can make progress and if we encounter a flawed taskSet it 
will eventually either
+  // fail or abort due to being completely blacklisted.
--- End diff --

So it's a bit worse than regular starvation from having competing task sets, as in this case you might actually have resources available on your cluster, but you never ask for them, because the executor allocation manager thinks you have enough based on the number of pending tasks.

In any case, I agree this is a stretch, and overall it's an improvement, so I'm OK with it.


---




[GitHub] spark pull request #22784: [SPARK-25790][MLLIB] PCA: Support more than 65535...

2018-10-28 Thread shahidki31
Github user shahidki31 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22784#discussion_r228788331
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -49,7 +50,16 @@ class PCA @Since("1.4.0") (@Since("1.4.0") val k: Int) {
   "Try reducing the parameter k for PCA, or reduce the input feature " 
+
   "vector dimension to make this tractable.")
 
-val mat = new RowMatrix(sources)
+val mat = if (numFeatures > 65535) {
+  val meanVector = Statistics.colStats(sources).mean
--- End diff --

Thank you, @srowen. I have updated.
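
For reference, the `colStats` step from the diff in isolation, as a sketch assuming an existing `SparkContext` named `sc`:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

// Column means computed in a single pass; for very wide data (> 65535
// columns, the case this PR targets) this avoids materializing a Gramian
// just to obtain the means.
val sources: RDD[Vector] = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0)))
val meanVector = Statistics.colStats(sources).mean
```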


---




[GitHub] spark issue #22624: [SPARK-23781][CORE] Merge token renewer functionality in...

2018-10-28 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/22624
  
one minor comment about the `start` api, but otherwise lgtm from the yarn 
side.  would need a bit more time to look at the other cluster managers if 
nobody else can vouch for those.


---




[GitHub] spark pull request #22847: [SPARK-25850][SQL] Make the split threshold for t...

2018-10-28 Thread yucai
Github user yucai commented on a diff in the pull request:

https://github.com/apache/spark/pull/22847#discussion_r228788123
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -812,6 +812,17 @@ object SQLConf {
 .intConf
 .createWithDefault(65535)
 
+  val CODEGEN_METHOD_SPLIT_THRESHOLD = 
buildConf("spark.sql.codegen.methodSplitThreshold")
+.internal()
+.doc("The threshold of source code length without comment of a single 
Java function by " +
+  "codegen to be split. When the generated Java function source code 
exceeds this threshold" +
+  ", it will be split into multiple small functions. We can't know how 
many bytecode will " +
+  "be generated, so use the code length as metric. A function's 
bytecode should not go " +
+  "beyond 8KB, otherwise it will not be JITted; it also should not be 
too small, otherwise " +
+  "there will be many function calls.")
+.intConf
--- End diff --

To be more accurate, I think I should add `When running on HotSpot, a function's bytecode should not go beyond 8KB...`.



---




[GitHub] spark pull request #22624: [SPARK-23781][CORE] Merge token renewer functiona...

2018-10-28 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/22624#discussion_r228788072
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala
 ---
@@ -17,76 +17,175 @@
 
 package org.apache.spark.deploy.security
 
+import java.io.File
+import java.security.PrivilegedExceptionAction
+import java.util.concurrent.{ScheduledExecutorService, TimeUnit}
+import java.util.concurrent.atomic.AtomicReference
+
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.FileSystem
-import org.apache.hadoop.security.Credentials
+import org.apache.hadoop.security.{Credentials, UserGroupInformation}
 
 import org.apache.spark.SparkConf
+import org.apache.spark.deploy.SparkHadoopUtil
 import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config._
+import org.apache.spark.rpc.RpcEndpointRef
+import 
org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages.UpdateDelegationTokens
+import org.apache.spark.ui.UIUtils
+import org.apache.spark.util.ThreadUtils
 
 /**
- * Manages all the registered HadoopDelegationTokenProviders and offer 
APIs for other modules to
- * obtain delegation tokens and their renewal time. By default 
[[HadoopFSDelegationTokenProvider]],
- * [[HiveDelegationTokenProvider]] and [[HBaseDelegationTokenProvider]] 
will be loaded in if not
- * explicitly disabled.
+ * Manager for delegation tokens in a Spark application.
+ *
+ * This manager has two modes of operation:
+ *
+ * 1.  When configured with a principal and a keytab, it will make sure 
long-running apps can run
+ * without interruption while accessing secured services. It periodically 
logs in to the KDC with
+ * user-provided credentials, and contacts all the configured secure 
services to obtain delegation
+ * tokens to be distributed to the rest of the application.
+ *
+ * Because the Hadoop UGI API does not expose the TTL of the TGT, a 
configuration controls how often
+ * to check that a relogin is necessary. This is done reasonably often 
since the check is a no-op
+ * when the relogin is not yet needed. The check period can be overridden 
in the configuration.
  *
- * Also, each HadoopDelegationTokenProvider is controlled by
- * spark.security.credentials.{service}.enabled, and will not be loaded if 
this config is set to
- * false. For example, Hive's delegation token provider 
[[HiveDelegationTokenProvider]] can be
- * enabled/disabled by the configuration 
spark.security.credentials.hive.enabled.
+ * New delegation tokens are created once 75% of the renewal interval of 
the original tokens has
+ * elapsed. The new tokens are sent to the Spark driver endpoint once it's 
registered with the AM.
+ * The driver is tasked with distributing the tokens to other processes 
that might need them.
  *
- * @param sparkConf Spark configuration
- * @param hadoopConf Hadoop configuration
- * @param fileSystems Delegation tokens will be fetched for these Hadoop 
filesystems.
+ * 2. When operating without an explicit principal and keytab, token 
renewal will not be available.
+ * Starting the manager will distribute an initial set of delegation 
tokens to the provided Spark
+ * driver, but the app will not get new tokens when those expire.
+ *
+ * It can also be used just to create delegation tokens, by calling the 
`obtainDelegationTokens`
+ * method. This option does not require calling the `start` method, but 
leaves it up to the
+ * caller to distribute the tokens that were generated.
  */
 private[spark] class HadoopDelegationTokenManager(
-sparkConf: SparkConf,
-hadoopConf: Configuration,
-fileSystems: Configuration => Set[FileSystem])
-  extends Logging {
+protected val sparkConf: SparkConf,
+protected val hadoopConf: Configuration) extends Logging {
 
   private val deprecatedProviderEnabledConfigs = List(
 "spark.yarn.security.tokens.%s.enabled",
 "spark.yarn.security.credentials.%s.enabled")
   private val providerEnabledConfig = 
"spark.security.credentials.%s.enabled"
 
-  // Maintain all the registered delegation token providers
-  private val delegationTokenProviders = getDelegationTokenProviders
+  private val principal = sparkConf.get(PRINCIPAL).orNull
+  private val keytab = sparkConf.get(KEYTAB).orNull
+
+  require((principal == null) == (keytab == null),
+"Both principal and keytab must be defined, or neither.")
+  require(keytab == null || new File(keytab).isFile(), s"Cannot find 
keytab at $keytab.")
+
+  private val delegationTokenProviders = loadProviders()
   logDebug("Using the following builtin delegation token providers: " +

[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22871
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98177/
Test PASSed.


---




[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...

2018-10-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22871
  
**[Test build #98177 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98177/testReport)** for PR 22871 at commit [`34109bc`](https://github.com/apache/spark/commit/34109bc63dd12f733cf2052b048093799f9b0102).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22871
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #22666: [SPARK-25672][SQL] schema_of_csv() - schema infer...

2018-10-28 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22666#discussion_r228787126
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala
 ---
@@ -19,14 +19,39 @@ package org.apache.spark.sql.catalyst.expressions
 
 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.util.ArrayBasedMapData
-import org.apache.spark.sql.types.{MapType, StringType, StructType}
+import org.apache.spark.sql.types.{DataType, MapType, StringType, 
StructType}
+import org.apache.spark.unsafe.types.UTF8String
 
 object ExprUtils {
 
-  def evalSchemaExpr(exp: Expression): StructType = exp match {
-case Literal(s, StringType) => StructType.fromDDL(s.toString)
+  def evalSchemaExpr(exp: Expression): StructType = {
+// Use `DataType.fromDDL` since the type string can be struct<...>.
+val dataType = exp match {
+  case Literal(s, StringType) =>
+DataType.fromDDL(s.toString)
+  case e @ SchemaOfCsv(_: Literal, _) =>
+val ddlSchema = e.eval(EmptyRow).asInstanceOf[UTF8String]
+DataType.fromDDL(ddlSchema.toString)
+  case e => throw new AnalysisException(
+"Schema should be specified in DDL format as a string literal or 
output of " +
+  s"the schema_of_csv function instead of ${e.sql}")
+}
+
+if (!dataType.isInstanceOf[StructType]) {
+  throw new AnalysisException(
+s"Schema should be struct type but got ${dataType.sql}.")
+}
+dataType.asInstanceOf[StructType]
+  }
+
+  def evalTypeExpr(exp: Expression): DataType = exp match {
+case Literal(s, StringType) => DataType.fromDDL(s.toString)
--- End diff --

Yup, I initially thought we should allow constant-foldable expressions as well, but decided to follow the initial intent of literal-only support. I also wasn't sure when we would need constant folding to construct a JSON example, because I suspect that's usually copied and pasted from, for instance, a file.


---




[GitHub] spark pull request #22666: [SPARK-25672][SQL] schema_of_csv() - schema infer...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22666#discussion_r228787018
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/csv-functions.sql ---
@@ -7,3 +7,11 @@ select from_csv('1', 'a InvalidType');
 select from_csv('1', 'a INT', named_struct('mode', 'PERMISSIVE'));
 select from_csv('1', 'a INT', map('mode', 1));
 select from_csv();
+-- infer schema of json literal
+select from_csv('1,abc', schema_of_csv('1,abc'));
+select schema_of_csv('1|abc', map('delimiter', '|'));
+select schema_of_csv(null);
+CREATE TEMPORARY VIEW csvTable(csvField, a) AS SELECT * FROM VALUES 
('1,abc', 'a');
+SELECT schema_of_csv(csvField) FROM csvTable;
+-- Clean up
+DROP VIEW IF EXISTS csvTable;
--- End diff --

actually we don't need to clean up temp views. The golden file test is run 
with a fresh session.


---




[GitHub] spark pull request #22666: [SPARK-25672][SQL] schema_of_csv() - schema infer...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22666#discussion_r228786427
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala
 ---
@@ -19,14 +19,39 @@ package org.apache.spark.sql.catalyst.expressions
 
 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.util.ArrayBasedMapData
-import org.apache.spark.sql.types.{MapType, StringType, StructType}
+import org.apache.spark.sql.types.{DataType, MapType, StringType, 
StructType}
+import org.apache.spark.unsafe.types.UTF8String
 
 object ExprUtils {
 
-  def evalSchemaExpr(exp: Expression): StructType = exp match {
-case Literal(s, StringType) => StructType.fromDDL(s.toString)
+  def evalSchemaExpr(exp: Expression): StructType = {
+// Use `DataType.fromDDL` since the type string can be struct<...>.
+val dataType = exp match {
+  case Literal(s, StringType) =>
+DataType.fromDDL(s.toString)
+  case e @ SchemaOfCsv(_: Literal, _) =>
+val ddlSchema = e.eval(EmptyRow).asInstanceOf[UTF8String]
+DataType.fromDDL(ddlSchema.toString)
+  case e => throw new AnalysisException(
+"Schema should be specified in DDL format as a string literal or 
output of " +
+  s"the schema_of_csv function instead of ${e.sql}")
+}
+
+if (!dataType.isInstanceOf[StructType]) {
+  throw new AnalysisException(
+s"Schema should be struct type but got ${dataType.sql}.")
+}
+dataType.asInstanceOf[StructType]
+  }
+
+  def evalTypeExpr(exp: Expression): DataType = exp match {
+case Literal(s, StringType) => DataType.fromDDL(s.toString)
--- End diff --

we also need to update https://github.com/apache/spark/pull/22666/files#diff-5321c01e95bffc4413c5f3457696213eR157 in case the constant folding rule is disabled.


---




[GitHub] spark pull request #22666: [SPARK-25672][SQL] schema_of_csv() - schema infer...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22666#discussion_r228785835
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala
 ---
@@ -19,14 +19,39 @@ package org.apache.spark.sql.catalyst.expressions
 
 import org.apache.spark.sql.AnalysisException
 import org.apache.spark.sql.catalyst.util.ArrayBasedMapData
-import org.apache.spark.sql.types.{MapType, StringType, StructType}
+import org.apache.spark.sql.types.{DataType, MapType, StringType, 
StructType}
+import org.apache.spark.unsafe.types.UTF8String
 
 object ExprUtils {
 
-  def evalSchemaExpr(exp: Expression): StructType = exp match {
-case Literal(s, StringType) => StructType.fromDDL(s.toString)
+  def evalSchemaExpr(exp: Expression): StructType = {
+// Use `DataType.fromDDL` since the type string can be struct<...>.
+val dataType = exp match {
+  case Literal(s, StringType) =>
+DataType.fromDDL(s.toString)
+  case e @ SchemaOfCsv(_: Literal, _) =>
+val ddlSchema = e.eval(EmptyRow).asInstanceOf[UTF8String]
+DataType.fromDDL(ddlSchema.toString)
+  case e => throw new AnalysisException(
+"Schema should be specified in DDL format as a string literal or 
output of " +
+  s"the schema_of_csv function instead of ${e.sql}")
+}
+
+if (!dataType.isInstanceOf[StructType]) {
+  throw new AnalysisException(
+s"Schema should be struct type but got ${dataType.sql}.")
+}
+dataType.asInstanceOf[StructType]
+  }
+
+  def evalTypeExpr(exp: Expression): DataType = exp match {
+case Literal(s, StringType) => DataType.fromDDL(s.toString)
--- End diff --

how about
```
if (expr.isFoldable && expr.dataType == StringType) {
  DataType.fromDDL(expr.eval().asInstanceOf[UTF8String].toString)
}
```
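
Spelled out against the surrounding code, that suggestion could look like this (a sketch only, not necessarily the final shape; null handling is elided):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.{DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Accept any foldable string expression, not just a Literal: evaluate it
// at analysis time and parse the result as a DDL type string.
def evalTypeExpr(exp: Expression): DataType = {
  if (exp.foldable && exp.dataType == StringType) {
    DataType.fromDDL(exp.eval().asInstanceOf[UTF8String].toString)
  } else {
    throw new AnalysisException(
      "Schema should be specified in DDL format as a string literal or " +
        s"output of the schema_of_csv function instead of ${exp.sql}")
  }
}
```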


---




[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...

2018-10-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22871
  
**[Test build #98177 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98177/testReport)** for PR 22871 at commit [`34109bc`](https://github.com/apache/spark/commit/34109bc63dd12f733cf2052b048093799f9b0102).


---




[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22871
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/4576/
Test PASSed.


---




[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22871
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...

2018-10-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21732#discussion_r228785348
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -17,6 +17,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - The `ADD JAR` command previously returned a result set with the single 
value 0. It now returns an empty result set.
 
+  - In Spark version 2.4 and earlier, `Dataset` doesn't support encoding `Option[Product]` as a top-level row, because in Spark SQL an entire top-level row can't be null. Since Spark 3.0, `Option[Product]` at the top level is encoded as a row with a single struct column. With this support, `Aggregator` can also use `Option[Product]` as its buffer and output column types.
--- End diff --

Ok. Let me revert this.
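
For context, the behavior described in that note, as a usage sketch assuming the feature from this PR (the resulting column name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

case class Point(x: Int, y: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// With this change, Option[Product] is allowed at the top level and is
// encoded as a single nullable struct column; None becomes a null value.
val ds = Seq(Some(Point(1, 2)), None).toDS()
ds.printSchema()
ds.show()
```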


---




[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...

2018-10-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21732#discussion_r228785236
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala
 ---
@@ -76,7 +76,7 @@ object TypedAggregateExpression {
 None,
 bufferSerializer,
 bufferEncoder.resolveAndBind().deserializer,
-outputEncoder.serializer,
+outputEncoder.objSerializer,
--- End diff --

This is required. Without it, the output schema of an aggregator using `Option[Product]` as its output encoder is not correct.


---




[GitHub] spark issue #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType support ...

2018-10-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22871
  
cc @BryanCutler and @gatorsmile.


---




[GitHub] spark pull request #22871: [SPARK-25179][PYTHON][DOCS] Document BinaryType s...

2018-10-28 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/22871

[SPARK-25179][PYTHON][DOCS] Document BinaryType support in Arrow conversion

## What changes were proposed in this pull request?

This PR aims to document `BinaryType` support in "Apache Arrow in Spark".

## How was this patch tested?

Manually built the documentation and checked.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-25179

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22871.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22871


commit 34109bc63dd12f733cf2052b048093799f9b0102
Author: hyukjinkwon 
Date:   2018-10-29T02:53:18Z

Document BinaryType support in Arrow conversion




---




[GitHub] spark pull request #22849: [SPARK-25852][Core] we should filter the workOffe...

2018-10-28 Thread zuotingbing
Github user zuotingbing commented on a diff in the pull request:

https://github.com/apache/spark/pull/22849#discussion_r228785046
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
 ---
@@ -240,7 +240,7 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
   val taskDescs = CoarseGrainedSchedulerBackend.this.synchronized {
 // Filter out executors under killing
 val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
-val workOffers = activeExecutors.map {
+val workOffers = activeExecutors.filter(_._2.freeCores > 0).map {
--- End diff --

On our cluster there are many executors and task sets/tasks. As we know, tasks are scheduled to fill each node in a round-robin manner, once per revive interval (`spark.scheduler.revive.interval`, default "1s"). It seems to make no sense to offer tasks to executors which have no free cores.


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228784274
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -373,6 +383,32 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
 Invoke(obj, "deserialize", ObjectType(udt.userClass), path :: Nil)
 
+  case t if isValueClass(t) =>
+val (_, underlyingType) = getUnderlyingParameterOf(t)
+val underlyingClsName = getClassNameFromType(underlyingType)
+val clsName = getUnerasedClassNameFromType(t)
+val newTypePath = s"""- Scala value class: 
$clsName($underlyingClsName)""" +:
+  walkedTypePath
+
+// Nested value class is treated as its underlying type
+// because the compiler will convert value class in the schema to
+// its underlying type.
+// However, for value class that is top-level or array element,
+// if it is used as another type (e.g. as its parent trait or 
generic),
+// the compiler keeps the class so we must provide an instance of 
the
+// class too. In other cases, the compiler will handle 
wrapping/unwrapping
+// for us automatically.
+val arg = deserializerFor(underlyingType, path, newTypePath, 
Some(t))
+val isCollectionElement = lastType.exists { lt =>
+  lt <:< localTypeOf[Array[_]] || lt <:< localTypeOf[Seq[_]]
+}
+if (lastType.isEmpty || isCollectionElement) {
--- End diff --

it looks to me that we don't need `lastType`, but just a boolean parameter 
"needInstantiateValueClass".


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228783542
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -373,6 +383,32 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
 Invoke(obj, "deserialize", ObjectType(udt.userClass), path :: Nil)
 
+  case t if isValueClass(t) =>
+val (_, underlyingType) = getUnderlyingParameterOf(t)
+val underlyingClsName = getClassNameFromType(underlyingType)
+val clsName = getUnerasedClassNameFromType(t)
+val newTypePath = s"""- Scala value class: 
$clsName($underlyingClsName)""" +:
+  walkedTypePath
+
+// Nested value class is treated as its underlying type
+// because the compiler will convert value class in the schema to
+// its underlying type.
+// However, for value class that is top-level or array element,
+// if it is used as another type (e.g. as its parent trait or 
generic),
+// the compiler keeps the class so we must provide an instance of 
the
+// class too. In other cases, the compiler will handle 
wrapping/unwrapping
+// for us automatically.
+val arg = deserializerFor(underlyingType, path, newTypePath, 
Some(t))
+val isCollectionElement = lastType.exists { lt =>
+  lt <:< localTypeOf[Array[_]] || lt <:< localTypeOf[Seq[_]]
--- End diff --

how about map?


---




[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...

2018-10-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21588
  
@dongjoon-hyun and @wangyum, please fix my comment if I am wrong at any point - I believe you guys took a look at this part more than I did.


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228783130
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -184,7 +193,8 @@ object ScalaReflection extends ScalaReflection {
   private def deserializerFor(
   tpe: `Type`,
   path: Expression,
-  walkedTypePath: Seq[String]): Expression = cleanUpReflectionObjects {
+  walkedTypePath: Seq[String],
+  lastType: Option[Type]): Expression = cleanUpReflectionObjects {
--- End diff --

can we add parameter doc for it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21588: [SPARK-24590][BUILD] Make Jenkins tests passed with hado...

2018-10-28 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21588
  
> Does this upgrade Hive for execution or also for metastore? Spark 
supports virtually all Hive metastore versions out there, and a lot of 
deployments do run different versions of Spark against the same old Hive 
metastore, and it'd be bad to break connectivity to old Hive metastores.

> The execution part is a different story and we can upgrade them easily.

The upgrade basically targets upgrading Hive for execution (let me know if I am mistaken). For metastore compatibility, I believe we are able to provide metastore jars and support other Hive versions by explicitly configuring the JARs via an isolated classloader. I believe we have basic tests for different Hive versions.

I would cautiously like to raise an option: drop the built-in metastore support at 3.0 by default if the upgrade makes keeping built-in metastore support hard enough.
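
For reference, the isolated-classloader path mentioned above is driven by existing configs; a minimal sketch (the version and jar source values are examples only):

```scala
import org.apache.spark.sql.SparkSession

// Use the built-in Hive version for execution while talking to an older
// external metastore: the metastore client version and its jars are both
// configurable.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "1.2.1")
  .config("spark.sql.hive.metastore.jars", "maven") // or a classpath
  .getOrCreate()
```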


---




[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21732#discussion_r228782980
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala
 ---
@@ -76,7 +76,7 @@ object TypedAggregateExpression {
 None,
 bufferSerializer,
 bufferEncoder.resolveAndBind().deserializer,
-outputEncoder.serializer,
+outputEncoder.objSerializer,
--- End diff --

to confirm, this is a un-related change and just clean up the code?


---




[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21732#discussion_r228782790
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
 ---
@@ -362,4 +362,38 @@ class ScalaReflectionSuite extends SparkFunSuite {
 assert(numberOfCheckedArguments(deserializerFor[(java.lang.Double, 
Int)]) == 1)
 assert(numberOfCheckedArguments(deserializerFor[(java.lang.Integer, 
java.lang.Integer)]) == 0)
   }
+
+  test("SPARK-24762: serializer for Option of Product") {
--- End diff --

do we need to add tests here? `ScalaReflection` is not updated in this PR.


---




[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21732#discussion_r228782670
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
 ---
@@ -207,7 +198,7 @@ case class ExpressionEncoder[T](
   val serializer: Seq[NamedExpression] = {
 val clsName = Utils.getSimpleName(clsTag.runtimeClass)
 
-if (isSerializedAsStruct) {
+if (isSerializedAsStruct && 
!classOf[Option[_]].isAssignableFrom(clsTag.runtimeClass)) {
--- End diff --

can we make sure that other places calling `isSerializedAsStruct` don't need to check the Option type?


---




[GitHub] spark pull request #21732: [SPARK-24762][SQL] Enable Option of Product encod...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21732#discussion_r228782536
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -17,6 +17,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - The `ADD JAR` command previously returned a result set with the single 
value 0. It now returns an empty result set.
 
+  - In Spark version 2.4 and earlier, `Dataset` doesn't support encoding `Option[Product]` as a top-level row, because in Spark SQL an entire top-level row can't be null. Since Spark 3.0, `Option[Product]` at the top level is encoded as a row with a single struct column. With this support, `Aggregator` can also use `Option[Product]` as its buffer and output column types.
--- End diff --

Usually we only add a migration guide entry if something is broken and users must be aware of it when upgrading.

I think this one is not the case? It's just a new feature.


---




[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18339
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98175/
Test PASSed.


---




[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway

2018-10-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18339
  
**[Test build #98175 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98175/testReport)** for PR 18339 at commit [`ea267c6`](https://github.com/apache/spark/commit/ea267c68c805951c5ee2fb4fccd9f8fb4a288297).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18339
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18339
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98174/
Test PASSed.


---




[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway

2018-10-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18339
  
**[Test build #98174 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98174/testReport)** for PR 18339 at commit [`fa63ba7`](https://github.com/apache/spark/commit/fa63ba7197bc5b7844716cf2efbe62a3f652a7e7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18339
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #22788: [SPARK-25769][SQL]escape nested columns by backti...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22788#discussion_r228781145
  
--- Diff: 
sql/core/src/test/resources/sql-tests/results/columnresolution-negative.sql.out 
---
@@ -161,7 +161,7 @@ SELECT db1.t1.i1 FROM t1, mydb2.t1
 struct<>
 -- !query 18 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`db1.t1.i1`' given input columns: [mydb2.t1.i1, 
mydb2.t1.i1]; line 1 pos 7
+cannot resolve '`db1`.`t1`.`i1`' given input columns: [mydb2.t1.i1, 
mydb2.t1.i1]; line 1 pos 7
--- End diff --

do you think we should just make `sql` the same as `name`? It looks to me that 
`'db1.t1.i1'` is better than `` '`db1`.`t1`.`i1`' ``, as it's more compact and 
not ambiguous.
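
For illustration, a minimal spark-shell sketch of the two renderings under 
discussion:

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute

// The two candidate renderings of a qualified column reference.
val attr = UnresolvedAttribute(Seq("db1", "t1", "i1"))
attr.name  // db1.t1.i1          (compact form)
attr.sql   // `db1`.`t1`.`i1`    (backtick-quoted form)
```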


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22755: [SPARK-25755][SQL][Test] Supplementation of non-C...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22755#discussion_r228780582
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -166,6 +167,17 @@ private[sql] trait SQLTestUtilsBase
 super.withSQLConf(pairs: _*)(f)
   }
 
+  /**
+   * A helper function for turning off/on codegen.
+   */
+  protected def withCodegenTurnOffAndOn(f: String => Unit): Unit = {
--- End diff --

nit: `withWholeStageCodegenOnAndOff`
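
For reference, a minimal sketch of what such a helper could look like inside 
`SQLTestUtilsBase`, assuming the existing `withSQLConf` helper:

```scala
import org.apache.spark.sql.internal.SQLConf

// Run the test body once with whole-stage codegen off and once with it on,
// passing the current setting to the body so tests can name cases by it.
protected def withWholeStageCodegenOnAndOff(f: String => Unit): Unit = {
  Seq("false", "true").foreach { enabled =>
    withSQLConf(SQLConf.WHOLE_STAGE_CODEGEN_ENABLED.key -> enabled) {
      f(enabled)
    }
  }
}
```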


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22514#discussion_r228780430
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala
 ---
@@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand(
 
   override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] 
= {
--- End diff --

Some more thoughts:

`CreateHiveTableAsSelectCommand` just runs another command, so we will not 
get any metrics for this plan node. That's OK if we use the Hive writer, as we 
indeed can't get any metrics there (the writing is done by Hive). However, if we can 
convert and use Spark's native writer, we do have metrics. I think a better fix 
is to replace the Hive CTAS with a data source CTAS during optimization; see the 
sketch below.
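
Roughly, such a rule could look like the sketch below. This is hypothetical: 
`OptimizedCreateHiveTableAsSelectCommand` and the convertibility check are 
assumptions for illustration, not existing API.

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand

object ConvertHiveCtas extends Rule[LogicalPlan] {
  // Stub convertibility check: only Parquet/ORC serdes (an assumption).
  private def isConvertible(table: CatalogTable): Boolean = {
    val serde = table.storage.serde.getOrElse("").toLowerCase
    serde.contains("parquet") || serde.contains("orc")
  }

  override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    // Swap the Hive CTAS for a hypothetical variant that writes through
    // Spark's native data source writer, which does report metrics.
    case ctas: CreateHiveTableAsSelectCommand if isConvertible(ctas.tableDesc) =>
      OptimizedCreateHiveTableAsSelectCommand(
        ctas.tableDesc, ctas.query, ctas.outputColumnNames, ctas.mode)
  }
}
```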


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22755: [SPARK-25755][SQL][Test] Supplementation of non-CodeGen ...

2018-10-28 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22755
  
Would it be better to apply this util method to other suites as well (e.g. 
`DataFrameRangeSuite.scala` and `DataFrameAggregateSuite.scala`)?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFilter.en...

2018-10-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22865
  
**[Test build #98176 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98176/testReport)**
 for PR 22865 at commit 
[`de22015`](https://github.com/apache/spark/commit/de22015a9b610011ff5616d398b999d39d21eeba).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22514
  
It's definitely not a blocker, and we don't need to hold RC5 for it.

I think it needs a little more review, and I'm going to cut RC5 today (2.4.0 
has already been long delayed), so it's very likely we can't get it into 2.4.0.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22861: [SPARK-25663][SPARK-25661][SQL][TEST] Refactor BuiltInDa...

2018-10-28 Thread yucai
Github user yucai commented on the issue:

https://github.com/apache/spark/pull/22861
  
@dongjoon-hyun Originally, I wanted to do two things in this PR:
1. Make `mainArgs` correctly set in `BenchmarkBase`.
2. Include an example of using `mainArgs`: refactor `DataSourceWriteBenchmark` 
and `BuiltInDataSourceWriteBenchmark` to use a main method.

But refactoring `DataSourceWriteBenchmark` will in turn require refactoring 
`AvroWriteBenchmark`.

Any suggestions on how to split the PR?
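
For context, a rough sketch of item 2, assuming `BenchmarkBase` passes the 
program arguments through to `runBenchmarkSuite` and that 
`DataSourceWriteBenchmark` exposes a `runDataSourceBenchmark(format: String)` 
helper:

```scala
// Sketch only: run all built-in formats by default, or just the formats
// named on the command line when mainArgs is non-empty.
object BuiltInDataSourceWriteBenchmark extends DataSourceWriteBenchmark {
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val formats = if (mainArgs.isEmpty) {
      Seq("Parquet", "ORC", "JSON", "CSV")
    } else {
      mainArgs.toSeq
    }
    formats.foreach(runDataSourceBenchmark)
  }
}
```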



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r228779505
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -2578,4 +2578,45 @@ class DataFrameSuite extends QueryTest with 
SharedSQLContext {
 Row ("abc", 1))
 }
   }
+
+  test("SPARK-25860: Replace Literal(null, _) with FalseLiteral whenever 
possible") {
+
+def checkPlanIsEmptyLocalScan(df: DataFrame): Unit = 
df.queryExecution.executedPlan match {
--- End diff --

this assumes we run `ConvertToLocalRelation`; let's use `withSQLConf` to 
make sure this rule is enabled.
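
For example (assuming the test session excludes this rule by default):

```scala
import org.apache.spark.sql.internal.SQLConf

// Clear the excluded-rules list so ConvertToLocalRelation is guaranteed
// to run, making the empty-local-scan assertion deterministic.
withSQLConf(SQLConf.OPTIMIZER_EXCLUDED_RULES.key -> "") {
  checkPlanIsEmptyLocalScan(df)
}
```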


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r228779276
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -736,3 +736,65 @@ object CombineConcats extends Rule[LogicalPlan] {
   flattenConcats(concat)
   }
 }
+
+/**
+ * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further 
optimizations.
+ *
+ * For example, `Filter(Literal(null, _))` is equal to 
`Filter(FalseLiteral)`.
+ *
+ * Another example containing branches is `Filter(If(cond, FalseLiteral, 
Literal(null, _)))`;
+ * this can be optimized to `Filter(If(cond, FalseLiteral, 
FalseLiteral))`, and eventually
+ * `Filter(FalseLiteral)`.
+ *
+ * As a result, many unnecessary computations can be removed in the query 
optimization phase.
+ *
+ * Similarly, the same logic can be applied to conditions in [[Join]], 
predicates in [[If]],
+ * conditions in [[CaseWhen]].
+ */
+object ReplaceNullWithFalse extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case f @ Filter(cond, _) => f.copy(condition = 
replaceNullWithFalse(cond))
+case j @ Join(_, _, _, Some(cond)) => j.copy(condition = 
Some(replaceNullWithFalse(cond)))
+case p: LogicalPlan => p transformExpressions {
+  case i @ If(pred, _, _) => i.copy(predicate = 
replaceNullWithFalse(pred))
+  case CaseWhen(branches, elseValue) =>
+val newBranches = branches.map { case (cond, value) =>
+  replaceNullWithFalse(cond) -> value
+}
+CaseWhen(newBranches, elseValue)
+}
+  }
+
+  /**
+   * Recursively replaces `Literal(null, _)` with `FalseLiteral`.
+   *
+   * Note that `transformExpressionsDown` can not be used here as we must 
stop as soon as we hit
+   * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or 
`Literal(null, _)`.
+   */
+  private def replaceNullWithFalse(e: Expression): Expression = e match {
+case cw: CaseWhen if getValues(cw).forall(isNullOrBoolean) =>
+  val newBranches = cw.branches.map { case (cond, value) =>
+replaceNullWithFalse(cond) -> replaceNullWithFalse(value)
+  }
+  val newElseValue = cw.elseValue.map(replaceNullWithFalse)
+  CaseWhen(newBranches, newElseValue)
+case If(pred, trueVal, falseVal) if Seq(trueVal, 
falseVal).forall(isNullOrBoolean) =>
+  If(replaceNullWithFalse(pred), replaceNullWithFalse(trueVal), 
replaceNullWithFalse(falseVal))
+case And(left, right) =>
--- End diff --

we need to be careful here: `null && false` is `false`, and `null || true` is 
`true`. Please take a look at https://github.com/apache/spark/pull/22702
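
A quick spark-shell illustration of the three-valued logic (assuming an 
active `spark` session):

```scala
// null is not equivalent to false under And/Or:
spark.sql("SELECT null AND false").show()  // false: And collapses on false
spark.sql("SELECT null OR true").show()    // true:  Or collapses on true
spark.sql("SELECT null AND true").show()   // NULL:  the null survives
```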


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r228779125
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -736,3 +736,65 @@ object CombineConcats extends Rule[LogicalPlan] {
   flattenConcats(concat)
   }
 }
+
+/**
+ * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further 
optimizations.
+ *
+ * For example, `Filter(Literal(null, _))` is equal to 
`Filter(FalseLiteral)`.
+ *
+ * Another example containing branches is `Filter(If(cond, FalseLiteral, 
Literal(null, _)))`;
+ * this can be optimized to `Filter(If(cond, FalseLiteral, 
FalseLiteral))`, and eventually
+ * `Filter(FalseLiteral)`.
+ *
+ * As a result, many unnecessary computations can be removed in the query 
optimization phase.
+ *
+ * Similarly, the same logic can be applied to conditions in [[Join]], 
predicates in [[If]],
+ * conditions in [[CaseWhen]].
+ */
+object ReplaceNullWithFalse extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case f @ Filter(cond, _) => f.copy(condition = 
replaceNullWithFalse(cond))
+case j @ Join(_, _, _, Some(cond)) => j.copy(condition = 
Some(replaceNullWithFalse(cond)))
+case p: LogicalPlan => p transformExpressions {
+  case i @ If(pred, _, _) => i.copy(predicate = 
replaceNullWithFalse(pred))
+  case CaseWhen(branches, elseValue) =>
+val newBranches = branches.map { case (cond, value) =>
+  replaceNullWithFalse(cond) -> value
+}
+CaseWhen(newBranches, elseValue)
+}
+  }
+
+  /**
+   * Recursively replaces `Literal(null, _)` with `FalseLiteral`.
+   *
+   * Note that `transformExpressionsDown` can not be used here as we must 
stop as soon as we hit
+   * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or 
`Literal(null, _)`.
+   */
+  private def replaceNullWithFalse(e: Expression): Expression = e match {
+case cw: CaseWhen if getValues(cw).forall(isNullOrBoolean) =>
--- End diff --

this applies to `If` as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r228779097
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -736,3 +736,65 @@ object CombineConcats extends Rule[LogicalPlan] {
   flattenConcats(concat)
   }
 }
+
+/**
+ * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further 
optimizations.
+ *
+ * For example, `Filter(Literal(null, _))` is equal to 
`Filter(FalseLiteral)`.
+ *
+ * Another example containing branches is `Filter(If(cond, FalseLiteral, 
Literal(null, _)))`;
+ * this can be optimized to `Filter(If(cond, FalseLiteral, 
FalseLiteral))`, and eventually
+ * `Filter(FalseLiteral)`.
+ *
+ * As a result, many unnecessary computations can be removed in the query 
optimization phase.
+ *
+ * Similarly, the same logic can be applied to conditions in [[Join]], 
predicates in [[If]],
+ * conditions in [[CaseWhen]].
+ */
+object ReplaceNullWithFalse extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case f @ Filter(cond, _) => f.copy(condition = 
replaceNullWithFalse(cond))
+case j @ Join(_, _, _, Some(cond)) => j.copy(condition = 
Some(replaceNullWithFalse(cond)))
+case p: LogicalPlan => p transformExpressions {
+  case i @ If(pred, _, _) => i.copy(predicate = 
replaceNullWithFalse(pred))
+  case CaseWhen(branches, elseValue) =>
+val newBranches = branches.map { case (cond, value) =>
+  replaceNullWithFalse(cond) -> value
+}
+CaseWhen(newBranches, elseValue)
+}
+  }
+
+  /**
+   * Recursively replaces `Literal(null, _)` with `FalseLiteral`.
+   *
+   * Note that `transformExpressionsDown` can not be used here as we must 
stop as soon as we hit
+   * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or 
`Literal(null, _)`.
+   */
+  private def replaceNullWithFalse(e: Expression): Expression = e match {
+case cw: CaseWhen if getValues(cw).forall(isNullOrBoolean) =>
--- End diff --

actually just `cw.dataType == BooleanType`. If an expression is `NullType`, 
it should already have been replaced by a null literal.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway

2018-10-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18339
  
**[Test build #98175 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98175/testReport)**
 for PR 18339 at commit 
[`ea267c6`](https://github.com/apache/spark/commit/ea267c68c805951c5ee2fb4fccd9f8fb4a288297).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r228779010
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala
 ---
@@ -736,3 +736,65 @@ object CombineConcats extends Rule[LogicalPlan] {
   flattenConcats(concat)
   }
 }
+
+/**
+ * A rule that replaces `Literal(null, _)` with `FalseLiteral` for further 
optimizations.
+ *
+ * For example, `Filter(Literal(null, _))` is equal to 
`Filter(FalseLiteral)`.
+ *
+ * Another example containing branches is `Filter(If(cond, FalseLiteral, 
Literal(null, _)))`;
+ * this can be optimized to `Filter(If(cond, FalseLiteral, 
FalseLiteral))`, and eventually
+ * `Filter(FalseLiteral)`.
+ *
+ * As a result, many unnecessary computations can be removed in the query 
optimization phase.
+ *
+ * Similarly, the same logic can be applied to conditions in [[Join]], 
predicates in [[If]],
+ * conditions in [[CaseWhen]].
+ */
+object ReplaceNullWithFalse extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case f @ Filter(cond, _) => f.copy(condition = 
replaceNullWithFalse(cond))
+case j @ Join(_, _, _, Some(cond)) => j.copy(condition = 
Some(replaceNullWithFalse(cond)))
+case p: LogicalPlan => p transformExpressions {
+  case i @ If(pred, _, _) => i.copy(predicate = 
replaceNullWithFalse(pred))
+  case CaseWhen(branches, elseValue) =>
+val newBranches = branches.map { case (cond, value) =>
+  replaceNullWithFalse(cond) -> value
+}
+CaseWhen(newBranches, elseValue)
+}
+  }
+
+  /**
+   * Recursively replaces `Literal(null, _)` with `FalseLiteral`.
+   *
+   * Note that `transformExpressionsDown` can not be used here as we must 
stop as soon as we hit
+   * an expression that is not [[CaseWhen]], [[If]], [[And]], [[Or]] or 
`Literal(null, _)`.
+   */
+  private def replaceNullWithFalse(e: Expression): Expression = e match {
+case cw: CaseWhen if getValues(cw).forall(isNullOrBoolean) =>
--- End diff --

how about `cw.dataType == BooleanType || cw.dataType == NullType`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18339: [SPARK-21094][PYTHON] Add popen_kwargs to launch_gateway

2018-10-28 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18339
  
**[Test build #98174 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98174/testReport)**
 for PR 18339 at commit 
[`fa63ba7`](https://github.com/apache/spark/commit/fa63ba7197bc5b7844716cf2efbe62a3f652a7e7).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFi...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22865#discussion_r228776973
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -462,7 +462,7 @@ object SQLConf {
   val PARQUET_RECORD_FILTER_ENABLED = 
buildConf("spark.sql.parquet.recordLevelFilter.enabled")
 .doc("If true, enables Parquet's native record-level filtering using 
the pushed down " +
   "filters. This configuration only has an effect when 
'spark.sql.parquet.filterPushdown' " +
-  "is enabled.")
+  "is enabled and spark.sql.parquet.enableVectorizedReader is 
disabled.")
--- End diff --

SGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22817: [SPARK-25816][SQL] Fix attribute resolution in nested ex...

2018-10-28 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22817
  
RC5 will have this fix


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22870: [SPARK-25862][SQL] Remove rangeBetween APIs intro...

2018-10-28 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22870#discussion_r228776577
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
 ---
@@ -206,7 +206,7 @@ case class SpecifiedWindowFrame(
 // Check combination (of expressions).
 (lower, upper) match {
   case (l: Expression, u: Expression) if !isValidFrameBoundary(l, u) =>
-TypeCheckFailure(s"Window frame upper bound '$upper' does not 
followes the lower bound " +
+TypeCheckFailure(s"Window frame upper bound '$upper' does not 
follow the lower bound " +
--- End diff --

We need to rerun `org.apache.spark.sql.SQLQueryTestSuite`
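
(For reference, the golden result files can be regenerated with 
`SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite"` 
from the repo root.)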


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-28 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r228776349
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,9 @@ Spark SQL supports the vast majority of Hive features, 
such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both 
Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be 
able to read views created
+by Hive, users should explicitly specify column aliases in view 
definition queries.
--- End diff --

+1
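
To make the quoted guidance concrete, a small illustration with hypothetical 
table and column names:

```scala
// Without an explicit alias, Spark and Hive generate different names for
// the computed column, so a Hive-created view may not be readable by Spark:
spark.sql("CREATE VIEW v_bad AS SELECT c1 + 1 FROM t")

// With an explicit alias, the view schema is unambiguous for both engines:
spark.sql("CREATE VIEW v_good AS SELECT c1 + 1 AS c1_plus_1 FROM t")
```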


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22865: [DOC] Fix doc for spark.sql.parquet.recordLevelFi...

2018-10-28 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22865#discussion_r228776300
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -462,7 +462,7 @@ object SQLConf {
   val PARQUET_RECORD_FILTER_ENABLED = 
buildConf("spark.sql.parquet.recordLevelFilter.enabled")
 .doc("If true, enables Parquet's native record-level filtering using 
the pushed down " +
   "filters. This configuration only has an effect when 
'spark.sql.parquet.filterPushdown' " +
-  "is enabled.")
+  "is enabled and spark.sql.parquet.enableVectorizedReader is 
disabled.")
--- End diff --

SGTM


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-10-28 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22514
  
This requires more careful review. I do not think we can make it in RC5


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

2018-10-28 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22514#discussion_r228776103
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -2648,7 +2648,7 @@ class DataFrameFunctionsSuite extends QueryTest with 
SharedSQLContext {
   "transform_values(" +
 "z,(k, v) -> map_from_arrays(ARRAY(1, 2, 3), " +
 "ARRAY('one', 'two', 'three'))[k] || '_' || CAST(v AS 
String))"),
-Seq(Row(Map(1 -> "one_1.0", 2 -> "two_1.4", 3 ->"three_1.7"
+Seq(Row(Map(1 -> "one_1.0", 2 -> "two_1.4", 3 -> "three_1.7"
--- End diff --

OK. I will put those changes into a minor PR instead.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-10-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22514
  
Also, cc @gatorsmile .


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22817: [SPARK-25816][SQL] Fix attribute resolution in nested ex...

2018-10-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22817
  
Thank you, @gatorsmile !


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22817: [SPARK-25816][SQL] Fix attribute resolution in ne...

2018-10-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22817


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22817: [SPARK-25816][SQL] Fix attribute resolution in nested ex...

2018-10-28 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22817
  
LGTM

Thanks! Merged to master/2.4/2.3


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-10-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22514
  
This PR will resolve a regression that throws 
`java.lang.RuntimeException` for Parquet tables. I'm wondering if we can 
consider this for 2.4.0 RC5.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

2018-10-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22514#discussion_r228772854
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -2648,7 +2648,7 @@ class DataFrameFunctionsSuite extends QueryTest with 
SharedSQLContext {
   "transform_values(" +
 "z,(k, v) -> map_from_arrays(ARRAY(1, 2, 3), " +
 "ARRAY('one', 'two', 'three'))[k] || '_' || CAST(v AS 
String))"),
-Seq(Row(Map(1 -> "one_1.0", 2 -> "two_1.4", 3 ->"three_1.7"
+Seq(Row(Map(1 -> "one_1.0", 2 -> "two_1.4", 3 -> "three_1.7"
--- End diff --

This one, too. Let's not touch this because we didn't change anything in 
this file. It would be helpful for backporting.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

2018-10-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22514#discussion_r228772799
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala
 ---
@@ -21,7 +21,6 @@ import java.net.URI
 
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.catalog._
-import org.apache.spark.sql.catalyst.expressions.Attribute
--- End diff --

If possible, let's not touch this because we didn't change anything in this 
file. It would be helpful for backporting.
SPARK-25271 is reported as a regression in 2.3.x. I assume that we need to 
backport this for 2.4.1 and 2.3.3 at least.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

2018-10-28 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22514#discussion_r228772719
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveParquetSuite.scala ---
@@ -92,4 +92,18 @@ class HiveParquetSuite extends QueryTest with 
ParquetTest with TestHiveSingleton
   }
 }
   }
+
+  test("SPARK-25271: write empty map into hive parquet table") {
--- End diff --

Although this is the test case originally reported in SPARK-25271 (`Creating 
parquet table with all the columns null throws exception`), this PR aims to 
resolve the root cause of the more general issue: `Hive ctas commands should use data 
source if it is convertible`. Can we add more test cases outside 
`HiveParquetSuite`?
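
For example, a hypothetical sketch of one additional end-to-end case:

```scala
// Sketch: CTAS into a Hive Parquet table where every column is null, the
// originally reported failure mode in SPARK-25271.
test("SPARK-25271: CTAS with all-null columns into a Hive parquet table") {
  withTable("all_null") {
    sql("CREATE TABLE all_null STORED AS PARQUET AS SELECT CAST(null AS INT) AS i")
    checkAnswer(sql("SELECT i FROM all_null"), Row(null))
  }
}
```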


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22870: [SPARK-25862][SQL] Remove rangeBetween APIs introduced i...

2018-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22870
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98173/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


